How to create a regex that will substitute whatever that is following an email signature with an empty string?

Question

I'm trying to create a regex that replaces everything that follows an email sign-off (beginning with the person's name) with an empty string.

I use PII data as my input, so general examples can be:

  1. xxx best, [NAME] xxx
  2. xxx best. [NAME] xxx
  3. xxx best, regards, [NAME] xxx
  4. xxx
    best,
    regards,
    [NAME]
  5. xxx
    regards,
    [NAME]
  6. xxx
    best,
    regards
    [NAME]

I want to remove the [NAME] pattern and everything after it, but only if it is preceded by one (or more) of the signature patterns (for example: best, or best. or best or regards).

Of course, best and regards are not the only possible closings, and each of them can be followed by a few different punctuation suffixes.

How can I do that?

Thanks!

Here is what I have so far:

import re

text = 'bla bla sincerely, best, [NAME] bli bli.'
text = re.sub(r'((?:thanks(?: again)?(?:\.|,|!)|thanks in advance(?:\.|,|!)|many thanks(?:\.|,|!)|(?:best,)|best regards(?:\.|,|!)?|with best regards|All the best(?:\.|,|!)?|(?:sincerely(?:,|\.))|cheers(?:\.|,|!)|have a nice day(?:\.|,|!)?)\s*)+(\[NAME\].*)', r'', text, flags=re.DOTALL|re.UNICODE|re.IGNORECASE)

The wrong output is currently:

bla bla[NAME] bli bli

The desired output is:

bla bla sincerely, best,

Answer 1

Score: 1

First, extract the phrases into a dedicated list; this keeps the pattern readable and easy to maintain:

phrases = [
  'thanks',
  'thanks again',
  'thanks in advance',
  'many thanks',
  'best',
  'best regards',
  'with best regards',
  'all the best',
  'sincerely',
  'cheers',
  'have a nice day'
]

...then construct the regex from that:

import re

escaped_phrases = '|'.join(re.escape(phrase) for phrase in phrases)
regex = re.compile(
    fr'((?:(?:{escaped_phrases})[.,!]?\s*)+)\[NAME].*',
    flags=re.DOTALL | re.UNICODE | re.IGNORECASE
)

Explanation:

(                # Capturing group 1, consisting of
  (?:            #   a non-capturing group
    (?:phrases)  #     that matches one of the phrases,
    [.,!]?       #     optionally followed by '.', ',' or '!',
    \s*          #     then 0+ whitespace characters,
  )+             #   repeated 1+ times,
)                # followed by
\[NAME].*        # '[NAME]' literally and anything after it.
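
The annotated layout above can also be written directly as a pattern with `re.VERBOSE`, which lets the comments live inside the regex. This is only a sketch of that idea: the shortened phrase list and the name `verbose_regex` are illustrative, not part of the original answer. Because `re.escape` escapes spaces, multi-word phrases keep working even though verbose mode ignores unescaped whitespace.

```python
import re

# Shortened, illustrative phrase list (assumption: any subset works the same way).
phrases = ['thanks', 'best', 'best regards', 'sincerely']

# re.escape also escapes the space in 'best regards', so it survives re.VERBOSE.
alternatives = '|'.join(re.escape(p) for p in phrases)

verbose_regex = re.compile(
    fr'''
    (                       # group 1: the run of closing phrases to keep
      (?:
        (?:{alternatives})  # one of the phrases,
        [.,!]?              # optionally followed by punctuation,
        \s*                 # then any amount of whitespace
      )+                    # repeated one or more times
    )
    \[NAME\] .*             # the name placeholder and everything after it
    ''',
    flags=re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

print(repr(verbose_regex.sub(r'\1', 'bla bla sincerely, best, [NAME] bli bli.')))
```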

Since we're matching both the phrases and the name, we need to give the former back with \1:

def remove_name(text):
    return regex.sub(r'\1', text)
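
It may help to see why the capturing group has to wrap the whole repetition. The pattern in the question repeats the capturing group itself, `(...)+`, and in Python's `re` a repeated group only retains its last repetition. The two simplified patterns below are illustrative reductions, not the full patterns from the question or the answer:

```python
import re

text = 'sincerely, best, [NAME]'

# The capturing group is itself repeated: only the last repetition is kept.
repeated = re.search(r'((?:sincerely|best),\s*)+\[NAME]', text)
print(repr(repeated.group(1)))  # 'best, '

# The capturing group wraps the whole repetition: every phrase is kept.
wrapped = re.search(r'((?:(?:sincerely|best),\s*)+)\[NAME]', text)
print(repr(wrapped.group(1)))  # 'sincerely, best, '
```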

Try it:

text = 'bla bla sincerely, best, [NAME] bli bli.'

print(regex)

'''
re.compile(
  '((?:(?:thanks|thanks again|...)[.,!]?\\s*)+)\\[NAME].*',
  re.IGNORECASE | re.UNICODE | re.DOTALL
)
'''

text = regex.sub(r'\1', text)
print(text)  # 'bla bla sincerely, best, '
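
As a quick check, the same approach can be run against the six example shapes from the question. This is a sketch under one stated assumption: `'regards'` is added to the phrase list as a standalone entry, because the question's examples use it on its own while the list above only contains it inside longer phrases. The shortened list and the name `signature` are illustrative.

```python
import re

# Assumption: 'regards' is added as its own phrase; the rest is a shortened list.
phrases = ['best', 'regards', 'best regards', 'sincerely', 'cheers']

alternatives = '|'.join(re.escape(p) for p in phrases)
signature = re.compile(
    fr'((?:(?:{alternatives})[.,!]?\s*)+)\[NAME].*',
    flags=re.DOTALL | re.IGNORECASE,
)

def remove_name(text):
    # Keep the closing phrases (group 1); drop '[NAME]' and everything after it.
    return signature.sub(r'\1', text)

examples = [
    'xxx best, [NAME] xxx',
    'xxx best. [NAME] xxx',
    'xxx best, regards, [NAME] xxx',
    'xxx\nbest,\nregards,\n[NAME]',
    'xxx\nregards,\n[NAME]',
    'xxx\nbest,\nregards\n[NAME]',
    'xxx [NAME] xxx',  # no closing phrase, so this one is left unchanged
]

for example in examples:
    print(repr(remove_name(example)))
```

Note that this pattern has no word boundaries, so a phrase such as `best` would also match at the end of a longer word; wrapping the alternation in `\b` is one possible refinement if that matters for your data.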

huangapple
  • Posted on May 7, 2023 at 04:13:41
  • When reposting, please retain the link to this article: https://go.coder-hub.com/76190928.html