How to create a regex that replaces everything following an email signature with an empty string?
Question
I'm trying to create a regex that replaces whatever follows an email signature (beginning with the person's name) with an empty string.
My input is PII data, so typical examples look like this:
1) `xxx best, [NAME] xxx`
2) `xxx best. [NAME] xxx`
3) `xxx best, regards, [NAME] xxx`
4) `xxx
best,
regards,
[NAME]`
5) `xxx
regards,
[NAME]`
6) `xxx
best,
regards
[NAME]`
I want to remove the [NAME] pattern and everything after it, but only when it is preceded by one or more of the signature patterns (for example: best, or best. or best or regards).
Of course, best and regards are not the only closing examples, and each of them can be followed by a few punctuation suffixes.
How can I do that?
Thanks!
Here is what I have so far:
import re
text = 'bla bla sincerely, best, [NAME] bli bli.'
text = re.sub(r'((?:thanks(?: again)?(?:\.|,|!)|thanks in advance(?:\.|,|!)|many thanks(?:\.|,|!)|(?:best,)|best regards(?:\.|,|!)?|with best regards|All the best(?:\.|,|!)?|(?:sincerely(?:,|\.))|cheers(?:\.|,|!)|have a nice day(?:\.|,|!)?)\s*)+(\[NAME\].*)', r'', text, flags=re.DOTALL|re.UNICODE|re.IGNORECASE)
The wrong output is currently:
bla bla[NAME] bli bli
The desired output is:
bla bla sincerely, best,
Answer 1
Score: 1
First, extract the phrases to a dedicated list; this ensures readability and maintainability:
phrases = [
    'thanks',
    'thanks again',
    'thanks in advance',
    'many thanks',
    'best',
    'best regards',
    'with best regards',
    'all the best',
    'sincerely',
    'cheers',
    'have a nice day'
]
...then construct the regex from that:
import re
escaped_phrases = '|'.join(re.escape(phrase) for phrase in phrases)
regex = re.compile(
    fr'((?:(?:{escaped_phrases})[.,!]?\s*)+)\[NAME].*',
    flags=re.DOTALL | re.UNICODE | re.IGNORECASE
)
Explanation:
(                 # A capturing group consisting of
  (?:             # a non-capturing group
    (?:phrases)   # that matches one of the phrases,
    [.,!]?        # optionally followed by '.', ',' or '!',
    \s*           # then 0+ whitespace characters,
  )+              # repeated 1+ times,
)                 # then
\[NAME].*         # '[NAME]' literally and anything after that.
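The annotated layout above can also be kept in the pattern itself with `re.VERBOSE`. The following is only a sketch with a shortened phrase list; the name `verbose_regex` is made up for illustration. It relies on `re.escape` escaping the spaces inside multi-word phrases, which is what keeps them significant in verbose mode.

```python
import re

# Shortened phrase list, for illustration only.
phrases = ['thanks', 'thanks again', 'best', 'best regards', 'sincerely']
# re.escape also escapes spaces, so multi-word phrases survive re.VERBOSE.
alternatives = '|'.join(re.escape(p) for p in phrases)

verbose_regex = re.compile(
    fr'''
    (                       # group 1: the closing phrases we want to keep
      (?:
        (?:{alternatives})  # one of the phrases
        [.,!]?              # optional punctuation
        \s*                 # optional whitespace
      )+                    # one or more phrase blocks
    )
    \[NAME\]                # the literal placeholder
    .*                      # everything after it (DOTALL lets . span newlines)
    ''',
    flags=re.VERBOSE | re.DOTALL | re.IGNORECASE,
)

sample = 'ok, thanks again!\nBest regards,\n[NAME]\nAcme Inc.'
print(repr(verbose_regex.sub(r'\1', sample)))
```

Whether the verbose form is worth it is a matter of taste; it mainly helps when the pattern grows beyond what fits comfortably on one line.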
Since we're matching both the phrases and the name, we need to give the former back with \1:
def remove_name(text):
    return regex.sub(r'\1', text)
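To make the role of `\1` concrete, here is a small comparison, assuming a cut-down pattern with only two phrases. An empty replacement removes the whole match, closing phrases included, while `\1` (or an equivalent replacement function) puts group 1 back.

```python
import re

# Cut-down pattern with two phrases, for illustration only.
regex = re.compile(
    r'((?:(?:sincerely|best)[.,!]?\s*)+)\[NAME].*',
    flags=re.DOTALL | re.IGNORECASE,
)
text = 'bla bla sincerely, best, [NAME] bli bli.'

dropped = regex.sub('', text)                      # whole match removed
kept = regex.sub(r'\1', text)                      # group 1 is put back
kept_fn = regex.sub(lambda m: m.group(1), text)    # same, via a function

print(repr(dropped))
print(repr(kept))
print(repr(kept_fn))
```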
Try it:
text = 'bla bla sincerely, best, [NAME] bli bli.'
print(regex)
'''
re.compile(
'((?:(?:thanks|thanks again|...)[.,!]?\\s*)+)\\[NAME].*',
re.IGNORECASE | re.UNICODE | re.DOTALL
)
'''
text = regex.sub(r'\1', text)
print(text) # 'bla bla sincerely, best, '
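As a further check, the same approach can be run against the multi-line examples from the question. Note that `regards` on its own is not in the list above; it is added here as an assumption so that those examples are covered.

```python
import re

# 'regards' is an assumption: it is not in the answer's list, but the
# question's examples use it as a standalone closing.
phrases = ['best', 'regards', 'best regards', 'sincerely']
alternatives = '|'.join(re.escape(p) for p in phrases)
regex = re.compile(
    fr'((?:(?:{alternatives})[.,!]?\s*)+)\[NAME].*',
    flags=re.DOTALL | re.IGNORECASE,
)

def remove_name(text):
    # Keep the closing phrases (group 1); drop '[NAME]' and what follows.
    return regex.sub(r'\1', text)

samples = [
    'xxx best, [NAME] xxx',
    'xxx best. [NAME] xxx',
    'xxx\nbest,\nregards,\n[NAME]\nxxx',
    'xxx\nbest,\nregards\n[NAME]',
    'xxx [NAME] xxx',  # no closing phrase, so nothing should change
]
for sample in samples:
    print(repr(remove_name(sample)))
```

The last sample is left untouched because `[NAME]` is not preceded by any closing phrase, which is the behaviour the question asks for.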