Keep original string values after pandas.Series.str.extract() if the regex doesn't match
Question
I am trying to extract email addresses from strings. If a value is already formatted the way I expect, I want it kept as is rather than changed to NaN.
Example input
<class 'pandas.core.series.Series'>
1 <doe.b.john@gmail.com>
2 <doe.c.jane@gmail.com>
3 person.anonymous@hotmail.com
4 dent.arthur@space.com
I am using
# curr_emails is <class 'pandas.core.series.Series'>
curr_emails = curr_emails.str.extract(r"<([^<>]+)>").squeeze()  # regex extracts text between < >
I receive back
1 doe.b.john@gmail.com
2 doe.c.jane@gmail.com
3 NaN
4 NaN
But I instead would like
1 doe.b.john@gmail.com
2 doe.c.jane@gmail.com
3 person.anonymous@hotmail.com
4 dent.arthur@space.com
A similar question is posted here, but I could not seem to make it work with my current approach.
Answer 1
Score: 1
If a value has no <> pattern, you can fill the resulting NaN with the original value. Or, if the < and > appear only at the beginning and end of the email, you can just strip them:
curr_emails = (curr_emails.str.extract(r"<([^<>]+)>").squeeze()
.fillna(curr_emails))
# or
curr_emails = curr_emails.str.strip(r'<>')
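For reference, here is a minimal self-contained sketch of the fill approach, rebuilt from the sample data in the question (the index 1 to 4 is taken from the question's printout). It passes expand=False instead of calling .squeeze(), which is a swapped-in variation on the code above and not part of the original answer: with a single capture group it returns a Series directly, whereas .squeeze() would collapse a one-row result to a scalar.

```python
import pandas as pd

# Sample data reconstructed from the question.
curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "<doe.c.jane@gmail.com>",
        "person.anonymous@hotmail.com",
        "dent.arthur@space.com",
    ],
    index=[1, 2, 3, 4],
)

# expand=False returns a Series for a single capture group;
# rows without a match come back as NaN.
extracted = curr_emails.str.extract(r"<([^<>]+)>", expand=False)

# Fill the non-matching rows from the original Series (aligned on index).
cleaned = extracted.fillna(curr_emails)

print(cleaned)
```

Because fillna aligns on the index, this relies on extracted and curr_emails sharing the same index, which is the case when one is derived directly from the other.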
Answer 2
Score: 0
Try using str.replace instead of str.extract to replace < at the start of a string or > at the end of a string with '':
curr_emails.str.replace('^<|>$', '', regex=True)
0 doe.b.john@gmail.com
1 doe.c.jane@gmail.com
2 person.anonymous@hotmail.com
3 dent.arthur@space.com
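One caveat worth checking against your data: the anchored replace only touches a < at the very start and a > at the very end, so it behaves differently from extract when the brackets are embedded in a longer string. The sketch below compares the two; the third value ("John Doe <...>") is a hypothetical input that does not appear in the question, added only to show the difference.

```python
import pandas as pd

curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "person.anonymous@hotmail.com",
        "John Doe <doe.b.john@gmail.com>",  # hypothetical: display name before the address
    ]
)

# Anchored replace: removes a leading "<" and a trailing ">" only.
replaced = curr_emails.str.replace(r"^<|>$", "", regex=True)

# Extract with fallback: pulls out whatever sits between < and >,
# keeping the original value where nothing matches.
extracted = curr_emails.str.extract(r"<([^<>]+)>", expand=False).fillna(curr_emails)

print(replaced.tolist())
print(extracted.tolist())
```

If every value is either a bare address or an address wrapped in <>, as in the question, both approaches give the same result and the replace is the simpler of the two.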