Keep original string values after pandas.Series.str.extract() if the regex doesn't match

Question

I am trying to extract email addresses from strings, and I want to make sure that if the original value is already formatted the way I expect, it is not changed to NaN but is kept as is.

Example input

<class 'pandas.core.series.Series'>
1    <doe.b.john@gmail.com>
2    <doe.c.jane@gmail.com>
3    person.anonymous@hotmail.com
4    dent.arthur@space.com

I am using

# curr_emails is <class 'pandas.core.series.Series'>
curr_emails = curr_emails.str.extract(r"<([^<>]+)>").squeeze()  # regex extracts text between < >

I receive back

1    doe.b.john@gmail.com
2    doe.c.jane@gmail.com
3    NaN
4    NaN

But I instead would like

1    doe.b.john@gmail.com
2    doe.c.jane@gmail.com
3    person.anonymous@hotmail.com
4    dent.arthur@space.com

A similar question is posted here, but I could not seem to make it work with my current approach.
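
For reference, the behaviour described above can be reproduced with a minimal sketch like the one below. The Series is rebuilt by hand from the example input (the index labels 1 to 4 are taken from the printed output), and `expand=False` is used here instead of `.squeeze()` so that a Series comes back directly; the name `extracted` is only illustrative.

```python
import pandas as pd

# Example input rebuilt by hand from the question
curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "<doe.c.jane@gmail.com>",
        "person.anonymous@hotmail.com",
        "dent.arthur@space.com",
    ],
    index=[1, 2, 3, 4],
)

# With a single capture group and expand=False, str.extract returns a Series.
# Rows where the pattern does not match come back as NaN.
extracted = curr_emails.str.extract(r"<([^<>]+)>", expand=False)
print(extracted)
```

Rows 3 and 4 contain no `<...>` pair, so the pattern finds nothing to capture there and the result for those rows is missing.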

Answer 1

Score: 1

If a value contains no <...> pattern, extract returns NaN for it, so you can fill those rows with the original value. Alternatively, if the < and > only appear at the beginning and end of the email, you can simply strip them:

curr_emails = (curr_emails.str.extract(r"<([^<>]+)>").squeeze()
               .fillna(curr_emails))
# or
curr_emails = curr_emails.str.strip(r'<>')
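
As a self-contained sketch of the two options above, assuming the example input from the question (rebuilt by hand here; the names `filled` and `stripped` are only illustrative), both approaches give the same result on this data:

```python
import pandas as pd

# Example input rebuilt by hand from the question
curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "<doe.c.jane@gmail.com>",
        "person.anonymous@hotmail.com",
        "dent.arthur@space.com",
    ],
    index=[1, 2, 3, 4],
)

# Option 1: extract, then fall back to the original value where nothing matched.
# fillna aligns on the index, so each NaN row receives its own original string.
filled = curr_emails.str.extract(r"<([^<>]+)>", expand=False).fillna(curr_emails)

# Option 2: strip any leading/trailing '<' or '>' characters.
stripped = curr_emails.str.strip("<>")

print(filled)
print(stripped)
```

The two are not interchangeable in general. For a hypothetical value such as `John Doe <doe.b.john@gmail.com>` (not part of the question's data), extract plus fillna would return only the address, while strip would only remove the trailing `>`.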

Answer 2

Score: 0

Try using str.replace instead of str.extract to replace a < at the start of a string or a > at the end of a string with '':

curr_emails.str.replace('^<|>$', '', regex=True)

0            doe.b.john@gmail.com
1            doe.c.jane@gmail.com
2    person.anonymous@hotmail.com
3           dent.arthur@space.com
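
A runnable version of this approach might look like the sketch below, again assuming the example input from the question (rebuilt by hand, here with a default 0-based index to match the output shown above). Note that str.replace returns a new Series, so the result has to be assigned; the name `cleaned` is only illustrative.

```python
import pandas as pd

# Example input rebuilt by hand from the question (default 0-based index)
curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "<doe.c.jane@gmail.com>",
        "person.anonymous@hotmail.com",
        "dent.arthur@space.com",
    ]
)

# Remove a '<' anchored at the start or a '>' anchored at the end.
# regex=True is passed explicitly so the pattern is treated as a regular
# expression rather than a literal string.
cleaned = curr_emails.str.replace(r"^<|>$", "", regex=True)
print(cleaned)
```

Values without angle brackets pass through unchanged, because the pattern simply finds nothing to replace in them.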

huangapple
  • Posted on May 31, 2023, 22:51:52
  • Link to this article: https://go.coder-hub.com/76374785.html