May 31, 2023 22:51:52

Keep original string values after pandas.series.str.extract() if the regex doesn't match

Question

I am trying to extract email addresses from strings, and I want to make sure that if the original value is already formatted the way I expect, it is kept as is rather than changed to NaN.

Example input

<class 'pandas.core.series.Series'>
1    <doe.b.john@gmail.com>
2    <doe.c.jane@gmail.com>
3    person.anonymous@hotmail.com
4    dent.arthur@space.com

I am using

# curr_emails is <class 'pandas.core.series.Series'>
curr_emails = curr_emails.str.extract(r"<([^<>]+)>").squeeze()  # regex extracts the text between < and >

I receive back

1    doe.b.john@gmail.com
2    doe.c.jane@gmail.com
3    NaN
4    NaN

But I instead would like

1    doe.b.john@gmail.com
2    doe.c.jane@gmail.com
3    person.anonymous@hotmail.com
4    dent.arthur@space.com

A similar question is posted here, but I could not seem to make it work with my current approach.

Answer 1

Score: 1

If a value does not match the <> pattern, you can fill the resulting NaN with the original value. Alternatively, if the < and > only appear at the beginning and end of the email, you can simply strip them:

curr_emails = (curr_emails.str.extract(r"<([^<>]+)>").squeeze()
               .fillna(curr_emails))
# or
curr_emails = curr_emails.str.strip(r'<>')
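For anyone who wants to try the first approach end to end, here is a minimal, self-contained sketch using the sample data from the question. The names `extracted` and `cleaned` are my own, and using `expand=False` in place of `.squeeze()` is a swapped-in variation rather than what the answer wrote: with a single capture group it returns a Series directly.

```python
import pandas as pd

# Sample data mirroring the question (index 1-4).
curr_emails = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "<doe.c.jane@gmail.com>",
        "person.anonymous@hotmail.com",
        "dent.arthur@space.com",
    ],
    index=[1, 2, 3, 4],
)

# With one capture group, expand=False yields a Series instead of a
# one-column DataFrame, so no squeeze() is needed.
extracted = curr_emails.str.extract(r"<([^<>]+)>", expand=False)

# Rows where the regex did not match are NaN; fall back to the
# original value at the same index label.
cleaned = extracted.fillna(curr_emails)

print(cleaned.tolist())
```

One reason to consider `expand=False`: if the Series ever holds a single row, `.squeeze()` on the resulting one-by-one DataFrame collapses it to a scalar, and the chained `.fillna()` call would then fail.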

Answer 2

Score: 0

Try using str.replace instead of str.extract to replace a < at the start of the string or a > at the end of the string with '' (an empty string):

curr_emails.str.replace('^<|>$', '', regex=True)

0            doe.b.john@gmail.com
1            doe.c.jane@gmail.com
2    person.anonymous@hotmail.com
3           dent.arthur@space.com
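The two approaches are not interchangeable once the brackets sit inside a longer string. The sketch below compares them; note that the third value ("John Doe <...>") is a hypothetical input I added for illustration and does not appear in the question.

```python
import pandas as pd

s = pd.Series(
    [
        "<doe.b.john@gmail.com>",
        "dent.arthur@space.com",
        "John Doe <doe.b.john@gmail.com>",  # hypothetical: brackets not at both ends
    ]
)

# Anchored replace: removes a leading "<" and a trailing ">" only.
replaced = s.str.replace(r"^<|>$", "", regex=True)

# Extract with fallback: pulls the bracketed text from anywhere in the
# string and keeps the original value when nothing matches.
extracted = s.str.extract(r"<([^<>]+)>", expand=False).fillna(s)

print(replaced.tolist())
print(extracted.tolist())
```

If the data only ever contains bare or fully wrapped addresses, both give the same result and the replace (or strip) version is simpler. If a display name may precede the address, the extract-plus-fillna version is probably the safer choice.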
