August 9, 2023, 00:28:12

Pandas Regex Extract to New Column If Condition

Question

I am trying to run a script that looks at the value in each cell of a specific column ("Name") and checks whether it contains a particular string. If it does, the script looks at the corresponding "Value" cell and either copies the value into the "New_Value" column with a string appended, or regex-extracts a portion of the value and appends a string to that in the new column.

The issue is with the regex extraction of the phone numbers. If I remove the regex portion, the script copies the cell value into the New_Value cell with the [PHONE] string appended. When I include the regex extraction, it doesn't copy anything over, as if it didn't see a match for the regex. The weird thing is that if I alter the code so the regex just returns True/False based on a match, it returns True.

Am I somehow screwing up regex extraction? Any help is appreciated, thank you.

My code

import pandas as pd
import numpy as np
import re

df = pd.read_csv("path_to_file")

df.loc[df["Name"].str.contains("Number"), "New_Value"] = (
    df["Value"].str.extract(r"(\d{10,13})") + "[PHONE]"
)
df.loc[df["Name"].str.contains("Animal"), "New_Value"] = (
    df["Value"] + "[PET]"
)
df.loc[df["Name"].str.contains("Company"), "New_Value"] = (
    df["Value"] + "[COMPANY]"
)
df.loc[df["Name"].str.contains("Contact"), "New_Value"] = (
    df["Value"] + "[NAME]"
)

Sample Data

| Name     | Value             |
| -------- | ---------------   |
| Number   | +639892943187     |
| Company  | Ebert_LLC         |
| Animal   | Snake             |
| Contact  | Michal_Ashburner  |
| Number   | 668949201578      |
| Company  | Hoppe_LLC         |
| Animal   | European_badger   |
| Contact  | Marlee_Gofford    |

Desired result

| Name     | Value             | New_Value              |
| -------- | ---------------   | ---------------------- |
| Number   | +639892943187     | 639892943187[PHONE]    |
| Company  | Ebert_LLC         | Ebert_LLC[COMPANY]     |
| Animal   | Snake             | Snake[PET]             |
| Contact  | Michal_Ashburner  | Michal_Ashburner[NAME] |
| Number   | 668949201578      | 668949201578[PHONE]    |
| Company  | Hoppe_LLC         | Hoppe_LLC[COMPANY]     |
| Animal   | European_badger   | European_badger[PET]   |
| Contact  | Marlee_Gofford    | Marlee_Gofford[NAME]   |

Answer 1

Score: 1

This should do the job:

# .str.replace removes the "+" inside each string; plain Series.replace with
# regex=False would only replace values that are exactly "+".
df.loc[df.Name == 'Number', 'New_Value'] = df.Value.str.replace('+', '', regex=False) + '[PHONE]'
df.loc[df.Name == 'Animal', 'New_Value'] = df.Value + '[PET]'
df.loc[df.Name == 'Company', 'New_Value'] = df.Value + '[COMPANY]'
df.loc[df.Name == 'Contact', 'New_Value'] = df.Value + '[NAME]'

Note that .loc can be used to create new columns or rows in a DataFrame.
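
If you would rather keep the regex from the question, a likely reason the phone rows came out empty is that `str.extract` returns a DataFrame by default (`expand=True`), even with a single capture group, so the assignment through `.loc` has no matching column label to align on. The sketch below passes `expand=False` to get a Series instead. The inline DataFrame is a stand-in for the CSV, and the names `digits` and `suffixes` are made up for illustration; treat it as one possible fix, not the only one.

```python
import pandas as pd

# Hypothetical inline frame mirroring the sample data (stands in for the CSV).
df = pd.DataFrame(
    {
        "Name": ["Number", "Company", "Animal", "Contact"],
        "Value": ["+639892943187", "Ebert_LLC", "Snake", "Michal_Ashburner"],
    }
)

# expand=False makes str.extract return a Series (indexed like df) when the
# pattern has a single capture group, instead of a one-column DataFrame.
digits = df["Value"].str.extract(r"(\d{10,13})", expand=False)

# Suffix appended for each Name keyword, taken from the question.
suffixes = {"Animal": "[PET]", "Company": "[COMPANY]", "Contact": "[NAME]"}

# Phone rows use the extracted digits; all other rows copy Value unchanged.
df.loc[df["Name"].str.contains("Number"), "New_Value"] = digits + "[PHONE]"
for keyword, suffix in suffixes.items():
    df.loc[df["Name"].str.contains(keyword), "New_Value"] = df["Value"] + suffix

print(df)
```

Selecting column 0 of the default result, `df["Value"].str.extract(r"(\d{10,13})")[0]`, should work the same way, since that also yields a Series.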
