Pandas Regex Extract to New Column If Condition
Question
I am trying to run a script that will look at the value in a cell along a specific column ("Name") and check to see if it contains a particular string. If the cell contains the string, the script will look at the value contained in the "Value" cell and either copy the value into the "New_Value" column and append a string or regex extract a portion of the value and append a string into the new column.
The issue is with the regex extraction of the phone numbers. If I remove the regex portion, the script will copy the cell value into the New_Value cell, with the appended [PHONE] string. When I include the regex extraction, it doesn't copy anything over, like it didn't see a match to the regex. The weird thing is that if I alter the code so the regex just returns a True/False based on a Match, it returns a True for a match.
Am I somehow screwing up regex extraction? Any help is appreciated, thank you.
My code
import pandas as pd
import numpy as np
import re
df = pd.read_csv("path_to_file")
df.loc[df["Name"].str.contains("Number"), "New_Value"] = (
    df["Value"].str.extract(r"(\d{10,13})") + "[PHONE]"
)
df.loc[df["Name"].str.contains("Animal"), "New_Value"] = (
    df["Value"] + "[PET]"
)
df.loc[df["Name"].str.contains("Company"), "New_Value"] = (
    df["Value"] + "[COMPANY]"
)
df.loc[df["Name"].str.contains("Contact"), "New_Value"] = (
    df["Value"] + "[NAME]"
)
Sample Data
| Name | Value |
| -------- | --------------- |
| Number | +639892943187 |
| Company | Ebert_LLC |
| Animal | Snake |
| Contact | Michal_Ashburner |
| Number | 668949201578 |
| Company | Hoppe_LLC |
| Animal | European_badger |
| Contact | Marlee_Gofford |
Desired result
| Name | Value | New_Value |
| -------- | --------------- | ---------------------- |
| Number | +639892943187 | 639892943187[PHONE] |
| Company | Ebert_LLC | Ebert_LLC[COMPANY] |
| Animal | Snake | Snake[PET] |
| Contact | Michal_Ashburner | Michal_Ashburner[NAME] |
| Number | 668949201578 | 668949201578[PHONE] |
| Company | Hoppe_LLC | Hoppe_LLC[COMPANY] |
| Animal | European_badger | European_badger[PET] |
| Contact | Marlee_Gofford | Marlee_Gofford[NAME] |
Answer 1
Score: 1
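This should do the job:
df.loc[df.Name == 'Number', 'New_Value'] = df.Value.str.replace('+', '', regex=False) + '[PHONE]'
df.loc[df.Name == 'Animal', 'New_Value'] = df.Value + '[PET]'
df.loc[df.Name == 'Company', 'New_Value'] = df.Value + '[COMPANY]'
df.loc[df.Name == 'Contact', 'New_Value'] = df.Value + '[NAME]'
The phone line uses .str.replace because it removes the '+' inside each string; Series.replace with regex=False only replaces cells whose entire value equals '+'.
Note that .loc is an indexer (used with square brackets, not called) and can also create new columns or rows in a DataFrame.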
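If the regex extraction from the question is preferred over stripping the '+', the likely culprit is the return type of str.extract. With the default expand=True, a pattern with one capture group returns a DataFrame whose single column is labelled 0, and assigning that to the New_Value column does not seem to line up, which would explain why nothing is copied. Passing expand=False returns a Series instead. The following is only a sketch under that assumption; the in-memory sample rows, the is_number name, and the suffixes dict are illustrative stand-ins for the CSV and the separate assignments in the question.

```python
import pandas as pd

# A few sample rows copied from the question (stand-in for the CSV file).
df = pd.DataFrame({
    "Name": ["Number", "Company", "Animal", "Contact"],
    "Value": ["+639892943187", "Ebert_LLC", "Snake", "Michal_Ashburner"],
})

# Default expand=True: one capture group still yields a DataFrame (column 0).
as_frame = df["Value"].str.extract(r"(\d{10,13})")

# expand=False: the same pattern yields a Series aligned on df's index.
digits = df["Value"].str.extract(r"(\d{10,13})", expand=False)

# Phone rows: keep only the extracted digits and append the tag.
is_number = df["Name"].str.contains("Number")
df.loc[is_number, "New_Value"] = digits + "[PHONE]"

# Remaining rows: copy the value unchanged and append the matching tag.
suffixes = {"Animal": "[PET]", "Company": "[COMPANY]", "Contact": "[NAME]"}
for key, tag in suffixes.items():
    mask = df["Name"].str.contains(key)
    df.loc[mask, "New_Value"] = df["Value"] + tag

print(df)
```

Selecting column 0 from the default result (str.extract(...)[0]) should work equally well; expand=False just makes the intent explicit.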