Pandas Regex Extract to New Column If Condition

Question

I am trying to run a script that will look at the value in a cell along a specific column ("Name") and check to see if it contains a particular string. If the cell contains the string, the script will look at the value contained in the "Value" cell and either copy the value into the "New_Value" column and append a string or regex extract a portion of the value and append a string into the new column.

The issue is with the regex extraction of the phone numbers. If I remove the regex portion, the script will copy the cell value into the New_Value cell, with the appended [PHONE] string. When I include the regex extraction, it doesn't copy anything over, like it didn't see a match to the regex. The weird thing is that if I alter the code so the regex just returns a True/False based on a Match, it returns a True for a match.

Am I somehow screwing up regex extraction? Any help is appreciated, thank you.

My code

import pandas as pd
import numpy as np
import re

df = pd.read_csv("path_to_file")

df.loc[df["Name"].str.contains("Number"), "New_Value"] = (
    df["Value"].str.extract(r"(\d{10,13})") + "[PHONE]"
)
df.loc[df["Name"].str.contains("Animal"), "New_Value"] = (
    df["Value"] + "[PET]"
)
df.loc[df["Name"].str.contains("Company"), "New_Value"] = (
    df["Value"] + "[COMPANY]"
)
df.loc[df["Name"].str.contains("Contact"), "New_Value"] = (
    df["Value"] + "[NAME]"
)

Sample Data

| Name     | Value             |
| -------- | ---------------   |
| Number   | +639892943187     |
| Company  | Ebert_LLC         |
| Animal   | Snake             |
| Contact  | Michal_Ashburner  |
| Number   | 668949201578      |
| Company  | Hoppe_LLC         |
| Animal   | European_badger   |
| Contact  | Marlee_Gofford    |

Desired result

| Name     | Value             | New_Value              |
| -------- | ---------------   | ---------------------- |
| Number   | +639892943187     | 639892943187[PHONE]    |
| Company  | Ebert_LLC         | Ebert_LLC[COMPANY]     |
| Animal   | Snake             | Snake[PET]             |
| Contact  | Michal_Ashburner  | Michal_Ashburner[NAME] |
| Number   | 668949201578      | 668949201578[PHONE]    |
| Company  | Hoppe_LLC         | Hoppe_LLC[COMPANY]     |
| Animal   | European_badger   | European_badger[PET]   |
| Contact  | Marlee_Gofford    | Marlee_Gofford[NAME]   |

Answer 1

Score: 1

This should do the job:

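# Use .str.replace for substring replacement; Series.replace with regex=False only matches whole cell values.
df.loc[df.Name == 'Number', 'New_Value'] = df.Value.str.replace('+', '', regex=False) + '[PHONE]'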
df.loc[df.Name == 'Animal', 'New_Value'] = df.Value + '[PET]'
df.loc[df.Name == 'Company', 'New_Value'] = df.Value + '[COMPANY]'
df.loc[df.Name == 'Contact', 'New_Value'] = df.Value + '[NAME]'

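Note that .loc is an indexer (used with square brackets, not called like a function), and assigning through it can also create a new column such as New_Value when it does not exist yet.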
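
As for why the original regex version copied nothing: the most likely explanation is that Series.str.extract returns a DataFrame by default (expand=True), even with a single capture group. Its only column is labelled 0, so the assignment into "New_Value" has nothing to align with, while str.contains still reports a match. A minimal sketch of keeping the regex approach, assuming the sample data from the question (the names digits, is_number and suffixes are illustrative only):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["Number", "Company", "Animal", "Contact",
             "Number", "Company", "Animal", "Contact"],
    "Value": ["+639892943187", "Ebert_LLC", "Snake", "Michal_Ashburner",
              "668949201578", "Hoppe_LLC", "European_badger", "Marlee_Gofford"],
})

# Default behaviour: one capture group still yields a DataFrame (column label 0).
as_frame = df["Value"].str.extract(r"(\d{10,13})")

# expand=False returns a Series, which aligns with df on the row index.
digits = df["Value"].str.extract(r"(\d{10,13})", expand=False)

is_number = df["Name"].str.contains("Number")
df.loc[is_number, "New_Value"] = digits + "[PHONE]"

# The remaining categories only need a suffix appended to the full value.
suffixes = {"Animal": "[PET]", "Company": "[COMPANY]", "Contact": "[NAME]"}
for key, suffix in suffixes.items():
    mask = df["Name"].str.contains(key)
    df.loc[mask, "New_Value"] = df["Value"] + suffix

print(df)
```

Selecting the first column of the extracted frame, as in df["Value"].str.extract(r"(\d{10,13})")[0], should work equally well. The regex route also copes with values that carry other non-digit characters around the number, which a plain '+' replacement would leave in place.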

Posted on 2023-08-09 00:28:12
Source: https://go.coder-hub.com/76861506.html