Pandas DataFrame: how can I remove rows by comparing the regex output of columns A and B?

Question

I have a dataframe with two columns, both containing strings.

I want to delete the rows where the named capture group "termduration" extracted from column B (Resources) does not match the same group extracted from column A (ServicePlan).

Data:

                    ServicePlan                         Resources
0  Plan A (CSP COM BAS 1YR ANN)  Resource A (CSP COM BAS 1YR ANN)
1  Plan A (CSP COM BAS 1YR ANN)  Resource B (CSP COM BAS 1YR ANN)
2  Plan A (CSP COM BAS 1YR ANN)  Resource C (CSP COM BAS 6YR ANN)

I tried the following, but I get a type error. I am struggling to compare the result of a regex named capture group between two strings.

import pandas as pd
import re

e_name = r'(?P<name>.*)\((?P<product>[A-Z]{3})\s(?P<type>[A-Z]{3})\s?(?P<baseattach>BAS|ADD|ATT|SWS)?\s?(?P<telco_overusage>OVG)?\s?(?P<termduration>[A-Z0-9]{3})?\s?(?P<billing>[A-Z0-9]{3})\)$'
name_re = re.compile(e_name)

data = {'ServicePlan': ["Plan A (CSP COM BAS 1YR ANN)","Plan A (CSP COM BAS 1YR ANN)","Plan A (CSP COM BAS 1YR ANN)"],
        'Resources': ["Resource A (CSP COM BAS 1YR ANN)","Resource B (CSP COM BAS 1YR ANN)","Resource C (CSP COM BAS 6YR ANN)"]}

df = pd.DataFrame(data)
print(df)
df[~(name_re.findall(df['ServicePlan'].astype(str))[0]['termduration']).ne(name_re.findall(df['Resource'].astype(str))[0]['termduration'])]
print(df)
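
A note on the attempt above: a compiled pattern's findall method takes a single string, so passing a whole Series raises a TypeError, and its result is a list of tuples that cannot be indexed by group name. A minimal sketch of reading the named group from one string at a time with search and group follows. The helper name term_of and the sample variable names are hypothetical and not part of the original post.

```python
import re

# Pattern copied from the question, split across lines for readability.
name_re = re.compile(
    r'(?P<name>.*)\((?P<product>[A-Z]{3})\s(?P<type>[A-Z]{3})\s?'
    r'(?P<baseattach>BAS|ADD|ATT|SWS)?\s?(?P<telco_overusage>OVG)?\s?'
    r'(?P<termduration>[A-Z0-9]{3})?\s?(?P<billing>[A-Z0-9]{3})\)$'
)


def term_of(text):
    """Return the 'termduration' group for one string, or None if there is no match."""
    match = name_re.search(text)
    return match.group('termduration') if match else None


plan = 'Plan A (CSP COM BAS 1YR ANN)'
resource_same = 'Resource A (CSP COM BAS 1YR ANN)'
resource_other = 'Resource C (CSP COM BAS 6YR ANN)'

print(term_of(plan) == term_of(resource_same))   # True
print(term_of(plan) == term_of(resource_other))  # False
```

This works on one pair of strings at a time; applying it to every row of a dataframe would need a row-wise loop or apply, which is usually slower than a vectorized string method.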

Answer 1

Score: 1

Use pandas.Series.str.extract:

df = df[df['ServicePlan'].str.extract(name_re, expand=False)['termduration']
        .eq(df['Resources'].str.extract(name_re, expand=False)['termduration'])]
print(df)

                    ServicePlan                         Resources
0  Plan A (CSP COM BAS 1YR ANN)  Resource A (CSP COM BAS 1YR ANN)
1  Plan A (CSP COM BAS 1YR ANN)  Resource B (CSP COM BAS 1YR ANN)
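
For anyone who wants to run this end to end, here is a self-contained sketch of the same str.extract comparison, split into steps. It assumes the sample data from the question. The names pattern, plan_term, resource_term and filtered are illustrative, and passing the pattern as a plain string rather than a compiled object is a choice made here, not something the answer requires.

```python
import pandas as pd

# Pattern copied from the question, kept as a plain string for str.extract.
pattern = (
    r'(?P<name>.*)\((?P<product>[A-Z]{3})\s(?P<type>[A-Z]{3})\s?'
    r'(?P<baseattach>BAS|ADD|ATT|SWS)?\s?(?P<telco_overusage>OVG)?\s?'
    r'(?P<termduration>[A-Z0-9]{3})?\s?(?P<billing>[A-Z0-9]{3})\)$'
)

df = pd.DataFrame({
    'ServicePlan': ['Plan A (CSP COM BAS 1YR ANN)'] * 3,
    'Resources': [
        'Resource A (CSP COM BAS 1YR ANN)',
        'Resource B (CSP COM BAS 1YR ANN)',
        'Resource C (CSP COM BAS 6YR ANN)',
    ],
})

# str.extract applies the pattern to every element and returns a DataFrame
# with one column per named group; only 'termduration' is needed here.
plan_term = df['ServicePlan'].str.extract(pattern)['termduration']
resource_term = df['Resources'].str.extract(pattern)['termduration']

# Keep rows where both columns yielded the same term. A missing value never
# compares equal, so rows without a matched term are dropped as well.
filtered = df[plan_term.eq(resource_term)]
print(filtered)
```

Rows where either string fails to match the pattern come back as missing values and are therefore filtered out. If those rows should be kept instead, the boolean mask would need an extra condition.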

huangapple
  • Published on July 17, 2023, 23:42:53
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76706114.html