Unable to Filter based on Substring in Pandas

Question

There is a dataset in this form:

company_url         Name                  Revenue
mackter.com         Mack Sander           NaN
nientact.com        Neient Dan            321
ventienty.com       Richard               NaN

I need to remove every row where any of the strings 'tac', 'bux' or 'mvy' appears in either the 'company_url' or the 'Name' column. For example, 'tac' appears in nientact.com, so that row should be dropped. I first tried this on the company_url column alone with the code below, but it raises an error.

lists = ['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df['company_url'].str.contains(i)]

The error is:
TypeError: unhashable type: 'list'

Answer 1

Score: 2

You can craft a regex to use with str.contains, then aggregate with any, invert with ~, and perform boolean indexing:

import re

lists = ['tac', 'bux', 'mvy']
pattern = '|'.join(map(re.escape, lists))
# 'tac|bux|mvy'

out = df[~df[['company_url', 'Name']]
          .apply(lambda s: s.str.contains(pattern, case=False))
          .any(axis=1)
        ]

Output:

     company_url         Name  Revenue
0    mackter.com  Mack Sander      NaN
2  ventienty.com      Richard      NaN
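
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The DataFrame is rebuilt by hand from the sample in the question, so it is an assumption about the real data. The `na=False` argument is an addition not in the answer above: it makes missing values in the text columns count as "no match" rather than propagating NaN into the mask.

```python
import re

import pandas as pd

# Sample data reconstructed from the question (an assumption about the real dataset).
df = pd.DataFrame({
    "company_url": ["mackter.com", "nientact.com", "ventienty.com"],
    "Name": ["Mack Sander", "Neient Dan", "Richard"],
    "Revenue": [float("nan"), 321.0, float("nan")],
})

substrings = ["tac", "bux", "mvy"]
# Escape each substring so any regex metacharacters are matched literally.
pattern = "|".join(map(re.escape, substrings))

# True where a cell contains any of the substrings (case-insensitive);
# na=False treats missing cells as "no match".
hits = df[["company_url", "Name"]].apply(
    lambda s: s.str.contains(pattern, case=False, na=False)
)

# Keep only the rows with no hit in either column.
out = df[~hits.any(axis=1)]
print(out)
```

Only the nientact.com row contains one of the substrings, so that is the only row the mask removes.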

For reference, here is a fix of your loop, but avoid it, since it re-filters the DataFrame on every iteration and is inefficient:

lists=['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df[['company_url', 'Name']]
               .apply(lambda s: s.str.contains(i))
               .any(axis=1)]

# or

lists=['tac', 'bux', 'mvy']
for i in lists:
    for col in ['company_url', 'Name']:
        df = df[~df[col].str.contains(i)]
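
If a loop is still preferred, one possible variation is to accumulate a single boolean mask and filter only once at the end. This is a sketch rather than part of the original answer: the sample DataFrame and the names `columns`, `mask` and `filtered` are assumptions for illustration. `regex=False` makes each substring a literal match, and, like the loops above, the match is case-sensitive.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about the real dataset).
df = pd.DataFrame({
    "company_url": ["mackter.com", "nientact.com", "ventienty.com"],
    "Name": ["Mack Sander", "Neient Dan", "Richard"],
    "Revenue": [float("nan"), 321.0, float("nan")],
})

substrings = ["tac", "bux", "mvy"]
columns = ["company_url", "Name"]

# Start from "no row matches" and OR in each literal substring test.
mask = pd.Series(False, index=df.index)
for col in columns:
    for sub in substrings:
        mask |= df[col].str.contains(sub, regex=False, na=False)

# Filter once, keeping the rows that matched nothing.
filtered = df[~mask]
print(filtered)
```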

huangapple
  • Published on August 9, 2023, 15:29:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76865511-2.html