August 9, 2023, 15:29:02

Unable to Filter based on Substring in Pandas

Question

There is a dataset in this form:

company_url         Name                  Revenue
mackter.com         Mack Sander           NaN
nientact.com        Neient Dan            321
ventienty.com       Richard               NaN

My task is to remove every row where the string 'tac', 'bux', or 'mvy' appears in either the 'company_url' or the 'Name' column. As you can see, 'tac' is present in nientact.com, so that row should be deleted. Likewise, any row where one of these three strings appears in either company_url or Name should be deleted. Initially I tried this for the company_url column only and wrote the code below, but it raises an error.

lists = ['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df['company_url'].str.contains(i)]

but it shows:
TypeError: unhashable type: 'list'

Answer 1

Score: 2

You can craft a regex to use with str.contains, then aggregate with any, invert with ~, and perform boolean indexing:

import re
lists = ['tac', 'bux', 'mvy']
pattern = '|'.join(map(re.escape, lists))
# 'tac|bux|mvy'
out = df[~df[['company_url', 'Name']]
          .apply(lambda s: s.str.contains(pattern, case=False))
          .any(axis=1)
        ]

Output:

     company_url         Name  Revenue
0    mackter.com  Mack Sander      NaN
2  ventienty.com      Richard      NaN
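
If you want to try this end to end, here is a minimal self-contained sketch. It assumes the three sample rows from the question are the whole frame. The `na=False` argument is my addition (it is not in the snippet above): it makes missing values count as "no match", which may matter if the searched columns can contain NaN. The names `substrings` and `hits` are only illustrative.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "company_url": ["mackter.com", "nientact.com", "ventienty.com"],
    "Name": ["Mack Sander", "Neient Dan", "Richard"],
    "Revenue": [float("nan"), 321, float("nan")],
})

substrings = ["tac", "bux", "mvy"]
# Escape each substring so regex metacharacters are matched literally.
pattern = "|".join(map(re.escape, substrings))

# One boolean column per searched column.
# na=False (an assumption added here) treats missing values as "no match".
hits = df[["company_url", "Name"]].apply(
    lambda s: s.str.contains(pattern, case=False, na=False)
)

# Keep only the rows where none of the searched columns matched.
out = df[~hits.any(axis=1)]
print(out)
```

Only the nientact.com row contains one of the substrings, so it is the one that gets dropped.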

For reference, here is a fix for your loop, but don't use it, as it is inefficient:

lists = ['tac', 'bux', 'mvy']
for i in lists:
    df = df[~df[['company_url', 'Name']]
               .apply(lambda s: s.str.contains(i))
               .any(axis=1)]
# or
lists = ['tac', 'bux', 'mvy']
for i in lists:
    for col in ['company_url', 'Name']:
        df = df[~df[col].str.contains(i)]
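
As a quick sanity check, the sketch below compares the nested-loop version with the single-pattern version on the sample rows from the question. A few things here are my assumptions rather than part of the original snippets: the sample frame itself, the names `looped`, `single`, and `cols`, and the use of case-sensitive matching in both versions so that the comparison is like for like (the loop passes `regex=False` because each item is a plain substring).

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "company_url": ["mackter.com", "nientact.com", "ventienty.com"],
    "Name": ["Mack Sander", "Neient Dan", "Richard"],
    "Revenue": [float("nan"), 321, float("nan")],
})

substrings = ["tac", "bux", "mvy"]
cols = ["company_url", "Name"]

# Loop version: the frame is re-filtered once per (substring, column) pair.
looped = df.copy()
for sub in substrings:
    for col in cols:
        looped = looped[~looped[col].str.contains(sub, regex=False)]

# Single-pattern version: one str.contains call per column, one filter overall.
pattern = "|".join(map(re.escape, substrings))
mask = df[cols].apply(lambda s: s.str.contains(pattern)).any(axis=1)
single = df[~mask]

# True when both approaches keep exactly the same rows.
print(looped.equals(single))
```

The loop makes one filtering pass per substring and column, while the joined pattern needs only one `str.contains` call per column, which is why the regex version is the one to prefer.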
