February 9, 2023, 02:08:25

Filtering a pandas DataFrame based on the presence of substrings in a column

Question

Not sure if this is a 'filtering with pandas' question or one of text analysis, however:

Given a df,

import pandas as pd

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df

Resulting in

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"

And a list of terms to match:

terms = ["new york", "fish"]

How would you reduce the df to the following rows, keeping a row whenever any substring in terms is found in column report, so that item is preserved?

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"

Answer 1

Score: 2

Try this:

Wrapping each term in word boundaries (\b) ensures that "fish" is matched but "fishy" is not, for example.

# Builds the pattern \bnew york\b|\bfish\b
m = df['report'].str.contains(r'\b{}\b'.format(r'\b|\b'.join(terms)))
df2 = df.loc[m]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
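
If the terms may contain regex metacharacters (such as `.` or `+`), or the column may contain missing values, a slightly more defensive variant of the same idea might look like the sketch below. The `re.escape` call, the non-capturing group, `na=False`, and the two extra sample rows (`e` and `f`) are additions for illustration and are not part of the original answer.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d", "e", "f"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
            "something smells fishy",  # hypothetical row: must NOT match "fish"
            None,                      # hypothetical row: missing report
        ],
    }
)
terms = ["new york", "fish"]

# Escape each term so it is matched literally, then wrap the whole
# alternation in a single pair of word boundaries.
pattern = r"\b(?:{})\b".format("|".join(map(re.escape, terms)))

# na=False treats missing reports as non-matches.
mask = df["report"].str.contains(pattern, na=False)
matched_items = df.loc[mask, "item"].tolist()
print(matched_items)
```

Rows `e` and `f` are excluded here: "fishy" fails the trailing word boundary, and the missing report is treated as a non-match.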

Answer 2

Score: 1

Pulling from another answer [here][1]:
You can join your `terms` into a single regex-usable string (that is, `|`-delimited) and then use `Series.str.contains`. Note that this matches plain substrings, so "fish" would also match "fishy".
```python
term_str = '|'.join(terms)  # makes the string 'new york|fish'
df[df['report'].str.contains(term_str)]
```

Answer 3

Score: 1

Try this:

df[df['report'].apply(lambda x: any(term in x for term in terms))]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
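
This approach does plain substring checks, so no regex escaping is needed, but the lambda raises a `TypeError` when `report` contains missing values such as `None` or `NaN`, because they do not support the `in` operator. A minimal sketch of one way to guard against that follows; the helper name `contains_any` and the extra row `e` with a missing report are hypothetical.

```python
import pandas as pd


def contains_any(text, terms):
    """Return True if text is a string containing at least one of the terms."""
    return isinstance(text, str) and any(term in text for term in terms)


df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d", "e"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
            None,  # hypothetical row: missing report
        ],
    }
)
terms = ["new york", "fish"]

# Non-string values (None, NaN) are treated as non-matches by the helper.
mask = df["report"].apply(lambda text: contains_any(text, terms))
result = df[mask]
print(result["item"].tolist())
```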

Answer 4

Score: 1

Another possible solution, which is based on numpy:

import numpy as np

strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)
# Broadcast to shape (n_rows, n_terms); find returns -1 where a term is absent
index = np.char.find(strings[:, None], substrings)
mask = (index >= 0).any(axis=1)
df.loc[mask]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
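
To see why this works: `strings[:, None]` has shape `(n_rows, 1)` and `substrings` has shape `(n_terms,)`, so `np.char.find` broadcasts them to an `(n_rows, n_terms)` array holding the position of each term in each report, or `-1` where the term is absent. A small sketch that inspects these intermediate values, using the data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
        ],
    }
)
terms = ["new york", "fish"]

strings = np.array(df["report"], dtype=str)  # shape (4,)
substrings = np.array(terms)                 # shape (2,)

# One row per report, one column per term.
index = np.char.find(strings[:, None], substrings)
print(index.shape)  # (4, 2)
print(index[2])     # report "c" contains neither term: [-1 -1]

# A report is kept if any term was found (position >= 0).
mask = (index >= 0).any(axis=1)
print(df.loc[mask, "item"].tolist())
```

Note that `dtype=str` turns missing values into literal strings such as `'nan'` or `'None'`, so it may be worth dropping or filling them beforehand if the column can contain gaps.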
