Filtering a pandas DataFrame based on the presence of substrings in a column

Question

Not sure if this is a 'filtering with pandas' question or one of text analysis, however:

Given a df,

import pandas as pd

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df

Resulting in

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"

And a list of terms to match:

terms = ["new york", "fish"]

How would you reduce the df to the following rows, based on whether any substring in terms is found in column report, while preserving item?

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"

Answer 1

Score: 2

Try this:

Using word boundaries in the regex ensures that "fish" is matched but "fishy" is not (as an example):

m = df['report'].str.contains(r'\b{}\b'.format(r'\b|\b'.join(terms)))

df2 = df.loc[m]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
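
The word-boundary pattern above inserts each term into the regex verbatim, so a term containing regex metacharacters (such as `.` or `+`) would be interpreted as pattern syntax. A minimal sketch of a more defensive variant, assuming the terms should always be matched literally; the use of `re.escape` and a non-capturing group is an addition here, not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "item": ["a", "b", "c", "d"],
        "report": [
            "john rode the subway through new york",
            "sally says she no longer wanted any fish, but",
            "was not submitted",
            "the doctor proceeded to call washington and new york",
        ],
    }
)
terms = ["new york", "fish"]

# Escape each term so regex metacharacters are matched literally, then wrap
# the alternation in a non-capturing group between word boundaries.
pattern = r"\b(?:{})\b".format("|".join(re.escape(t) for t in terms))

mask = df["report"].str.contains(pattern, regex=True)
df2 = df.loc[mask]
print(df2)
```

The non-capturing group `(?:...)` keeps the boundaries applied to the whole alternation without introducing a capture group in the pattern.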

Answer 2

Score: 1

Pulling from another answer [here][1]:

You can turn your `terms` into a single regex-usable string (that is, `|`-delimited) and then use `Series.str.contains`.

```python
term_str = '|'.join(terms)  # makes the string 'new york|fish'
df[df['report'].str.contains(term_str)]
```
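
If the column may contain missing values or mixed letter case, `str.contains` accepts `na` and `case` arguments that cover both. The following is a small sketch under those assumptions; the `reports` Series is made-up sample data, not taken from the question:

```python
import pandas as pd

# Hypothetical sample data with mixed case and a missing value
reports = pd.Series(
    ["John rode the subway through New York", None, "was not submitted"]
)
terms = ["new york", "fish"]
term_str = "|".join(terms)

# case=False ignores letter case; na=False treats missing values as non-matches
mask = reports.str.contains(term_str, case=False, na=False)
print(reports[mask])
```

Without `na=False`, the missing entry would produce a missing value in the mask, which cannot be used directly for boolean indexing.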

Answer 3

Score: 1

Try this:

df[df['report'].apply(lambda x: any(term in x for term in terms))]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
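
Note that the `in` test above is a plain substring check, so it behaves differently from a word-boundary regex when a term appears inside a longer word. A short sketch of the difference, using made-up sample rows rather than the data from the question:

```python
import pandas as pd

terms = ["new york", "fish"]
# Hypothetical sample data chosen to show the difference
reports = pd.Series(["something smells fishy", "was not submitted"])

# Plain substring test: "fish" is found inside "fishy"
substring_mask = reports.apply(lambda text: any(term in text for term in terms))

# Word-boundary regex: "fishy" does not count as a match for "fish"
boundary_mask = reports.str.contains(r"\b(?:new york|fish)\b")

print(substring_mask.tolist(), boundary_mask.tolist())
```

Which behavior is preferable depends on whether partial-word hits should count as matches.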

Answer 4

Score: 1

Another possible solution, which is based on numpy:

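import numpy as np

strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)

# find() returns -1 where a substring is absent; broadcasting compares every report with every term
index = np.char.find(strings[:, None], substrings)
mask = (index >= 0).any(axis=1)

df.loc[mask]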

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
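
To make the broadcasting step in the numpy approach easier to follow, here is a self-contained sketch that rebuilds the same arrays from the question's data and inspects the intermediate result. It assumes a NumPy version that provides `np.char.find`:

```python
import numpy as np

strings = np.array(
    [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
    dtype=str,
)
substrings = np.array(["new york", "fish"])

# strings[:, None] has shape (4, 1) and substrings has shape (2,), so the
# result has one row per report and one column per term.
index = np.char.find(strings[:, None], substrings)

# A value of -1 means "not found"; a row matches if any term was found.
mask = (index >= 0).any(axis=1)
print(index.shape, mask)
```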

huangapple
  • Posted on 2023-02-09 02:08:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75390017.html