在满足一定条件时删除包含特定字符串值的行(pandas)。

huangapple go评论55阅读模式
英文:

Removing rows containing specific string value when a number of conditions are met (pandas)

问题

如果我有以下的数据框:

df = """
    doc_id      source     sentence_number  Species   Valve   Fretindex
1  33075059515  AAP_news   2                fish      30x          SAR
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
7  87652548931  Bulletin   1                fish      NaN          NaN
8  87652548931  Bulletin   1                turtle    NaN          NaN

如何移除所有包含列'Species'值为'fish'的行,前提是在具有相同'doc_id'、'sentence_number'、'Valve'和'Fretindex'值的另一行中存在不同于'fish'的'Species'值(例如snapper、shark或其他)?

我尝试了不同版本的pandas中的duplicate函数和drop.duplicates函数,如以下代码:

result = df[((result1.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)) & 
         (df['SPECIES'] == df['fish'] )
        ~df.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)]

但是它没有提供正确的输出。我想要实现的是:

df = """
    doc_id      source     sentence_number  Species   Valve   Fretindex
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
8  87652548931  Bulletin   1                turtle    NaN          NaN

如何实现这个目标?

英文:

If I have the following dataframe:

df = """
    doc_id      source     sentence_number  Species   Valve   Fretindex
1  33075059515  AAP_news   2                fish      30x          SAR
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
7  87652548931  Bulletin   1                fish      NaN          NaN
8  87652548931  Bulletin   1                turtle    NaN          NaN

How can I remove all the rows containing value 'fish' for the column 'Species' IF there is a different value for 'Species' (e.g., snapper, shark or other) in another row with the same values for doc_id, sentence_number, Valve and Fretindex?

I tried different versions of the duplicate function and drop.duplicates in pandas such as the following code:

result = df[((result1.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)) & 
         (df['SPECIES'] == df['fish'] )
        ~df.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)]

However it is not giving the correct output. What I would like to achieve:

df = """
    doc_id      source     sentence_number  Species   Valve   Fretindex
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
8  87652548931  Bulletin   1                turtle    NaN          NaN

答案1

得分: 0

以下是获取所需结果的一种方法:

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].index)

你可以得到匹配 duplicated() 调用的行的 doc_id,这些行具有列 doc_id、sentence_number、Valve 和 Fretindex 上的重复项。然后,使用最近获得的文档 ID 进行 df 的筛选。最后,检查 Species 列是否包含 fish,并将那些条目删除。

你提到希望保留仅包含 fish 作为物种的重复条目。你可以在上述查询中添加 drop_duplicates 来实现:

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].drop_duplicates(keep=False).index)

希望这对你有帮助。

英文:

The following is one way to get the desired result:

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].index)

You get the doc_id of the rows that matched duplicated() call for columns doc_id, sentence_number, Valve and Fretindex. Then filter the df with an isin with the doc ids you recently got. Finally, check if the Species column contains fish and drop those entries.

You mentioned that you wanted to keep duplicate entries that only contained fish as their species. You can do that with adding drop_duplicates to the query above:

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].drop_duplicates(keep=False).index)

huangapple
  • 本文由 发表于 2023年3月7日 00:59:22
  • 转载请务必保留本文链接:https://go.coder-hub.com/75653701.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定