英文:
Removing rows containing specific string value when a number of conditions are met (pandas)
问题
如果我有以下的数据框:
df = """
doc_id source sentence_number Species Valve Fretindex
1 33075059515 AAP_news 2 fish 30x SAR
2 33075059515 AAP_news 2 snapper 30x SAR
3 53075235984 Goa Herald 5 fish 50y SAR
4 28865465787 Bulletin 4 turtle 50y SAR
5 28865465787 Bulletin 4 shark 50y NaN
6 28865465787 Bulletin 3 fish NaN SAR
7 87652548931 Bulletin 1 fish NaN NaN
8 87652548931 Bulletin 1 turtle NaN NaN
如何移除所有包含列'Species'值为'fish'的行,前提是在具有相同'doc_id'、'sentence_number'、'Valve'和'Fretindex'值的另一行中存在不同于'fish'的'Species'值(例如snapper、shark或其他)?
我尝试了不同版本的pandas中的duplicate函数和drop.duplicates函数,如以下代码:
result = df[((result1.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)) &
(df['SPECIES'] == df['fish'] )
~df.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)]
但是它没有提供正确的输出。我想要实现的是:
df = """
doc_id source sentence_number Species Valve Fretindex
2 33075059515 AAP_news 2 snapper 30x SAR
3 53075235984 Goa Herald 5 fish 50y SAR
4 28865465787 Bulletin 4 turtle 50y SAR
5 28865465787 Bulletin 4 shark 50y NaN
6 28865465787 Bulletin 3 fish NaN SAR
8 87652548931 Bulletin 1 turtle NaN NaN
如何实现这个目标?
英文:
If I have the following dataframe:
df = """
doc_id source sentence_number Species Valve Fretindex
1 33075059515 AAP_news 2 fish 30x SAR
2 33075059515 AAP_news 2 snapper 30x SAR
3 53075235984 Goa Herald 5 fish 50y SAR
4 28865465787 Bulletin 4 turtle 50y SAR
5 28865465787 Bulletin 4 shark 50y NaN
6 28865465787 Bulletin 3 fish NaN SAR
7 87652548931 Bulletin 1 fish NaN NaN
8 87652548931 Bulletin 1 turtle NaN NaN
How can I remove all the rows containing value 'fish' for the column 'Species' IF there is a different value for 'Species' (e.g., snapper, shark or other) in another row with the same values for doc_id, sentence_number, Valve and Fretindex?
I tried different versions of the duplicate function and drop.duplicates in pandas such as the following code:
result = df[((result1.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)) &
(df['SPECIES'] == df['fish'] )
~df.duplicated(subset = ["doc_id", "sentence_number", "Valve", "Fretindex"], keep=False)]
However it is not giving the correct output. What I would like to achieve:
df = """
doc_id source sentence_number Species Valve Fretindex
2 33075059515 AAP_news 2 snapper 30x SAR
3 53075235984 Goa Herald 5 fish 50y SAR
4 28865465787 Bulletin 4 turtle 50y SAR
5 28865465787 Bulletin 4 shark 50y NaN
6 28865465787 Bulletin 3 fish NaN SAR
8 87652548931 Bulletin 1 turtle NaN NaN
答案1
得分: 0
以下是获取所需结果的一种方法:
df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].index)
你可以得到匹配 duplicated()
调用的行的 doc_id
,这些行具有列 doc_id、sentence_number、Valve 和 Fretindex 上的重复项。然后,使用最近获得的文档 ID 进行 df 的筛选。最后,检查 Species
列是否包含 fish
,并将那些条目删除。
你提到希望保留仅包含 fish
作为物种的重复条目。你可以在上述查询中添加 drop_duplicates
来实现:
df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].drop_duplicates(keep=False).index)
希望这对你有帮助。
英文:
The following is one way to get the desired result:
df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].index)
You get the doc_id
of the rows that matched duplicated()
call for columns doc_id, sentence_number, Valve and Fretindex. Then filter the df with an isin
with the doc ids you recently got. Finally, check if the Species
column contains fish
and drop
those entries.
You mentioned that you wanted to keep duplicate entries that only contained fish
as their species. You can do that with adding drop_duplicates
to the query above:
df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].drop_duplicates(keep=False).index)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论