2023年3月7日 00:59:22go评论58阅读模式

英文:

Removing rows containing specific string value when a number of conditions are met (pandas)

问题

如果我有以下的数据框：

df = &quot;&quot;&quot;
    doc_id      source     sentence_number  Species   Valve   Fretindex
1  33075059515  AAP_news   2                fish      30x          SAR
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
7  87652548931  Bulletin   1                fish      NaN          NaN
8  87652548931  Bulletin   1                turtle    NaN          NaN

如何移除所有包含列'Species'值为'fish'的行，前提是在具有相同'doc_id'、'sentence_number'、'Valve'和'Fretindex'值的另一行中存在不同于'fish'的'Species'值（例如snapper、shark或其他）？

我尝试了不同版本的pandas中的duplicate函数和drop.duplicates函数，如以下代码：

result = df[((result1.duplicated(subset = [&quot;doc_id&quot;, &quot;sentence_number&quot;, &quot;Valve&quot;, &quot;Fretindex&quot;], keep=False)) &amp; 
         (df[&#39;SPECIES&#39;] == df[&#39;fish&#39;] )
        ~df.duplicated(subset = [&quot;doc_id&quot;, &quot;sentence_number&quot;, &quot;Valve&quot;, &quot;Fretindex&quot;], keep=False)]

但是它没有提供正确的输出。我想要实现的是：

df = &quot;&quot;&quot;
    doc_id      source     sentence_number  Species   Valve   Fretindex
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
8  87652548931  Bulletin   1                turtle    NaN          NaN

如何实现这个目标？

英文:

If I have the following dataframe:

df = &quot;&quot;&quot;
    doc_id      source     sentence_number  Species   Valve   Fretindex
1  33075059515  AAP_news   2                fish      30x          SAR
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
7  87652548931  Bulletin   1                fish      NaN          NaN
8  87652548931  Bulletin   1                turtle    NaN          NaN

How can I remove all the rows containing value 'fish' for the column 'Species' IF there is a different value for 'Species' (e.g., snapper, shark or other) in another row with the same values for doc_id, sentence_number, Valve and Fretindex?

I tried different versions of the duplicate function and drop.duplicates in pandas such as the following code:

result = df[((result1.duplicated(subset = [&quot;doc_id&quot;, &quot;sentence_number&quot;, &quot;Valve&quot;, &quot;Fretindex&quot;], keep=False)) &amp; 
         (df[&#39;SPECIES&#39;] == df[&#39;fish&#39;] )
        ~df.duplicated(subset = [&quot;doc_id&quot;, &quot;sentence_number&quot;, &quot;Valve&quot;, &quot;Fretindex&quot;], keep=False)]

However it is not giving the correct output. What I would like to achieve:

df = &quot;&quot;&quot;
    doc_id      source     sentence_number  Species   Valve   Fretindex
2  33075059515  AAP_news   2                snapper   30x          SAR
3  53075235984  Goa Herald 5                fish      50y          SAR
4  28865465787  Bulletin   4                turtle    50y          SAR
5  28865465787  Bulletin   4                shark     50y          NaN
6  28865465787  Bulletin   3                fish      NaN          SAR
8  87652548931  Bulletin   1                turtle    NaN          NaN

答案1

得分: 0

以下是获取所需结果的一种方法：

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].index)

你可以得到匹配 duplicated() 调用的行的 doc_id，这些行具有列 doc_id、sentence_number、Valve 和 Fretindex 上的重复项。然后，使用最近获得的文档 ID 进行 df 的筛选。最后，检查 Species 列是否包含 fish，并将那些条目删除。

你提到希望保留仅包含 fish 作为物种的重复条目。你可以在上述查询中添加 drop_duplicates 来实现：

df.drop(df[(df["doc_id"].isin(list(df[(df[['doc_id', 'sentence_number', 'Valve', 'Fretindex']].duplicated())].doc_id))) & (df["Species"] == "fish")].drop_duplicates(keep=False).index)

希望这对你有帮助。

英文:

The following is one way to get the desired result:

df.drop(df[(df[&quot;doc_id&quot;].isin(list(df[(df[[&#39;doc_id&#39;, &#39;sentence_number&#39;, &#39;Valve&#39;, &#39;Fretindex&#39;]].duplicated())].doc_id))) &amp; (df[&quot;Species&quot;] == &quot;fish&quot;)].index)

You get the doc_id of the rows that matched duplicated() call for columns doc_id, sentence_number, Valve and Fretindex. Then filter the df with an isin with the doc ids you recently got. Finally, check if the Species column contains fish and drop those entries.

You mentioned that you wanted to keep duplicate entries that only contained fish as their species. You can do that with adding drop_duplicates to the query above:

df.drop(df[(df[&quot;doc_id&quot;].isin(list(df[(df[[&#39;doc_id&#39;, &#39;sentence_number&#39;, &#39;Valve&#39;, &#39;Fretindex&#39;]].duplicated())].doc_id))) &amp; (df[&quot;Species&quot;] == &quot;fish&quot;)].drop_duplicates(keep=False).index)

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

在满足一定条件时删除包含特定字符串值的行（pandas）。

问题

答案1

如何同时填充几列中的缺失数值

使用字典定义 Pandas 数据帧中计算出现次数的条件。

如何向OpenSky（飞行跟踪器）发出请求

替换列中的字符串部分

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论