June 26, 2023, 05:25:30

How to drop duplicated rows based on pattern change in dataframe?

Question

Imagine I have a DataFrame like the one below, with percentages by class on specific dates for unique IDs:

    import pandas as pd
    df = pd.DataFrame({"ID":["A","A","A","A","A","A","A"],
                       "DATE":["01-1990","03-1990","04-1990","05-1990","06-1990","07-1990",
                               "08-1990"],
                       "CLASS A":[30,30,0,0,0,30,30],
                       "CLASS B":[50,50,50,50,50,50,50],
                       "CLASS C":[20,20,50,50,50,20,20]})
    df
    Out[4]: 
      ID     DATE  CLASS A  CLASS B  CLASS C
    0  A  01-1990       30       50       20
    1  A  03-1990       30       50       20
    2  A  04-1990        0       50       50
    3  A  05-1990        0       50       50
    4  A  06-1990        0       50       50
    5  A  07-1990       30       50       20
    6  A  08-1990       30       50       20   
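I would like to drop duplicated rows based on ID, CLASS A, CLASS B and CLASS C (keeping the first one), but only within each consecutive run, that is, until the values change to another percentage pattern. In this example, there are 2 pattern changes (30/50/20 to 0/50/50, and then back to 30/50/20). The result should look like this:

      ID     DATE  CLASS A  CLASS B  CLASS C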
    0  A  01-1990       30       50       20
    2  A  04-1990        0       50       50
    5  A  07-1990       30       50       20

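I know how to remove duplicated rows across the whole DataFrame (df.drop_duplicates), but that does not work directly here, because it would also remove the rows at index 5 and 6. Can anyone help?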


Answer 1

Score: 3

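In your case, I wouldn't use drop_duplicates. Instead, I would use the shift method to compare each row with the one before it and keep only the rows that differ.

Something like: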

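    # Compare only the columns that define a pattern
    compare_df = df[["ID", "CLASS A", "CLASS B", "CLASS C"]]
    # True where every compared column equals the row directly above
    row_is_like_previous_one = (compare_df == compare_df.shift(1)).all(axis=1)
    # Keep only the rows that differ from their predecessor
    result = df[~row_is_like_previous_one]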
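
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same shift-based comparison applied to the sample data from the question. The helper name `drop_consecutive_duplicates` and the variable `key_columns` are my own and are not part of pandas. The last line shows, for contrast, what drop_duplicates does with the same columns.

```python
import pandas as pd


def drop_consecutive_duplicates(frame, columns):
    """Keep the first row of each consecutive run of identical values in `columns`.

    Hypothetical helper name for illustration; it is not a pandas function.
    """
    compare = frame[columns]
    # A row is a repeat when every compared column equals the row directly above.
    # The first row is compared against missing values, so it is never flagged.
    is_repeat = (compare == compare.shift(1)).all(axis=1)
    return frame[~is_repeat]


df = pd.DataFrame({
    "ID": ["A"] * 7,
    "DATE": ["01-1990", "03-1990", "04-1990", "05-1990",
             "06-1990", "07-1990", "08-1990"],
    "CLASS A": [30, 30, 0, 0, 0, 30, 30],
    "CLASS B": [50] * 7,
    "CLASS C": [20, 20, 50, 50, 50, 20, 20],
})

key_columns = ["ID", "CLASS A", "CLASS B", "CLASS C"]
result = drop_consecutive_duplicates(df, key_columns)
print(result.index.tolist())  # [0, 2, 5]

# For contrast: drop_duplicates ignores row order, so rows 5 and 6 are dropped too.
print(df.drop_duplicates(subset=key_columns).index.tolist())  # [0, 2]
```

Because ID is one of the compared columns, a change of ID also counts as a pattern change. One caveat to keep in mind: with `==`, a missing value never equals another missing value, so consecutive rows containing NaN in the compared columns would all be kept.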

