Python drop duplicated pairs only

Question

If I have a dataframe like this:

Time                        X     Y
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
2023-02-01T15:03:02.943522  200   10.1

I would like to remove duplicated PAIRS only, i.e. the first and second pairs of .565333 timestamps are actually distinct, but doing something like t[~t.duplicated()] will remove all the duplicates like this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1

whereas instead I want this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
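
A minimal sketch to reproduce the behaviour described above (the frame construction and the variable name t are assumptions, rebuilt from the sample rows):

import pandas as pd

# Rebuild the example frame shown above (assumed: three columns Time, X, Y)
t = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333'] * 4
          + ['2023-02-01T15:03:02.634508'] * 2
          + ['2023-02-01T15:03:02.943522'] * 2,
    'X': [200] * 8,
    'Y': [10.1] * 8,
})

# Plain de-duplication keeps only one row per unique (Time, X, Y) combination,
# which collapses each group of identical rows down to a single row.
print(t[~t.duplicated()])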

Answer 1

Score: 2

First enumerate the duplicates with groupby.cumcount, floor-divide (floordiv) by the number of rows per group to give each pair its own counter, then you can drop_duplicates while keeping the first row of each pair:

N = 2                      # pair size: one row is kept per group of N identical rows
cols = ['Time', 'X', 'Y']  # columns used to identify duplicates

# n is 0 for the first N occurrences of a (Time, X, Y) combination, 1 for the next N, etc.
(df.assign(n=df.groupby(cols).cumcount().floordiv(N))
   .drop_duplicates(subset=cols+['n'])
)

NB. You can use any value of N to handle larger group sizes, e.g. N=3 to work with triplets of rows. Also, cols defines the columns used to identify duplicates; I assumed you want to use all columns, but you can use only a subset of them if needed.

Output:

                         Time    X     Y  n
0  2023-02-01T15:03:02.565333  200  10.1  0
2  2023-02-01T15:03:02.565333  200  10.1  1  # second pair
4  2023-02-01T15:03:02.634508  200  10.1  0
6  2023-02-01T15:03:02.943522  200  10.1  0

Intermediate before drop_duplicates:

                         Time    X     Y  n
0  2023-02-01T15:03:02.565333  200  10.1  0
1  2023-02-01T15:03:02.565333  200  10.1  0  # duplicated
2  2023-02-01T15:03:02.565333  200  10.1  1
3  2023-02-01T15:03:02.565333  200  10.1  1  # duplicated
4  2023-02-01T15:03:02.634508  200  10.1  0
5  2023-02-01T15:03:02.634508  200  10.1  0  # duplicated
6  2023-02-01T15:03:02.943522  200  10.1  0
7  2023-02-01T15:03:02.943522  200  10.1  0  # duplicated
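
If you don't want the helper column n in the final result, you can drop it afterwards. A minimal end-to-end sketch (the frame construction is an assumption, rebuilt from the sample data in the question):

import pandas as pd

# Rebuild the sample frame from the question (assumed layout)
df = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333'] * 4
          + ['2023-02-01T15:03:02.634508'] * 2
          + ['2023-02-01T15:03:02.943522'] * 2,
    'X': [200] * 8,
    'Y': [10.1] * 8,
})

N = 2
cols = ['Time', 'X', 'Y']
out = (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
         .drop_duplicates(subset=cols + ['n'])
         .drop(columns='n'))   # remove the helper column
print(out)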

Answer 2

Score: 0

You should use the duplicated method with the subset parameter to specify the columns to consider when identifying duplicates. In this case, you want to consider duplicates only based on the combination of Time, X and Y values. Maybe this helps:

import pandas as pd

# Create example dataframe
df = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333', '2023-02-01T15:03:02.565333',
             '2023-02-01T15:03:02.565333', '2023-02-01T15:03:02.565333',
             '2023-02-01T15:03:02.634508', '2023-02-01T15:03:02.634508',
             '2023-02-01T15:03:02.943522', '2023-02-01T15:03:02.943522'],
    'X': [200, 200, 200, 200, 200, 200, 200, 200],
    'Y': [10.1, 10.1, 10.1, 10.1, 10.1, 10.1, 10.1, 10.1]
})

# Identify duplicates based on Time and (X,Y) pairs
duplicates = df.duplicated(subset=['Time', 'X', 'Y'])

# Invert boolean mask to select non-duplicates
non_duplicates = ~duplicates

# Print non-duplicates
print(df[non_duplicates])
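
For reference, this prints only the first occurrence of each unique (Time, X, Y) combination, so with the sample data above the expected output would be:

                         Time    X     Y
0  2023-02-01T15:03:02.565333  200  10.1
4  2023-02-01T15:03:02.634508  200  10.1
6  2023-02-01T15:03:02.943522  200  10.1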
