Python drop duplicated pairs only

Question

If I have a dataframe like this:

Time                        X     Y
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
2023-02-01T15:03:02.943522  200   10.1

I would like to remove duplicated PAIRS only, i.e. the first and second pairs of .565333 timestamps are actually distinct, but doing something like t[~t.duplicated()] will remove all the duplicates like this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1

whereas instead I want this:

2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.565333  200   10.1
2023-02-01T15:03:02.634508  200   10.1
2023-02-01T15:03:02.943522  200   10.1
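
A minimal sketch to reproduce the behaviour described above (the frame construction and the variable name t are assumptions, rebuilt from the sample rows):

import pandas as pd

# Rebuild the example frame shown above (assumed: three columns Time, X, Y)
t = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333'] * 4
          + ['2023-02-01T15:03:02.634508'] * 2
          + ['2023-02-01T15:03:02.943522'] * 2,
    'X': [200] * 8,
    'Y': [10.1] * 8,
})

# Plain de-duplication keeps only one row per unique (Time, X, Y) combination,
# which collapses each group of identical rows down to a single row.
print(t[~t.duplicated()])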

Answer 1

Score: 2

First enumerate the duplicates with groupby.cumcount, floor-divide (floordiv) by the number of rows per group to give each pair its own counter, then you can drop_duplicates while keeping the first row of each pair:

N = 2                      # pair size: one row is kept per group of N identical rows
cols = ['Time', 'X', 'Y']  # columns used to identify duplicates

# n is 0 for the first N occurrences of a (Time, X, Y) combination, 1 for the next N, etc.
(df.assign(n=df.groupby(cols).cumcount().floordiv(N))
   .drop_duplicates(subset=cols+['n'])
)

NB. You can use any value of N to handle larger group sizes, e.g. N=3 to work with triplets of rows. Also, cols defines the columns used to identify duplicates; I assumed you want to use all columns, but you can use only a subset of them if needed.

Output:

                         Time    X     Y  n
0  2023-02-01T15:03:02.565333  200  10.1  0
2  2023-02-01T15:03:02.565333  200  10.1  1  # second pair
4  2023-02-01T15:03:02.634508  200  10.1  0
6  2023-02-01T15:03:02.943522  200  10.1  0

Intermediate before drop_duplicates:

                         Time    X     Y  n
0  2023-02-01T15:03:02.565333  200  10.1  0
1  2023-02-01T15:03:02.565333  200  10.1  0  # duplicated
2  2023-02-01T15:03:02.565333  200  10.1  1
3  2023-02-01T15:03:02.565333  200  10.1  1  # duplicated
4  2023-02-01T15:03:02.634508  200  10.1  0
5  2023-02-01T15:03:02.634508  200  10.1  0  # duplicated
6  2023-02-01T15:03:02.943522  200  10.1  0
7  2023-02-01T15:03:02.943522  200  10.1  0  # duplicated
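
If you don't want the helper column n in the final result, you can drop it afterwards. A minimal end-to-end sketch (the frame construction is an assumption, rebuilt from the sample data in the question):

import pandas as pd

# Rebuild the sample frame from the question (assumed layout)
df = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333'] * 4
          + ['2023-02-01T15:03:02.634508'] * 2
          + ['2023-02-01T15:03:02.943522'] * 2,
    'X': [200] * 8,
    'Y': [10.1] * 8,
})

N = 2
cols = ['Time', 'X', 'Y']
out = (df.assign(n=df.groupby(cols).cumcount().floordiv(N))
         .drop_duplicates(subset=cols + ['n'])
         .drop(columns='n'))   # remove the helper column
print(out)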

Answer 2

Score: 0

You should use the duplicated method with the subset parameter to specify the columns to consider when identifying duplicates. In this case, you want to consider duplicates only based on the combination of Time, X and Y values. Maybe this helps:

import pandas as pd

# Create example dataframe
df = pd.DataFrame({
    'Time': ['2023-02-01T15:03:02.565333', '2023-02-01T15:03:02.565333',
             '2023-02-01T15:03:02.565333', '2023-02-01T15:03:02.565333',
             '2023-02-01T15:03:02.634508', '2023-02-01T15:03:02.634508',
             '2023-02-01T15:03:02.943522', '2023-02-01T15:03:02.943522'],
    'X': [200, 200, 200, 200, 200, 200, 200, 200],
    'Y': [10.1, 10.1, 10.1, 10.1, 10.1, 10.1, 10.1, 10.1]
})

# Identify duplicates based on Time and (X,Y) pairs
duplicates = df.duplicated(subset=['Time', 'X', 'Y'])

# Invert boolean mask to select non-duplicates
non_duplicates = ~duplicates

# Print non-duplicates
print(df[non_duplicates])
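
For reference, this prints only the first occurrence of each unique (Time, X, Y) combination, so with the sample data above the expected output would be:

                         Time    X     Y
0  2023-02-01T15:03:02.565333  200  10.1
4  2023-02-01T15:03:02.634508  200  10.1
6  2023-02-01T15:03:02.943522  200  10.1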
