检查 Pandas 数据框中每个数值子集的重复索引。

huangapple go评论53阅读模式
英文:

Check duplicated indices for each subset of values in pandas dataframe

问题

我有以下的数据框:

import pandas as pd

df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'], 
                             ['AP2', 'House1'], 
                             ['AP3', 'House2'], 
                             ['AP4','House2'], 
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])

我需要检查每列值的子集,看看是否有重复的索引。例如,在House列中,我们有三个House1的条目,并且没有重复的索引。但对于House2的条目,我们有一个重复的索引1

我尝试了这个:

print(f'{df_test.index.duplicated().sum()} 个重复的条目')

但这会返回3个重复的条目,因为它没有单独考虑每列的值。

英文:

I have the following dataframe:

import pandas as pd

df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'], 
                             ['AP2', 'House1'], 
                             ['AP3', 'House2'], 
                             ['AP4','House2'], 
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])

I need to check at each subset of values of a column and see if there are duplicated indices. For example, in column House, we have three entries of House1 and no duplicated indices. But for entry House2 we have one duplicated index 1.

I have tried this:

print(f'{df_test.index.duplicated().sum()} repeated entries')

But this gives 3 duplicated entries, since it does not consider each value of the column separately.

答案1

得分: 2

一个可能的解决方案:

print(df_test.reset_index().duplicated(['index', 'AP']).sum())
print(df_test.reset_index().duplicated(['index', 'House']).sum())

输出:

0
1
英文:

A possible solution:

print(df_test.reset_index().duplicated(['index', 'AP']).sum())
print(df_test.reset_index().duplicated(['index', 'House']).sum())

Output:

0
1

答案2

得分: 2

你可以使用以下代码:

>>> (df_test.reset_index(names='Dups')
            .groupby('House', as_index=False)['Dups']
            .agg(lambda x: x.duplicated().sum()))

    House  Dups
0  House1     0
1  House2     1
英文:

You can use:

>>> (df_test.reset_index(names='Dups')
            .groupby('House', as_index=False)['Dups']
            .agg(lambda x: x.duplicated().sum()))

    House  Dups
0  House1     0
1  House2     1

huangapple
  • 本文由 发表于 2023年4月4日 05:31:42
  • 转载请务必保留本文链接:https://go.coder-hub.com/75923928.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定