Check duplicated indices for each subset of values in pandas dataframe
Question
I have the following dataframe:
import pandas as pd
df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'],
                             ['AP2', 'House1'],
                             ['AP3', 'House2'],
                             ['AP4', 'House2'],
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])
I need to look at each subset of rows that share a value in a column and check whether that subset contains duplicated indices. For example, in column `House` there are three entries of `House1` and no duplicated indices among them. But among the `House2` entries there is one duplicated index, `1`.
I have tried this:
print(f'{df_test.index.duplicated().sum()} repeated entries')
But this gives 3 duplicated entries, since it does not consider each value of the column separately.
Answer 1
Score: 2
A possible solution:
print(df_test.reset_index().duplicated(['index', 'AP']).sum())
print(df_test.reset_index().duplicated(['index', 'House']).sum())
Output:
0
1
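If you want the same check for every column at once, the two calls above can be wrapped in a small helper. This is only a sketch: `count_index_dups` and the temporary column name `_idx` are names I made up for illustration, not part of pandas.

```python
import pandas as pd

df_test = pd.DataFrame(
    data=[['AP1', 'House1'], ['AP1', 'House1'], ['AP2', 'House1'],
          ['AP3', 'House2'], ['AP4', 'House2'], ['AP5', 'House2']],
    columns=['AP', 'House'],
    index=[0, 1, 2, 0, 1, 1],
)


def count_index_dups(df, column):
    """Count rows whose (index, column value) pair already appeared earlier."""
    # Give the index an explicit name (hypothetical: '_idx') so the helper
    # does not depend on the default 'index' label produced by reset_index().
    flat = df.rename_axis('_idx').reset_index()
    return int(flat.duplicated(['_idx', column]).sum())


# One count per column of the frame.
counts = {col: count_index_dups(df_test, col) for col in df_test.columns}
print(counts)
```

For the frame in the question this should print `{'AP': 0, 'House': 1}`, matching the two numbers above.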
Answer 2
Score: 2
You can use:
>>> (df_test.reset_index(names='Dups')
            .groupby('House', as_index=False)['Dups']
            .agg(lambda x: x.duplicated().sum()))
    House  Dups
0  House1     0
1  House2     1
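Two side notes, in case they help. As far as I know, the `names=` argument of `reset_index` is only available from pandas 1.5 on; `rename_axis(...).reset_index()` is one way to get the same column on older versions. And if you need to see which rows are involved rather than just the count, `duplicated(..., keep=False)` flags every member of a duplicated group. A rough sketch, where the column name `idx` and the variable names are my own choices:

```python
import pandas as pd

df_test = pd.DataFrame(
    data=[['AP1', 'House1'], ['AP1', 'House1'], ['AP2', 'House1'],
          ['AP3', 'House2'], ['AP4', 'House2'], ['AP5', 'House2']],
    columns=['AP', 'House'],
    index=[0, 1, 2, 0, 1, 1],
)

# Move the index into a regular column (hypothetical name: 'idx').
flat = df_test.rename_axis('idx').reset_index()

# keep=False marks all rows that share an (idx, House) pair, not only the repeats.
mask = flat.duplicated(['idx', 'House'], keep=False)
offending = flat[mask]
print(offending)
```

Here `offending` should contain the two `House2` rows with index `1` (the `AP4` and `AP5` entries).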