Check duplicated indices for each subset of values in pandas dataframe
Question
I have the following dataframe:
import pandas as pd
df_test = pd.DataFrame(data=[['AP1', 'House1'],
                             ['AP1', 'House1'],
                             ['AP2', 'House1'],
                             ['AP3', 'House2'],
                             ['AP4', 'House2'],
                             ['AP5', 'House2']],
                       columns=['AP', 'House'],
                       index=[0, 1, 2, 0, 1, 1])
For each distinct value in a column, I need to check whether the rows holding that value contain duplicated indices. For example, in column House, the three House1 entries have indices 0, 1 and 2, so there are no duplicates. The three House2 entries have indices 0, 1 and 1, so index 1 is duplicated once.
I have tried this:
print(f'{df_test.index.duplicated().sum()} repeated entries')
But this gives 3 duplicated entries, since it does not consider each value of the column separately.
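To see why the global check reports 3, it can help to look at the boolean mask next to the index values grouped by House. This is only a minimal sketch: the names `global_mask` and `index_by_house` are illustrative, and it assumes the index is unnamed, so that `reset_index()` produces a column called `index`.

```python
import pandas as pd

df_test = pd.DataFrame(
    data=[['AP1', 'House1'], ['AP1', 'House1'], ['AP2', 'House1'],
          ['AP3', 'House2'], ['AP4', 'House2'], ['AP5', 'House2']],
    columns=['AP', 'House'],
    index=[0, 1, 2, 0, 1, 1],
)

# Global check: every repeat of an index value counts, regardless of House.
global_mask = df_test.index.duplicated()
print(global_mask.tolist())  # [False, False, False, True, True, True]

# Index values collected per House value (illustrative helper variable).
index_by_house = df_test.reset_index().groupby('House')['index'].apply(list)
print(index_by_house.to_dict())  # {'House1': [0, 1, 2], 'House2': [0, 1, 1]}
```

The last three rows reuse indices already seen in the House1 rows, which is why the global count is 3 even though only one index repeats within House2.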
Answer 1
Score: 2
A possible solution:
print(df_test.reset_index().duplicated(['index', 'AP']).sum())
print(df_test.reset_index().duplicated(['index', 'House']).sum())
Output:
0
1
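The same idea could be wrapped in a small helper and extended to show which rows are involved. The sketch below is an assumption-laden illustration, not part of the original answer: `count_index_dups` is a hypothetical name, and the code assumes an unnamed index so that `reset_index()` yields an `index` column.

```python
import pandas as pd

df_test = pd.DataFrame(
    data=[['AP1', 'House1'], ['AP1', 'House1'], ['AP2', 'House1'],
          ['AP3', 'House2'], ['AP4', 'House2'], ['AP5', 'House2']],
    columns=['AP', 'House'],
    index=[0, 1, 2, 0, 1, 1],
)

def count_index_dups(df, column):
    """Count rows whose (index, column value) pair already appeared earlier."""
    # Hypothetical helper; assumes reset_index() names the new column 'index'.
    return int(df.reset_index().duplicated(['index', column]).sum())

print(count_index_dups(df_test, 'AP'))     # 0
print(count_index_dups(df_test, 'House'))  # 1

# keep=False flags every member of a duplicated pair, not only the repeats,
# which makes it easier to inspect the offending rows.
flat = df_test.reset_index()
offending = flat[flat.duplicated(['index', 'House'], keep=False)]
print(offending)
```

Here `offending` contains the two House2 rows that share index 1 (AP4 and AP5).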
Answer 2
Score: 2
You can use:
>>> (df_test.reset_index(names='Dups')
...         .groupby('House', as_index=False)['Dups']
...         .agg(lambda x: x.duplicated().sum()))
    House  Dups
0  House1     0
1  House2     1
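As far as I know, the `names` argument of `reset_index` was added in pandas 1.5, so on older versions the call above may fail. One possible workaround, sketched here under that assumption, is to name the index with `rename_axis` first. The variable names `dups_per_house` and `houses_with_dups` are illustrative, and this variant returns a Series keyed by House rather than a two-column frame.

```python
import pandas as pd

df_test = pd.DataFrame(
    data=[['AP1', 'House1'], ['AP1', 'House1'], ['AP2', 'House1'],
          ['AP3', 'House2'], ['AP4', 'House2'], ['AP5', 'House2']],
    columns=['AP', 'House'],
    index=[0, 1, 2, 0, 1, 1],
)

# Name the index 'Dups' first, then move it into a regular column.
dups_per_house = (df_test.rename_axis('Dups')
                         .reset_index()
                         .groupby('House')['Dups']
                         .agg(lambda s: s.duplicated().sum()))
print(dups_per_house.to_dict())  # {'House1': 0, 'House2': 1}

# House values that contain at least one repeated index.
houses_with_dups = dups_per_house[dups_per_house > 0].index.tolist()
print(houses_with_dups)  # ['House2']
```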