Question about deleting duplicated indexes in a DataFrame
Question
I'm preprocessing my dataset and found that it contains some duplicated index labels.
I tried to solve this with .drop_duplicates(), but it failed.
Then I turned to .duplicated() and it worked, but I don't understand the difference between them.
The figure shows a simple example.
print(df.index.drop_duplicates(keep="first")) gives the correct result, but after passing it to .loc the duplicated index labels are still there.
Could anyone explain the difference? Thanks a lot!
Answer 1
Score: 1
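With .loc you must use df.index.duplicated() rather than df.index.drop_duplicates(). The latter returns an Index of the unique labels (['a', 'b', 'c'] in the sample below), so .loc returns every row whose label appears in it, duplicates included. df.index.duplicated() instead returns a boolean array with one entry per row, so negating it keeps only the first row for each label. In fact, df.index.drop_duplicates() behaves the same as df.index.unique() here: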
# Sample (slightly modified)
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1,2,3,4,0], 'B': [5,6,7,8,0], 'C': [9,10,11,12,0]},
...                   index=['a', 'a', 'b', 'b', 'c'])

# Same as df.index.unique()
>>> df.index.drop_duplicates()
Index(['a', 'b', 'c'], dtype='object')

# Same as df.loc[df.index.unique()]
>>> df.loc[df.index.drop_duplicates(keep='first')]
   A  B   C
a  1  5   9
a  2  6  10
b  3  7  11
b  4  8  12
c  0  0   0

>>> df.loc[~df.index.duplicated(keep='first')]
   A  B   C
a  1  5   9
b  3  7  11
c  0  0   0
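
To make the difference concrete, here is a minimal sketch that reuses the sample frame above and inspects what each method hands to .loc. The variable names (labels, mask, by_label, by_mask) are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 0], "B": [5, 6, 7, 8, 0], "C": [9, 10, 11, 12, 0]},
    index=["a", "a", "b", "b", "c"],
)

# drop_duplicates() returns index *labels*. .loc looks each label up
# and returns every row that carries it, so the duplicates come back.
labels = df.index.drop_duplicates(keep="first")
by_label = df.loc[labels]

# duplicated() returns a boolean array with one entry per row.
# Negating it keeps only the first row seen for each label.
mask = ~df.index.duplicated(keep="first")
by_mask = df.loc[mask]

print(list(labels))                 # the unique labels
print(mask.tolist())                # one True/False per row
print(len(by_label), len(by_mask))  # row counts of the two selections
```

The label-based selection still has five rows, while the mask-based one has three. Passing keep='last' to duplicated() would keep the last row for each label instead of the first.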