Deleting duplicated index labels in a DataFrame

Question

I'm preprocessing my dataset and found that it contains some duplicated index labels.
I tried to solve this with .drop_duplicates(), but it failed.
Then I turned to .duplicated() and it worked, but I don't understand the difference between them.
The figure shows a simple example.

[Figure: a simple example]

When I print(df.index.drop_duplicates(keep="first")), the result is correct, but after passing it to .loc the duplicated index labels still exist.
Could anyone please explain the difference here? Thanks a lot!

Answer 1

Score: 1

With .loc you have to use df.index.duplicated() rather than df.index.drop_duplicates(). The latter returns the unique index labels (['a', 'b', 'c'] in the sample below), and .loc then does a label-based lookup that returns every row whose index matches one of those labels, so the duplicated rows come back. df.index.duplicated() instead returns a boolean mask with one entry per row, so negating it keeps only the first row for each label. In fact, df.index.drop_duplicates() behaves the same as df.index.unique() here:

# Sample (slightly modified)
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1,2,3,4,0], 'B': [5,6,7,8,0], 'C': [9,10,11,12,0]},
                      index=['a', 'a', 'b', 'b', 'c'])

# Same as df.index.unique()
>>> df.index.drop_duplicates()
Index(['a', 'b', 'c'], dtype='object')

# Same as df.loc[df.index.unique()]
>>> df.loc[df.index.drop_duplicates(keep='first')]
   A  B   C
a  1  5   9
a  2  6  10
b  3  7  11
b  4  8  12
c  0  0   0

# Boolean mask: keeps only the first row for each index label
>>> df.loc[~df.index.duplicated(keep='first')]
   A  B   C
a  1  5   9
b  3  7  11
c  0  0   0
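
To make the difference easier to see, here is a small sketch that prints the intermediate objects instead of only the final frames. It reuses the sample DataFrame from above; the variable names (labels, by_label, mask, deduped) are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 0], "B": [5, 6, 7, 8, 0], "C": [9, 10, 11, 12, 0]},
    index=["a", "a", "b", "b", "c"],
)

# drop_duplicates() on the index yields unique labels, not row positions.
labels = df.index.drop_duplicates(keep="first")
print(list(labels))  # ['a', 'b', 'c']

# Label-based lookup: every row carrying one of these labels is returned,
# so the duplicated rows are still there.
by_label = df.loc[labels]
print(len(by_label))  # 5

# duplicated() yields one boolean per row (True marks a repeated label).
mask = df.index.duplicated(keep="first")
print(mask.tolist())  # [False, True, False, True, False]

# Negating the mask selects only the first row for each label.
deduped = df.loc[~mask]
print(deduped.index.tolist())  # ['a', 'b', 'c']
print(deduped["A"].tolist())  # [1, 3, 0]
```

The point is that labels has 3 entries while mask has 5, one per row, so only the mask can tell the first 'a' row apart from the second.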

huangapple
  • Posted on June 8, 2023, 11:57:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76428522.html