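Question

I'm preprocessing my dataset and found that its index contains some duplicated labels.
I tried to solve this with .drop_duplicates(), but it did not work.
I then switched to .duplicated(), which works, but I don't understand the difference between the two.
The figure below is a simple example.

[Figure: a simple example]

When I print df.index.drop_duplicates(keep="first"), the result is correct, but after passing it to .loc the duplicated index labels are still there.
Could anyone explain the difference here? Thanks a lot!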

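Answer 1

Score: 1

With .loc you must use df.index.duplicated() rather than df.index.drop_duplicates(). The latter returns an Index of the unique labels (['a', 'b', 'c'] in the sample below), so .loc returns every row whose label appears in it, duplicates included. In fact, df.index.drop_duplicates() behaves the same as df.index.unique() here. df.index.duplicated() instead returns a boolean array with one entry per row, so negating it keeps only the first row for each label:

# Sample (slightly modified)
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1,2,3,4,0], 'B': [5,6,7,8,0], 'C': [9,10,11,12,0]},
...                   index=['a', 'a', 'b', 'b', 'c'])

# Same as df.index.unique()
>>> df.index.drop_duplicates()
Index(['a', 'b', 'c'], dtype='object')

# Same as df.loc[df.index.unique()]: label lookup, so every matching row comes back
>>> df.loc[df.index.drop_duplicates(keep='first')]
   A  B   C
a  1  5   9
a  2  6  10
b  3  7  11
b  4  8  12
c  0  0   0

# Boolean mask, one entry per row: only the first row of each label is kept
>>> df.loc[~df.index.duplicated(keep='first')]
   A  B   C
a  1  5   9
b  3  7  11
c  0  0   0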
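
To make the difference easier to inspect, here is a minimal sketch that looks at what each call returns before it reaches .loc, and tries the other keep options of duplicated(). The frame copies the sample above; the variable names (mask, labels, first_rows, last_rows, unique_only) are only illustrative and do not come from the original post.

```python
import pandas as pd

# Same sample frame as in the answer above.
df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 0], "B": [5, 6, 7, 8, 0], "C": [9, 10, 11, 12, 0]},
    index=["a", "a", "b", "b", "c"],
)

# duplicated() gives one boolean per row, so it can tell the first "a"
# apart from the second one.
mask = df.index.duplicated(keep="first")
print(mask.tolist())    # [False, True, False, True, False]

# drop_duplicates() gives labels; .loc then fetches every row that
# carries one of those labels, so the repeated rows come back.
labels = df.index.drop_duplicates(keep="first")
print(list(labels))     # ['a', 'b', 'c']
print(len(df.loc[labels]))  # 5

first_rows = df.loc[~mask]                                 # first row per label
last_rows = df.loc[~df.index.duplicated(keep="last")]      # last row per label
unique_only = df.loc[~df.index.duplicated(keep=False)]     # labels that never repeat

print(first_rows["A"].tolist())   # [1, 3, 0]
print(last_rows["A"].tolist())    # [2, 4, 0]
print(unique_only["A"].tolist())  # [0]
```

Which keep value is appropriate depends on which of the repeated rows should survive; keep=False is the strictest choice, since it discards every row whose label occurs more than once.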
