February 24, 2023, 06:02:06

Drop duplicates in a subset of columns per row, rowwise, only keeping the first copy, rowwise

Question

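Note that pandas' drop_duplicates method does not do what is asked here. It compares whole rows, not the values inside a single row:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']
# Drop every ROW whose values in the 'check' columns repeat an earlier row; the first such row is kept.
df.drop_duplicates(subset=check, keep='first', inplace=True)
# Renumber the index after rows have been dropped.
df.reset_index(drop=True, inplace=True)

This removes entire rows that match an earlier row across the columns x3, x4, x5, x6, x7, v, y, ay, by, cy, gy, uap and ubp, and then makes the index consecutive again. Repeated values within one row are left untouched, so it does not solve the problem described below.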

I have the following pandas DataFrame, which has over 7 million rows:

import numpy as np
import pandas as pd

# Three sample rows; the real DataFrame has the same columns.
data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.33, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}
df = pd.DataFrame(data)

If a number appears more than once within a row in any of the columns x3, x4, x5, x6, x7, v, y, ay, by, cy, gy, uap, ubp, I want to delete the duplicates and keep only one copy: either the one in column x6 or the one in the first column in which the number appears.

In most rows that contain copies, the first copy appears in column x6.

The output should look like this:

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.33, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, np.nan, np.nan],
        'ubp': [np.nan, np.nan, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

So far I have only come up with this:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']
df[check] = df[check].where(~df[check].duplicated(), np.nan)

But it's wrong.

Is there a way to get this done?
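A small sketch of why the attempt above has no effect may help here. It uses a hypothetical three-column frame (`small`, not the asker's data) to contrast `DataFrame.duplicated`, which returns one flag per row, with `Series.duplicated` applied along `axis=1`, which returns one flag per cell:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the problem: each row repeats a value across
# columns, but the two rows are different from each other.
small = pd.DataFrame({"x6": [np.nan, 75.33],
                      "y": [404.29, np.nan],
                      "uap": [404.29, 75.33]})

# DataFrame.duplicated compares whole rows: one boolean per row.
row_flags = small.duplicated()
print(row_flags.tolist())

# Series.duplicated applied to each row compares the values inside that row:
# one boolean per cell, True for every repeat after the first occurrence.
cell_flags = small.apply(pd.Series.duplicated, axis=1)
print(cell_flags)
```

Since no row repeats an earlier row, the row-level flags are all False and the `where` call in the attempt keeps every value. The cell-level flags are what the problem actually needs.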

Answer 1

Score: 2

Try this:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']
df.loc[:, check] = df.loc[:, check].mask(df.loc[:, check].apply(pd.Series.duplicated, axis=1))
print(df)
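With over 7 million rows, calling `pd.Series.duplicated` once per row may be slow. One possible alternative, sketched here under assumptions, loops over the handful of columns instead and compares each column against the columns to its left with NumPy. The helper name `mask_row_duplicates` is hypothetical, the sample frame is a reduced version of the question's data, and the sketch assumes all checked columns are numeric and that duplicates are exactly equal floats:

```python
import numpy as np
import pandas as pd

def mask_row_duplicates(frame, columns):
    """Return a copy of `frame` where repeats inside each row become NaN.

    Hypothetical helper, not part of pandas. Within every row the first
    occurrence, in the order given by `columns`, is kept.
    """
    values = frame[columns].to_numpy(dtype=float, copy=True)
    is_dup = np.zeros(values.shape, dtype=bool)
    # Loop over the few columns; each comparison covers all rows at once.
    for j in range(1, values.shape[1]):
        # NaN == NaN is False, so missing values are never flagged
        # (they are already NaN, so the result is unaffected).
        is_dup[:, j] = (values[:, :j] == values[:, [j]]).any(axis=1)
    values[is_dup] = np.nan
    out = frame.copy()
    out[columns] = values
    return out

# Reduced version of the question's sample data.
df = pd.DataFrame({
    "x1": ["descx1a", "descx1b", "descx1c"],
    "x6": [np.nan, 75.33, 24220.34],
    "y": [404.29, np.nan, np.nan],
    "uap": [404.29, 75.33, np.nan],
    "ubp": [404.29, 75.33, np.nan],
})
check = ["x6", "y", "uap", "ubp"]

fast = mask_row_duplicates(df, check)

# The row-wise approach from the answer, for comparison.
slow = df.copy()
slow[check] = df[check].mask(df[check].apply(pd.Series.duplicated, axis=1))

print(fast[check].equals(slow[check]))
```

The order of `columns` decides which copy survives, so listing `x6` first would always prefer the copy in `x6` when it exists. Whether this is actually faster on the full data has not been measured here.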
