Drop duplicates in a subset of columns within each row, keeping only the first copy

Question

I have the following pandas DataFrame (the real one has over 7 million rows):

import numpy as np
import pandas as pd

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.33, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [404.29, 75.33, np.nan],
        'ubp': [404.29, 75.33, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

df = pd.DataFrame(data)

If a number appears more than once within a row in any of the columns x3, x4, x5, x6, x7, v, y, ay, by, cy, gy, uap, ubp, I want to delete the duplicates and keep only one copy: either the one in column x6 or the one in the first column in which the value appears.

In most rows that contain copies, the first copy appears in column x6.

The output should look like this:

data = {'date': ['2023-02-22', '2023-02-21', '2023-02-23'],
        'x1': ['descx1a', 'descx1b', 'descx1c'],
        'x2': ['ALSFNHF950', 'KLUGUIF615', np.nan],
        'x3': [np.nan, np.nan, 24319.4],
        'x4': [np.nan, np.nan, 24334.15],
        'x5': [np.nan, np.nan, 24040.11],
        'x6': [np.nan, 75.33, 24220.34],
        'x7': [np.nan, np.nan, np.nan],
        'v': [np.nan, np.nan, np.nan],
        'y': [404.29, np.nan, np.nan],
        'ay': [np.nan, np.nan, np.nan],
        'by': [np.nan, np.nan, np.nan],
        'cy': [np.nan, np.nan, np.nan],
        'gy': [np.nan, np.nan, np.nan],
        'uap': [np.nan, np.nan, np.nan],
        'ubp': [np.nan, np.nan, np.nan],
        'sf': [np.nan, 2.0, np.nan]}

So far I have only come up with:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']

df[check] = df[check].where(~df[check].duplicated(), np.nan)

But it's wrong.

Is there a way to get this done?

Answer 1

Score: 2

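Try this:

check = ['x3', 'x4', 'x5', 'x6', 'x7', 'v', 'y', 'ay', 'by', 'cy', 'gy', 'uap', 'ubp']
# For each row, flag every value that repeats an earlier value in that row,
# then replace the flagged cells with NaN.
df.loc[:, check] = df.loc[:, check].mask(df.loc[:, check].apply(pd.Series.duplicated, axis=1))
print(df)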
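
The question also mentions keeping the copy in x6 when there is one, whereas `Series.duplicated` keeps whichever copy comes first in column order. One possible way to prefer x6, sketched on a small made-up frame (the `demo` data and the `priority` list are illustrative, not from the question), is to compute the duplicate flags with x6 moved to the front:

```python
import numpy as np
import pandas as pd

# Hypothetical demo data: in row 0 the value 5.0 appears in x3, x6 and uap.
demo = pd.DataFrame({
    'x3': [5.0, np.nan],
    'x6': [5.0, 75.33],
    'uap': [5.0, 75.33],
})
check = ['x3', 'x6', 'uap']

# Put x6 first so that it counts as the "first" occurrence in every row.
priority = ['x6'] + [c for c in check if c != 'x6']

# Series.duplicated flags every repeat after the first, in the order given.
dup = demo[priority].apply(pd.Series.duplicated, axis=1)

# Restore the original column order before blanking the flagged cells.
demo[check] = demo[check].mask(dup[check])
print(demo)
```

Here the copy in x6 survives even though x3 comes earlier in the frame; values that never appear in x6 still keep their leftmost copy.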

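With over 7 million rows, `apply(..., axis=1)` runs a Python-level call per row and may be slow. A rough NumPy-only sketch of the same idea follows. It assumes every checked column is numeric, and the helper name `mask_row_duplicates` and the small `demo` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_row_duplicates(df, cols):
    """Return a copy of df where, within each row, repeats of a value
    across `cols` are set to NaN and only the leftmost copy is kept."""
    vals = df[cols].to_numpy(dtype=float)
    # Stable sort per row: equal values keep their left-to-right order,
    # so the leftmost copy comes first within each run of equal values.
    order = np.argsort(vals, axis=1, kind="stable")
    sorted_vals = np.take_along_axis(vals, order, axis=1)
    # In sorted order, a value is a repeat if it equals its left neighbour.
    # NaN never compares equal, so missing cells are never flagged.
    dup_sorted = np.zeros_like(sorted_vals, dtype=bool)
    dup_sorted[:, 1:] = sorted_vals[:, 1:] == sorted_vals[:, :-1]
    # Scatter the flags back to the original column positions.
    dup = np.zeros_like(dup_sorted)
    np.put_along_axis(dup, order, dup_sorted, axis=1)
    out = df.copy()
    out[cols] = np.where(dup, np.nan, vals)
    return out

# Hypothetical demo data, a cut-down version of the frame in the question.
demo = pd.DataFrame({
    'x6': [np.nan, 75.33, 24220.34],
    'y': [404.29, np.nan, np.nan],
    'uap': [404.29, 75.33, np.nan],
    'ubp': [404.29, 75.33, np.nan],
    'sf': [np.nan, 2.0, np.nan],
})
cols = ['x6', 'y', 'uap', 'ubp']
result = mask_row_duplicates(demo, cols)
print(result)
```

The stable sort is what preserves the "keep the first copy" rule. Columns outside `cols`, such as `sf` here, are left untouched.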
huangapple
  • Posted on February 24, 2023, 06:02:06
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75550770.html