Remove duplicated values appear in two columns in dataframe

huangapple go评论64阅读模式
英文:

Remove duplicated values appear in two columns in dataframe

问题

I have a table similar to this one:

我有一个类似这样的表格:

As you can see, the table kind of showing relationship between entities -

如您所见,表格在一定程度上显示了实体之间的关系 -

The table is actually showing overlap data, meaning, Hari and Wili for example, have the same document, and I would like to remove one of them not to have duplicated files.

实际上,这个表格显示了重叠的数据,意味着Hari和Wili,例如,拥有相同的文件,我想删除其中一个,以避免重复的文件。

In order to do this, I would like to create a new table that has only one name in the relationship, so I can later create a list of paths to remove.

为了做到这一点,我想创建一个新表,其中关系中只有一个名称,这样我以后可以创建一个要删除的路径列表。

The result table will look like this:

结果表格将如下所示:

The idea is that I'll use the values of "path2" to remove files with this path and will still have the files in path1.

这个想法是我将使用"path2"的值来删除具有此路径的文件,仍然会在路径1中保留文件。

For that reason, this line:

因此,这一行:

4 Lin path/to/lin Dan path/to/dan

is missing, as it will be removed using Miko...

缺失了,因为它将被使用Miko删除...

Any ideas how to do this? Remove duplicated values appear in two columns in dataframe

有没有关于如何做到这一点的想法? Remove duplicated values appear in two columns in dataframe

Edit:

编辑:

I have tried this based on this answer:

我已经尝试过这个,基于这个答案:

df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]

And it's true that I get fewer rows in my dataframe (it has 695, and I got now 402), but I still have the first lines like this:

的确,我在我的数据框中得到了更少的行(原本有695行,现在有402行),但我仍然有第一行像这样的问题:

0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy

meaning I still get the same issue.

这意味着我仍然遇到相同的问题。

英文:

I have table similar to this one:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
2       Hari      path/to/Hari    Wili      path/to/Wili
3       Wili      path/to/Wili    Hari      path/to/Hari
4       Miko      path/to/miko     Lin      path/to/lin
5       Miko      path/to/miko     Dan      path/to/dan
6       Lin       path/to/lin     Miko      path/to/miko
7       Lin       path/to/lin     Dan       path/to/dan
8       Dan       path/to/dan     Miko      path/to/miko
9       Dan       path/to/dan     Lin       path/to/lin
...

As you can see, the table kind of showing relationship between entities -
Roi is with Anne,
Wili with Hari,
Lin with Dan and with Miko.

The table is actually showing overlap data , meaning, Hari and wili for example, have the same document, and I would like to remove one of them not to have duplicated files.
In order to do this, I would like to create new table that has only one name in relationship, so I can later create list of paths to remove.

The result table will look like this :

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy      Anne      path/to/Anne
1       Hari      path/to/Hari     Wili      path/to/Wili
2       Miko      path/to/miko     Lin       path/to/lin
3       Miko      path/to/miko     Dan       path/to/dan

The idea is that I'll use the values of "path2" to remove files with this path, and will still have the files in path1.
for that reason,
this line:

4       Lin       path/to/lin    Dan       path/to/dan

is missing, as it will be removed using Miko...
any ideas how to do this ? Remove duplicated values appear in two columns in dataframe

Edit:

I have tried this based on this answer:

df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]

And it's true that I get less rows in my dataframe (it has 695 and I got now 402) , but, I still have the first lines like this:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
...

meaning I still get the same issue

答案1

得分: 4

可以使用frozenset检测重复项:

out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
           .loc[lambda x: ~x['path2'].isin(x['path1'])])

# OR

out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
           .query('~path1.isin(path2)'))

输出:

>>> out
  name_1         path1 name_2         path2
0    Roy   path/to/Roy   Anne  path/to/Anne
2   Hari  path/to/Hari   Wili  path/to/Wili
5   Miko  path/to/miko    Dan   path/to/dan
7    Lin   path/to/lin    Dan   path/to/dan
英文:

You can use frozenset to detect duplicates:

out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
           .loc[lambda x: ~x['path2'].isin(x['path1'])])

# OR

out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
           .query('~path1.isin(path2)'))

Output:

>>> out
  name_1         path1 name_2         path2
0    Roy   path/to/Roy   Anne  path/to/Anne
2   Hari  path/to/Hari   Wili  path/to/Wili
5   Miko  path/to/miko    Dan   path/to/dan
7    Lin   path/to/lin    Dan   path/to/dan

huangapple
  • 本文由 发表于 2023年5月11日 16:36:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/76225668.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定