Remove duplicated values appear in two columns in dataframe

huangapple go评论92阅读模式
英文:

Remove duplicated values appear in two columns in dataframe

问题

I have a table similar to this one:

我有一个类似这样的表格:

As you can see, the table kind of showing relationship between entities -

如您所见,表格在一定程度上显示了实体之间的关系 -

The table is actually showing overlap data, meaning, Hari and Wili for example, have the same document, and I would like to remove one of them not to have duplicated files.

实际上,这个表格显示了重叠的数据,意味着Hari和Wili,例如,拥有相同的文件,我想删除其中一个,以避免重复的文件。

In order to do this, I would like to create a new table that has only one name in the relationship, so I can later create a list of paths to remove.

为了做到这一点,我想创建一个新表,其中关系中只有一个名称,这样我以后可以创建一个要删除的路径列表。

The result table will look like this:

结果表格将如下所示:

The idea is that I'll use the values of "path2" to remove files with this path and will still have the files in path1.

这个想法是我将使用"path2"的值来删除具有此路径的文件,仍然会在路径1中保留文件。

For that reason, this line:

因此,这一行:

4 Lin path/to/lin Dan path/to/dan

is missing, as it will be removed using Miko...

缺失了,因为它将被使用Miko删除...

Any ideas how to do this? Remove duplicated values appear in two columns in dataframe

有没有关于如何做到这一点的想法? Remove duplicated values appear in two columns in dataframe

Edit:

编辑:

I have tried this based on this answer:

我已经尝试过这个,基于这个答案:

df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]

And it's true that I get fewer rows in my dataframe (it has 695, and I got now 402), but I still have the first lines like this:

的确,我在我的数据框中得到了更少的行(原本有695行,现在有402行),但我仍然有第一行像这样的问题:

0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy

meaning I still get the same issue.

这意味着我仍然遇到相同的问题。

英文:

I have table similar to this one:

  1. index name_1 path1 name_2 path2
  2. 0 Roy path/to/Roy Anne path/to/Anne
  3. 1 Anne path/to/Anne Roy path/to/Roy
  4. 2 Hari path/to/Hari Wili path/to/Wili
  5. 3 Wili path/to/Wili Hari path/to/Hari
  6. 4 Miko path/to/miko Lin path/to/lin
  7. 5 Miko path/to/miko Dan path/to/dan
  8. 6 Lin path/to/lin Miko path/to/miko
  9. 7 Lin path/to/lin Dan path/to/dan
  10. 8 Dan path/to/dan Miko path/to/miko
  11. 9 Dan path/to/dan Lin path/to/lin
  12. ...

As you can see, the table kind of showing relationship between entities -
Roi is with Anne,
Wili with Hari,
Lin with Dan and with Miko.

The table is actually showing overlap data , meaning, Hari and wili for example, have the same document, and I would like to remove one of them not to have duplicated files.
In order to do this, I would like to create new table that has only one name in relationship, so I can later create list of paths to remove.

The result table will look like this :

  1. index name_1 path1 name_2 path2
  2. 0 Roy path/to/Roy Anne path/to/Anne
  3. 1 Hari path/to/Hari Wili path/to/Wili
  4. 2 Miko path/to/miko Lin path/to/lin
  5. 3 Miko path/to/miko Dan path/to/dan

The idea is that I'll use the values of "path2" to remove files with this path, and will still have the files in path1.
for that reason,
this line:

  1. 4 Lin path/to/lin Dan path/to/dan

is missing, as it will be removed using Miko...
any ideas how to do this ? Remove duplicated values appear in two columns in dataframe

Edit:

I have tried this based on this answer:

  1. df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]

And it's true that I get less rows in my dataframe (it has 695 and I got now 402) , but, I still have the first lines like this:

  1. index name_1 path1 name_2 path2
  2. 0 Roy path/to/Roy Anne path/to/Anne
  3. 1 Anne path/to/Anne Roy path/to/Roy
  4. ...

meaning I still get the same issue

答案1

得分: 4

可以使用frozenset检测重复项:

  1. out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
  2. .loc[lambda x: ~x['path2'].isin(x['path1'])])
  3. # OR
  4. out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
  5. .query('~path1.isin(path2)'))

输出:

  1. >>> out
  2. name_1 path1 name_2 path2
  3. 0 Roy path/to/Roy Anne path/to/Anne
  4. 2 Hari path/to/Hari Wili path/to/Wili
  5. 5 Miko path/to/miko Dan path/to/dan
  6. 7 Lin path/to/lin Dan path/to/dan
英文:

You can use frozenset to detect duplicates:

  1. out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
  2. .loc[lambda x: ~x['path2'].isin(x['path1'])])
  3. # OR
  4. out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
  5. .query('~path1.isin(path2)'))

Output:

  1. >>> out
  2. name_1 path1 name_2 path2
  3. 0 Roy path/to/Roy Anne path/to/Anne
  4. 2 Hari path/to/Hari Wili path/to/Wili
  5. 5 Miko path/to/miko Dan path/to/dan
  6. 7 Lin path/to/lin Dan path/to/dan

huangapple
  • 本文由 发表于 2023年5月11日 16:36:24
  • 转载请务必保留本文链接:https://go.coder-hub.com/76225668.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定