英文:
Remove duplicated values appear in two columns in dataframe
问题
I have a table similar to this one:
我有一个类似这样的表格:
As you can see, the table kind of showing relationship between entities -
如您所见,表格在一定程度上显示了实体之间的关系 -
The table is actually showing overlap data, meaning, Hari and Wili for example, have the same document, and I would like to remove one of them not to have duplicated files.
实际上,这个表格显示了重叠的数据,意味着Hari和Wili,例如,拥有相同的文件,我想删除其中一个,以避免重复的文件。
In order to do this, I would like to create a new table that has only one name in the relationship, so I can later create a list of paths to remove.
为了做到这一点,我想创建一个新表,其中关系中只有一个名称,这样我以后可以创建一个要删除的路径列表。
The result table will look like this:
结果表格将如下所示:
The idea is that I'll use the values of "path2" to remove files with this path and will still have the files in path1.
这个想法是我将使用"path2"的值来删除具有此路径的文件,仍然会在路径1中保留文件。
For that reason, this line:
因此,这一行:
4 Lin path/to/lin Dan path/to/dan
is missing, as it will be removed using Miko...
缺失了,因为它将被使用Miko删除...
Any ideas how to do this?
有没有关于如何做到这一点的想法?
Edit:
编辑:
I have tried this based on this answer:
我已经尝试过这个,基于这个答案:
df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
And it's true that I get fewer rows in my dataframe (it has 695, and I got now 402), but I still have the first lines like this:
的确,我在我的数据框中得到了更少的行(原本有695行,现在有402行),但我仍然有第一行像这样的问题:
0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy
meaning I still get the same issue.
这意味着我仍然遇到相同的问题。
英文:
I have table similar to this one:
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy
2 Hari path/to/Hari Wili path/to/Wili
3 Wili path/to/Wili Hari path/to/Hari
4 Miko path/to/miko Lin path/to/lin
5 Miko path/to/miko Dan path/to/dan
6 Lin path/to/lin Miko path/to/miko
7 Lin path/to/lin Dan path/to/dan
8 Dan path/to/dan Miko path/to/miko
9 Dan path/to/dan Lin path/to/lin
...
As you can see, the table kind of showing relationship between entities -
Roi is with Anne,
Wili with Hari,
Lin with Dan and with Miko.
The table is actually showing overlap data , meaning, Hari and wili for example, have the same document, and I would like to remove one of them not to have duplicated files.
In order to do this, I would like to create new table that has only one name in relationship, so I can later create list of paths to remove.
The result table will look like this :
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Hari path/to/Hari Wili path/to/Wili
2 Miko path/to/miko Lin path/to/lin
3 Miko path/to/miko Dan path/to/dan
The idea is that I'll use the values of "path2" to remove files with this path, and will still have the files in path1.
for that reason,
this line:
4 Lin path/to/lin Dan path/to/dan
is missing, as it will be removed using Miko...
any ideas how to do this ?
Edit:
I have tried this based on this answer:
df_2= df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
And it's true that I get less rows in my dataframe (it has 695 and I got now 402) , but, I still have the first lines like this:
index name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
1 Anne path/to/Anne Roy path/to/Roy
...
meaning I still get the same issue
答案1
得分: 4
可以使用frozenset
检测重复项:
out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
.loc[lambda x: ~x['path2'].isin(x['path1'])])
# OR
out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
.query('~path1.isin(path2)'))
输出:
>>> out
name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
2 Hari path/to/Hari Wili path/to/Wili
5 Miko path/to/miko Dan path/to/dan
7 Lin path/to/lin Dan path/to/dan
英文:
You can use frozenset
to detect duplicates:
out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
.loc[lambda x: ~x['path2'].isin(x['path1'])])
# OR
out = (df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
.query('~path1.isin(path2)'))
Output:
>>> out
name_1 path1 name_2 path2
0 Roy path/to/Roy Anne path/to/Anne
2 Hari path/to/Hari Wili path/to/Wili
5 Miko path/to/miko Dan path/to/dan
7 Lin path/to/lin Dan path/to/dan
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论