May 11, 2023 16:36:24

Remove duplicated values that appear in two columns of a dataframe

Question

I have a table similar to this one:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
2       Hari      path/to/Hari    Wili      path/to/Wili
3       Wili      path/to/Wili    Hari      path/to/Hari
4       Miko      path/to/miko     Lin      path/to/lin
5       Miko      path/to/miko     Dan      path/to/dan
6       Lin       path/to/lin     Miko      path/to/miko
7       Lin       path/to/lin     Dan       path/to/dan
8       Dan       path/to/dan     Miko      path/to/miko
9       Dan       path/to/dan     Lin       path/to/lin
...

As you can see, the table shows relationships between entities:
Roy is with Anne,
Wili is with Hari,
Lin is with Dan and with Miko.

The table actually shows overlapping data: Hari and Wili, for example, have the same document, and I would like to remove one of them so that I don't keep duplicated files.
To do this, I would like to create a new table that has only one name in each relationship, so I can later create a list of paths to remove.

The result table should look like this:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy      Anne      path/to/Anne
1       Hari      path/to/Hari     Wili      path/to/Wili
2       Miko      path/to/miko     Lin       path/to/lin
3       Miko      path/to/miko     Dan       path/to/dan

The idea is that I'll use the values of "path2" to remove the files at those paths, and will still have the files in path1.
For that reason, this line:

4       Lin       path/to/lin    Dan       path/to/dan

is missing, as it will be removed using Miko...
Any ideas how to do this?
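
The last step described above, turning the result table into a list of paths to delete, could look roughly like the sketch below. It assumes the result table has already been built; here it is typed in by hand from the desired output, and the names `result` and `paths_to_remove` are illustrative only.

```python
import pandas as pd

# The desired result table from the question, typed in by hand.
result = pd.DataFrame(
    [
        ("Roy", "path/to/Roy", "Anne", "path/to/Anne"),
        ("Hari", "path/to/Hari", "Wili", "path/to/Wili"),
        ("Miko", "path/to/miko", "Lin", "path/to/lin"),
        ("Miko", "path/to/miko", "Dan", "path/to/dan"),
    ],
    columns=["name_1", "path1", "name_2", "path2"],
)

# Every path2 value is a duplicate to delete; drop_duplicates guards against repeats.
paths_to_remove = result["path2"].drop_duplicates().tolist()

# Sanity check: nothing scheduled for deletion is also a file we intend to keep.
assert not set(paths_to_remove) & set(result["path1"])

print(paths_to_remove)
```

Because the paths of both Lin and Dan are already in the list through Miko, a separate Lin/Dan row would add nothing to it. The actual file deletion is deliberately left out of the sketch.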

Edit:

I have tried this based on this answer:

df_2 = df[~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()]

And it's true that I get fewer rows in my dataframe (it had 695 and I now have 402), but I still have first lines like this:

index   name_1     path1        name_2       path2
0       Roy       path/to/Roy     Anne      path/to/Anne
1       Anne      path/to/Anne     Roy      path/to/Roy 
...

meaning I still have the same issue.

Answer 1

Score: 4

You can use frozenset to detect duplicates:

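import numpy as np
import pandas as pd

# Variant 1 (gives the output below): drop mirrored name pairs, then rows whose path2 is also a path1
out = (df[~df[['name_1', 'name_2']].agg(frozenset, axis=1).duplicated()]
           .loc[lambda x: ~x['path2'].isin(x['path1'])])
# OR
# Variant 2: sort each row's values to spot mirrored rows; filters path1 against path2, so rows kept can differ
out = (df[~pd.DataFrame(np.sort(df.values, axis=1)).duplicated()]
           .query('~path1.isin(path2)'))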

Output:

>>> out
  name_1         path1 name_2         path2
0    Roy   path/to/Roy   Anne  path/to/Anne
2   Hari  path/to/Hari   Wili  path/to/Wili
5   Miko  path/to/miko    Dan   path/to/dan
7    Lin   path/to/lin    Dan   path/to/dan
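
In the output above, Miko and Lin both stay in path1, whereas the result table requested in the question keeps only Miko for that group. One alternative is sketched below. It assumes that overlap is transitive: if Miko overlaps Lin and Lin overlaps Dan, all three hold the same document. Names are grouped with a small union-find, one representative per group is kept, and every other member is paired with it. The helper names `find`, `parent`, and `paths` are illustrative and not part of pandas, and `df` is rebuilt from the sample rows in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("Roy", "path/to/Roy", "Anne", "path/to/Anne"),
        ("Anne", "path/to/Anne", "Roy", "path/to/Roy"),
        ("Hari", "path/to/Hari", "Wili", "path/to/Wili"),
        ("Wili", "path/to/Wili", "Hari", "path/to/Hari"),
        ("Miko", "path/to/miko", "Lin", "path/to/lin"),
        ("Miko", "path/to/miko", "Dan", "path/to/dan"),
        ("Lin", "path/to/lin", "Miko", "path/to/miko"),
        ("Lin", "path/to/lin", "Dan", "path/to/dan"),
        ("Dan", "path/to/dan", "Miko", "path/to/miko"),
        ("Dan", "path/to/dan", "Lin", "path/to/lin"),
    ],
    columns=["name_1", "path1", "name_2", "path2"],
)

# Union-find over names: names that share a row end up in the same group.
parent = {}

def find(name):
    """Return the representative of the group containing `name`."""
    parent.setdefault(name, name)
    while parent[name] != name:
        parent[name] = parent[parent[name]]  # path halving
        name = parent[name]
    return name

for a, b in zip(df["name_1"], df["name_2"]):
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a  # name_1's representative absorbs name_2's group

# Map every name to its path, using both column pairs.
paths = dict(zip(df["name_1"], df["path1"]))
paths.update(zip(df["name_2"], df["path2"]))

# One row per non-representative name: keeper, keeper path, duplicate, duplicate path.
result = pd.DataFrame(
    [(find(n), paths[find(n)], n, paths[n]) for n in paths if find(n) != n],
    columns=["name_1", "path1", "name_2", "path2"],
)
print(result)
```

On the sample rows this keeps Roy, Hari, and Miko and lists Anne, Wili, Lin, and Dan in name_2. Which member becomes the representative depends on row order, so the sketch would need adjusting if a particular copy must be the one that is kept.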
