Assign Group Number to Dataframe - Matching across two columns

Question

In a df such as the one below, I would like to assign a group number based on shared values across two columns, regardless of order:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])
```

| Value_1 | Value_2 |
|:-------:|:-------:|
| 1 | 10 |
| 1 | 15 |
| 0 | 15 |
| 4 | 0 |
| 2 | 3 |

Desired output would assign a new column with a group number as such:

| Value_1 | Value_2 | Group |
|:-------:|:-------:|:-----:|
| 1 | 10 | 1 |
| 1 | 15 | 1 |
| 0 | 15 | 1 |
| 4 | 0 | 1 |
| 2 | 3 | 2 |

The group number is assigned so that rows sharing a value in either the Value_1 or Value_2 column receive the same group. For example, since neither 2 nor 3 appears anywhere else in the Value_1 or Value_2 columns, that row is assigned to a new group, whereas 1 and 15 each appear multiple times, so their rows receive the same group number regardless of the other value in the row. Likewise, 0 appears in both a Value_1 cell and a Value_2 cell and is linked to Group 1 via the 15.
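
As an illustration of this linking rule only (not the approach used in either answer below), a minimal union-find sketch over the values of both columns reproduces the expected Group column; the `parent`/`find` names are just for the example:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

# Every distinct value starts as its own set representative.
parent = {v: v for row in data for v in row}

def find(x):
    # Follow parent pointers up to the representative of x's set.
    while parent[x] != x:
        x = parent[x]
    return x

# Each row links its two values into the same set.
for a, b in data:
    parent[find(a)] = find(b)

# Number the sets in order of first appearance and label each row.
seen = {}
df['Group'] = [seen.setdefault(find(a), len(seen) + 1) for a in df['Value_1']]
print(df)
#    Value_1  Value_2  Group
# 0        1       10      1
# 1        1       15      1
# 2        0       15      1
# 3        4        0      1
# 4        2        3      2
```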

Answer 1

Score: 0

I have the following working as desired:

```python
def consolidate(sets):
    # Repeatedly merge sets that share at least one element,
    # emptying the set that was merged away.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i + 1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]


def group_ids(pairs):
    # Map every value to the index of the consolidated group it belongs to.
    groups = consolidate(map(set, pairs))
    d = {}
    for i, group in enumerate(sorted(groups)):
        for elem in group:
            d[elem] = i
    return d


df["C"] = df["Value_1"].replace(group_ids(zip(df.Value_1, df.Value_2)))
```

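With `consolidate` and `group_ids` defined as above, running this on the sample frame from the question should give the labels below. Note that `enumerate` starts at 0, so pass `start=1` if you want group numbers beginning at 1 as in the desired output; which consolidated set ends up with which label depends on how `sorted()` orders the disjoint sets, but rows linked by a shared value always receive the same label:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

df["C"] = df["Value_1"].replace(group_ids(zip(df.Value_1, df.Value_2)))
print(df)
#    Value_1  Value_2  C
# 0        1       10  0
# 1        1       15  0
# 2        0       15  0
# 3        4        0  0
# 4        2        3  1
```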


Answer 2

Score: 0

You can use networkx to automate the connected component search:

```python
import networkx as nx

G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df['Value_1'].map(mapper)
```

Output:

```
   Value_1  Value_2  Group
0        1       10      1
1        1       15      1
2        0       15      1
3        4        0      1
4        2        3      2
```
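
If it helps to see what drives these labels, the connected components can be inspected directly; every value of a given row falls inside a single component, which is why mapping `Value_1` alone is enough to label the whole row. A quick, self-contained check (the components shown in the comment are what the sample data should produce):

```python
import networkx as nx
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')
print(list(nx.connected_components(G)))
# e.g. [{0, 1, 4, 10, 15}, {2, 3}] -> the first set becomes Group 1, the second Group 2
```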

You can generalize to any number of columns, using melt:

```python
df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]
                   })

cols = ['Value_1', 'Value_2', 'Value_3']

G = nx.from_pandas_edgelist(df.melt(cols[0], cols[1:]),
                            source=cols[0], target='value')
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df[cols[0]].map(mapper)
```

Output:

```
   Value_1  Value_2  Value_3  Group
0        9       10       20      1
1        1       15       21      2
2        0       15       22      2
3        4        0       23      2
4        2        3       23      2
```
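
To see why the `melt` works, it reshapes the frame so that the first column is paired with every value from the remaining columns, and each of those pairs becomes one edge of the graph. A quick preview of that intermediate long-format frame, using the same sample data as above (the expected output is shown as comments):

```python
import pandas as pd

df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]})
cols = ['Value_1', 'Value_2', 'Value_3']

# Long format: one row per (Value_1, other-column value) pair.
print(df.melt(cols[0], cols[1:]))
#    Value_1 variable  value
# 0        9  Value_2     10
# 1        1  Value_2     15
# 2        0  Value_2     15
# 3        4  Value_2      0
# 4        2  Value_2      3
# 5        9  Value_3     20
# 6        1  Value_3     21
# 7        0  Value_3     22
# 8        4  Value_3     23
# 9        2  Value_3     23
```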
