Assign Group Number to Dataframe - Matching across two columns

Question

In a DataFrame such as the one below, I would like to assign a group number based on shared values across two columns, regardless of order:

import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

The desired output would add a new column with a group number, as follows:

| Value_1 | Value_2 | Group |
|:--: |:---:|:---:|
| 1  | 10 | 1 |
| 1  | 15 | 1 |
| 0  | 15 | 1 |
| 4  | 0 | 1 |
| 2  | 3 | 2 |

Group numbers are assigned so that rows sharing a value in either the Value_1 or Value_2 column receive the same group number. For example, since neither 2 nor 3 appears anywhere else in the Value_1 or Value_2 columns, that row is assigned to a new group, whereas 1 and 15 appear multiple times, so their rows receive the same group number regardless of the other value in the row. Likewise, 0 appears in both a Value_1 cell and a Value_2 cell and is linked to Group 1 via the 15.
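
One way to picture the rule: treat every value as a node and every row as a link between its two values, so rows whose values fall in the same connected chain share a group. A minimal pure-Python sketch of that merging on the sample pairs (an illustrative sketch only):

# Merge the pairs transitively into sets of linked values.
pairs = [(1, 10), (1, 15), (0, 15), (4, 0), (2, 3)]
groups = []                                         # running list of linked-value sets
for a, b in pairs:
    hits = [g for g in groups if a in g or b in g]  # existing sets this row touches
    groups = [g for g in groups if g not in hits]   # drop the touched sets...
    groups.append(set().union({a, b}, *hits))       # ...and re-add them merged with the pair
print(groups)  # two sets: {0, 1, 4, 10, 15} and {2, 3} -> Group 1 and Group 2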


Answer 1

Score: 0

I have the following working as desired:

def consolidate(sets):
    # Repeatedly merge any sets that share an element until the remaining sets are disjoint.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i + 1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def group_ids(pairs):
    # Map every value to the index (starting at 0) of the consolidated set it belongs to.
    groups = consolidate(map(set, pairs))
    d = {}
    for i, group in enumerate(sorted(groups)):
        for elem in group:
            d[elem] = i
    return d

df["C"] = df["Value_1"].replace(group_ids(zip(df.Value_1, df.Value_2)))
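
The ids produced this way start at 0 and land in a column named C; if 1-based labels in a Group column (as in the desired output) are wanted, one optional extra step could be, for example:

# Relabel the consolidated ids as 1-based group numbers, in order of first appearance.
df["Group"] = pd.factorize(df["C"])[0] + 1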


Answer 2

Score: 0

You can use networkx to automate the connected component search:

import networkx as nx

G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')

mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}

df['Group'] = df['Value_1'].map(mapper)

Output:

   Value_1  Value_2  Group
0        1       10      1
1        1       15      1
2        0       15      1
3        4        0      1
4        2        3      2
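
To see where the Group column comes from, the intermediate objects can be inspected directly (element and key order may vary, but the grouping matches the output above):

# One connected component per group, and one mapper entry per value.
print(list(nx.connected_components(G)))  # e.g. [{0, 1, 4, 10, 15}, {2, 3}]
print(mapper)                            # e.g. {0: 1, 1: 1, 4: 1, 10: 1, 15: 1, 2: 2, 3: 2}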

You can generalize to any number of columns, using melt:

df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]
                  })
cols = ['Value_1', 'Value_2', 'Value_3']

G = nx.from_pandas_edgelist(df.melt(cols[0], cols[1:]),
                            source=cols[0], target='value')

mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}

df['Group'] = df[cols[0]].map(mapper)

Output:

   Value_1  Value_2  Value_3  Group
0        9       10       20      1
1        1       15       21      2
2        0       15       22      2
3        4        0       23      2
4        2        3       23      2
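
The melt step is what turns the wide frame into a two-column edge list before the graph is built; a quick way to peek at that intermediate result (column names follow pandas' melt defaults):

# One row per (Value_1, other-column value) pair -- 10 rows for this frame.
edges = df.melt(cols[0], cols[1:])
print(edges[[cols[0], 'value']])
# Each Value_1 is paired with its Value_2 and its Value_3, so all values that
# co-occur in a row end up in the same connected component.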
