Assign Group Number to Dataframe - Matching across two columns

Question

In a df such as the one below, I would like to assign a group number based on shared values across two columns, regardless of order:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])
```

| Value_1 | Value_2 |
|:-------:|:-------:|
| 1 | 10 |
| 1 | 15 |
| 0 | 15 |
| 4 | 0 |
| 2 | 3 |

Desired output would assign a new column with a group number as such:

| Value_1 | Value_2 | Group |
|:-------:|:-------:|:-----:|
| 1 | 10 | 1 |
| 1 | 15 | 1 |
| 0 | 15 | 1 |
| 4 | 0 | 1 |
| 2 | 3 | 2 |

The group number is assigned so that rows sharing a value in either the Value_1 or Value_2 column receive the same group. For example, since neither 2 nor 3 appears anywhere else in the Value_1 or Value_2 columns, that row is assigned to a new group, whereas 1 and 15 each appear multiple times, so their rows receive the same group number regardless of the other value in the row. Likewise, 0 appears in both a Value_1 cell and a Value_2 cell and is linked to Group 1 via the 15.
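
As an illustration of this linking rule only (not the approach used in either answer below), a minimal union-find sketch over the values of both columns reproduces the expected Group column; the `parent`/`find` names are just for the example:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

# Every distinct value starts as its own set representative.
parent = {v: v for row in data for v in row}

def find(x):
    # Follow parent pointers up to the representative of x's set.
    while parent[x] != x:
        x = parent[x]
    return x

# Each row links its two values into the same set.
for a, b in data:
    parent[find(a)] = find(b)

# Number the sets in order of first appearance and label each row.
seen = {}
df['Group'] = [seen.setdefault(find(a), len(seen) + 1) for a in df['Value_1']]
print(df)
#    Value_1  Value_2  Group
# 0        1       10      1
# 1        1       15      1
# 2        0       15      1
# 3        4        0      1
# 4        2        3      2
```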

Answer 1

Score: 0

I have the following working as desired:

```python
def consolidate(sets):
    # Repeatedly merge sets that share at least one element,
    # emptying the set that was merged away.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i + 1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]


def group_ids(pairs):
    # Map every value to the index of the consolidated group it belongs to.
    groups = consolidate(map(set, pairs))
    d = {}
    for i, group in enumerate(sorted(groups)):
        for elem in group:
            d[elem] = i
    return d


df["C"] = df["Value_1"].replace(group_ids(zip(df.Value_1, df.Value_2)))
```

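With `consolidate` and `group_ids` defined as above, running this on the sample frame from the question should give the labels below. Note that `enumerate` starts at 0, so pass `start=1` if you want group numbers beginning at 1 as in the desired output; which consolidated set ends up with which label depends on how `sorted()` orders the disjoint sets, but rows linked by a shared value always receive the same label:

```python
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

df["C"] = df["Value_1"].replace(group_ids(zip(df.Value_1, df.Value_2)))
print(df)
#    Value_1  Value_2  C
# 0        1       10  0
# 1        1       15  0
# 2        0       15  0
# 3        4        0  0
# 4        2        3  1
```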


Answer 2

Score: 0

You can use networkx to automate the connected component search:

```python
import networkx as nx

G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df['Value_1'].map(mapper)
```

Output:

```
   Value_1  Value_2  Group
0        1       10      1
1        1       15      1
2        0       15      1
3        4        0      1
4        2        3      2
```
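
If it helps to see what drives these labels, the connected components can be inspected directly; every value of a given row falls inside a single component, which is why mapping `Value_1` alone is enough to label the whole row. A quick, self-contained check (the components shown in the comment are what the sample data should produce):

```python
import networkx as nx
import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])

G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')
print(list(nx.connected_components(G)))
# e.g. [{0, 1, 4, 10, 15}, {2, 3}] -> the first set becomes Group 1, the second Group 2
```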

You can generalize to any number of columns, using melt:

```python
df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]
                   })

cols = ['Value_1', 'Value_2', 'Value_3']

G = nx.from_pandas_edgelist(df.melt(cols[0], cols[1:]),
                            source=cols[0], target='value')
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df[cols[0]].map(mapper)
```

Output:

```
   Value_1  Value_2  Value_3  Group
0        9       10       20      1
1        1       15       21      2
2        0       15       22      2
3        4        0       23      2
4        2        3       23      2
```
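
To see why the `melt` works, it reshapes the frame so that the first column is paired with every value from the remaining columns, and each of those pairs becomes one edge of the graph. A quick preview of that intermediate long-format frame, using the same sample data as above (the expected output is shown as comments):

```python
import pandas as pd

df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]})
cols = ['Value_1', 'Value_2', 'Value_3']

# Long format: one row per (Value_1, other-column value) pair.
print(df.melt(cols[0], cols[1:]))
#    Value_1 variable  value
# 0        9  Value_2     10
# 1        1  Value_2     15
# 2        0  Value_2     15
# 3        4  Value_2      0
# 4        2  Value_2      3
# 5        9  Value_3     20
# 6        1  Value_3     21
# 7        0  Value_3     22
# 8        4  Value_3     23
# 9        2  Value_3     23
```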
