Assign Group Number to Dataframe - Matching across two columns
Question
In a DataFrame such as the one below, I would like to assign a group number based on values shared across two columns, regardless of which column they appear in:

import pandas as pd

data = [[1, 10], [1, 15], [0, 15], [4, 0], [2, 3]]
df = pd.DataFrame(data, columns=['Value_1', 'Value_2'])
Value_1 | Value_2 |
---|---|
1 | 10 |
1 | 15 |
0 | 15 |
4 | 0 |
2 | 3 |
Desired output would assign a new column with a group number as such:
Value_1 | Value_2 | Group |
---|---|---|
1 | 10 | 1 |
1 | 15 | 1 |
0 | 15 | 1 |
4 | 0 | 1 |
2 | 3 | 2 |
Rows that share a value, in either the Value_1 or the Value_2 column, belong to the same group. Since neither 2 nor 3 appears anywhere else in the table, that row is assigned to a new group. In contrast, 1 and 15 each appear more than once, so every row containing one of them gets the same group number, regardless of the other value in that row. Likewise, 0 appears in both the Value_1 and the Value_2 column and is linked to Group 1 via the 15.
Answer 1
Score: 0
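I have the following working as desired:

def consolidate(sets):
    # Merge sets that share at least one element, directly or transitively.
    setlist = [s for s in sets if s]
    for i, s1 in enumerate(setlist):
        if s1:
            for s2 in setlist[i+1:]:
                intersection = s1.intersection(s2)
                if intersection:
                    # Fold s1 into s2, empty s1, and keep merging from s2.
                    s2.update(s1)
                    s1.clear()
                    s1 = s2
    return [s for s in setlist if s]

def group_ids(pairs):
    # Map every value to the 1-based number of its consolidated group.
    groups = consolidate(map(set, pairs))
    d = {}
    for i, group in enumerate(groups, start=1):
        for elem in group:
            d[elem] = i
    return d

df["Group"] = df["Value_1"].map(group_ids(zip(df.Value_1, df.Value_2)))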
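The pairwise consolidation above compares each set against every later one. As a possible alternative, here is a minimal sketch of the same grouping with a disjoint-set (union-find) structure. It is not part of the original answer; the function name `group_pairs` and the choice to number groups in order of first appearance are assumptions made for illustration.

```python
def group_pairs(pairs):
    """Return a 1-based group number per pair; pairs sharing any value share a group."""
    parent = {}

    def find(x):
        # Walk up to the root of x's tree, shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = list(pairs)
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    group_of_root = {}
    labels = []
    for a, _ in pairs:
        root = find(a)
        # Number groups in order of first appearance, starting at 1.
        labels.append(group_of_root.setdefault(root, len(group_of_root) + 1))
    return labels


pairs = [(1, 10), (1, 15), (0, 15), (4, 0), (2, 3)]
print(group_pairs(pairs))  # [1, 1, 1, 1, 2]
```

With the question's frame, the result could be attached as `df['Group'] = group_pairs(zip(df['Value_1'], df['Value_2']))`.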
Answer 2
Score: 0
You can use networkx to automate the connected component search:

import networkx as nx

# Treat each row as an edge between its two values.
G = nx.from_pandas_edgelist(df, source='Value_1', target='Value_2')
# Number the connected components from 1 and map every node to its component.
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df['Value_1'].map(mapper)

Output:

   Value_1  Value_2  Group
0        1       10      1
1        1       15      1
2        0       15      1
3        4        0      1
4        2        3      2
You can generalize to any number of columns, using melt:

df = pd.DataFrame({'Value_1': [9, 1, 0, 4, 2],
                   'Value_2': [10, 15, 15, 0, 3],
                   'Value_3': [20, 21, 22, 23, 23]
                  })
cols = ['Value_1', 'Value_2', 'Value_3']
# Reshape to long form so each (first column, other column) pair becomes an edge.
G = nx.from_pandas_edgelist(df.melt(cols[0], cols[1:]),
                            source=cols[0], target='value')
mapper = {node: i for i, sub in enumerate(nx.connected_components(G), start=1)
          for node in sub}
df['Group'] = df[cols[0]].map(mapper)

Output:

   Value_1  Value_2  Value_3  Group
0        9       10       20      1
1        1       15       21      2
2        0       15       22      2
3        4        0       23      2
4        2        3       23      2
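To see roughly what the connected component search is doing, here is a minimal pure-Python sketch that builds the same edges (first value of each row linked to every other value in that row) and labels components with a breadth-first search. It is an illustration only, not networkx's actual implementation; the name `group_rows` is hypothetical, and the sketch assumes every row has at least one value.

```python
from collections import defaultdict, deque


def group_rows(rows):
    """Return a 1-based component number per row; rows sharing any value are linked."""
    rows = [tuple(row) for row in rows]

    # Link the first value of each row to every other value in that row,
    # mirroring the edges described by the melted edge list.
    neighbours = defaultdict(set)
    for first, *rest in rows:
        neighbours[first]  # register the node even if the row holds a single value
        for other in rest:
            neighbours[first].add(other)
            neighbours[other].add(first)

    component = {}
    next_id = 0
    for start in list(neighbours):
        if start in component:
            continue
        # Unvisited node: open a new component and flood it breadth-first.
        next_id += 1
        component[start] = next_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component[nxt] = next_id
                    queue.append(nxt)

    return [component[first] for first, *_ in rows]


rows = [(9, 10, 20), (1, 15, 21), (0, 15, 22), (4, 0, 23), (2, 3, 23)]
print(group_rows(rows))  # [1, 2, 2, 2, 2]
```

With a frame like the one above, the rows could be passed in as `df[cols].itertuples(index=False)`.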