Concatenate strings from different rows in a dataframe matching the values of a list
Question
[1]: https://i.stack.imgur.com/tQTHp.png
[2]: https://i.stack.imgur.com/QXDU9.png
I have the following dataframe:
import pandas as pd
test = pd.DataFrame([{'ID': 13562, 'Role':'AR','Location': 'London'},
{'ID': 13562, 'Role':'ST','Location': 'London'},
{'ID': 13562, 'Role':'MH','Location': 'London'},
{'ID': 89912, 'Role':'ST','Location': 'Madrid'},
{'ID': 89912, 'Role':'UL','Location': 'Madrid'},
{'ID': 15673, 'Role':'AR','Location': 'Berlin'},
{'ID': 34467, 'Role':'AR','Location': 'Berlin'},
{'ID': 34777, 'Role':'ST','Location': 'Berlin'},
{'ID': 34777, 'Role':'UL','Location': 'Berlin'}])
And the following list:
tlist = ['AR', 'AR_ST_MH','ST_UL','UL']
I am trying to create a new dataframe in which the IDs and roles are grouped as shown in the list. As you can see, each role in the dataframe sits in its own row, so an ID is repeated once per role (a single ID can have one or many roles). What I find challenging is combining the roles each ID has so that they match the entries in tlist, and getting exactly one row per match.
So far I tried this:
test['grouproles'] = test.groupby(['ID'])['Role'].transform(lambda x:'_'.join(x))
But I got this:
[![Output of the attempt][1]][1]
There are two problems with this output:
- The roles are repeated on every row. What I would really like is to keep only the rows that match the values in the list.
- It does not give me all the values I am asking for. For example, one of the values in the list is just 'AR'. ID 13562 has the role AR, but the output does not give me that option.
To explain myself a bit better, this is the result I am looking for:
[![Expected result][2]][2]
That is, a new dataframe whose roles match all the options in tlist, with each ID repeated only for the matches it satisfies.
Answer 1
Score: 1
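You can use set operations:
import pandas as pd
df = pd.DataFrame(test)  # the data from the question
tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']
# map each set of roles to its label in tlist
d = {frozenset(s.split('_')): s for s in tlist}
# {frozenset({'AR'}): 'AR',
# frozenset({'AR', 'MH', 'ST'}): 'AR_ST_MH',
# frozenset({'ST', 'UL'}): 'ST_UL',
# frozenset({'UL'}): 'UL'}
# set of roles held by each ID
tmp = df.groupby('ID')['Role'].agg(frozenset)
# for each ID, keep every label whose roles are a subset (<=) of the ID's roles,
# then explode to one row per match and attach the Location
out = (df[['ID', 'Location']]
       .drop_duplicates()
       .merge(tmp.apply(lambda s: [v for k, v in d.items() if k <= s]).explode(), on='ID')
      )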
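The key step is the `k <= s` comparison, which for sets means "k is a subset of s". Here is a minimal plain-Python sketch of that matching on its own, without pandas. The helper name `matching_labels` is made up for illustration, and the role lists passed to it are taken from the question's data.

```python
tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

# Same lookup as in the answer: set of roles -> label in tlist
d = {frozenset(s.split('_')): s for s in tlist}

def matching_labels(roles):
    """Return every tlist label whose roles are all present in `roles`."""
    present = frozenset(roles)
    # `key <= present` is True when key is a subset of present
    return [label for key, label in d.items() if key <= present]

print(matching_labels(['AR', 'ST', 'MH']))  # ['AR', 'AR_ST_MH']
print(matching_labels(['UL', 'ST']))        # ['ST_UL', 'UL']
```

Because the comparison is on sets, the order in which an ID's roles appear in the rows does not matter.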
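The same logic works with plain strings, provided each ID's roles appear in the rows in the same order, and contiguously, as they do in the tlist entries:
# join the roles of each ID into one string, e.g. 'AR_ST_MH'
tmp = df.groupby('ID')['Role'].agg('_'.join)
# keep every tlist entry that is a substring of the joined roles
out = (df[['ID', 'Location']]
       .drop_duplicates()
       .merge(tmp.apply(lambda s: [s2 for s2 in tlist if s2 in s]).explode(), on='ID')
      )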
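The string variant relies on substring matching, so it only finds a label when the joined roles contain it verbatim. The sketch below shows where it diverges from the set version. The row order in `other_order` is hypothetical and does not occur in the question's data, and both function names are invented for the example.

```python
tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

def match_by_substring(roles):
    """String approach: join the roles, then look for each label as a substring."""
    joined = '_'.join(roles)
    return [label for label in tlist if label in joined]

def match_by_subset(roles):
    """Set approach: a label matches when all of its roles are present."""
    present = frozenset(roles)
    return [label for label in tlist if frozenset(label.split('_')) <= present]

same_order = ['AR', 'ST', 'MH']    # order as in the question
other_order = ['ST', 'AR', 'MH']   # hypothetical: same roles, different row order

print(match_by_substring(same_order))   # ['AR', 'AR_ST_MH']
print(match_by_substring(other_order))  # ['AR']
print(match_by_subset(other_order))     # ['AR', 'AR_ST_MH']
```

If the row order cannot be guaranteed, the set-based version is the safer choice.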
Output:
ID Location Role
0 13562 London AR
1 13562 London AR_ST_MH
2 89912 Madrid ST_UL
3 89912 Madrid UL
4 15673 Berlin AR
5 34467 Berlin AR
6 34777 Berlin ST_UL
7 34777 Berlin UL
NB. The search is O(n²), since every ID's roles are compared against every entry of tlist. You can improve on this by keeping a dictionary of the valid sets for each item. For example, for AR_ST_MH you don't have to search UL.
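One possible reading of that suggestion, offered as a sketch rather than the answerer's own code: index the tlist entries by the roles they contain, then for each ID test only the entries reachable from its own roles. The names `index`, `order` and `match_with_index` are hypothetical.

```python
from collections import defaultdict

tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']
d = {frozenset(s.split('_')): s for s in tlist}

# Position of each key in tlist, used to return matches in tlist order
order = {key: i for i, key in enumerate(d)}

# role -> set of tlist keys that contain this role
index = defaultdict(set)
for key in d:
    for role in key:
        index[role].add(key)

def match_with_index(roles):
    """Return matching labels, testing only keys that share a role with `roles`."""
    present = frozenset(roles)
    candidates = set()
    for role in present:
        candidates |= index.get(role, set())
    hits = [key for key in candidates if key <= present]
    return [d[key] for key in sorted(hits, key=order.__getitem__)]

print(match_with_index(['AR', 'ST', 'MH']))  # ['AR', 'AR_ST_MH']
```

For the roles AR, ST and MH, the candidates are AR, AR_ST_MH and ST_UL, so the UL entry is never tested. Whether this pays off depends on how large tlist is relative to the number of roles per ID.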