Concatenate strings from different rows in a dataframe matching the values of a list

Question

I have the following dataframe:

    import pandas as pd

    test = pd.DataFrame([
        {'ID': 13562, 'Role': 'AR', 'Location': 'London'},
        {'ID': 13562, 'Role': 'ST', 'Location': 'London'},
        {'ID': 13562, 'Role': 'MH', 'Location': 'London'},
        {'ID': 89912, 'Role': 'ST', 'Location': 'Madrid'},
        {'ID': 89912, 'Role': 'UL', 'Location': 'Madrid'},
        {'ID': 15673, 'Role': 'AR', 'Location': 'Berlin'},
        {'ID': 34467, 'Role': 'AR', 'Location': 'Berlin'},
        {'ID': 34777, 'Role': 'ST', 'Location': 'Berlin'},
        {'ID': 34777, 'Role': 'UL', 'Location': 'Berlin'},
    ])

And the following list:

    tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

I am trying to create a new dataframe in which the IDs and roles are grouped as shown in the list. As you can see, each role in the dataframe sits on its own row, with the ID repeated for each role (a single ID can have one or many roles). What I find challenging is combining the different roles of each ID so that they match the entries in `tlist`, and getting a unique row for each match.

So far I tried this:

    test['grouproles'] = test.groupby(['ID'])['Role'].transform(lambda x: '_'.join(x))

But I got this:

[Screenshot of the output](https://i.stack.imgur.com/tQTHp.png)

There are two problems with this output:

  • The roles are repeated. What I would really like is to keep only the rows that match the values in the list.
  • It does not give me all the values I am asking for. For example, one of the values in the list is just 'AR'; ID 13562 has the role AR, but the output does not include that option.
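
The behaviour described in these two points can be reproduced with a minimal sketch, assuming the records are loaded with `pd.DataFrame`. The trimmed `records` list and the names `df` and `joined` are illustrative and not taken from the post. `transform` returns one value per input row, so the joined string is broadcast onto every row of the ID, and only the full combination of roles ever appears:

```python
import pandas as pd

# Illustrative subset of the question's data (assumption: loaded as a DataFrame).
records = [
    {'ID': 13562, 'Role': 'AR', 'Location': 'London'},
    {'ID': 13562, 'Role': 'ST', 'Location': 'London'},
    {'ID': 13562, 'Role': 'MH', 'Location': 'London'},
    {'ID': 89912, 'Role': 'ST', 'Location': 'Madrid'},
    {'ID': 89912, 'Role': 'UL', 'Location': 'Madrid'},
]
df = pd.DataFrame(records)

# transform keeps the original shape: the joined string is repeated
# on every row that belongs to the same ID.
joined = df.groupby('ID')['Role'].transform(lambda x: '_'.join(x))
print(joined.tolist())
```

Because the result has as many rows as the input, partial combinations such as 'AR' alone cannot come out of this step; they have to be derived by comparing each ID's roles against `tlist`.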

To explain myself a bit better, this is the result I am looking for:

[Screenshot of the expected result](https://i.stack.imgur.com/QXDU9.png)

That is, a new dataframe whose roles match all the options in `tlist`, with the IDs repeated only for the requested matches.

Answer 1

Score: 1

You can use set operations:

    import pandas as pd

    df = pd.DataFrame(test)  # assumes `test` holds the question's records
    tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

    # Map the set of roles of each tlist entry back to the entry itself
    d = {frozenset(s.split('_')): s for s in tlist}
    # {frozenset({'AR'}): 'AR',
    #  frozenset({'AR', 'MH', 'ST'}): 'AR_ST_MH',
    #  frozenset({'ST', 'UL'}): 'ST_UL',
    #  frozenset({'UL'}): 'UL'}

    # Set of roles held by each ID
    tmp = df.groupby('ID')['Role'].agg(frozenset)

    # Keep every tlist entry whose roles are a subset of the ID's roles,
    # then explode to one row per match
    out = (df[['ID', 'Location']]
           .drop_duplicates()
           .merge(tmp.apply(lambda s: [v for k, v in d.items() if k <= s]).explode(), on='ID')
          )

The same logic works with strings, provided each ID's roles appear in the same order, and contiguously, as in the `tlist` entries:

    # Join each ID's roles in row order, e.g. 'AR_ST_MH'
    tmp = df.groupby('ID')['Role'].agg('_'.join)

    # Substring test: keep the tlist entries contained in the joined string
    out = (df[['ID', 'Location']]
           .drop_duplicates()
           .merge(tmp.apply(lambda s: [s2 for s2 in tlist if s2 in s]).explode(), on='ID')
          )

Output:

          ID Location      Role
    0  13562   London        AR
    1  13562   London  AR_ST_MH
    2  89912   Madrid     ST_UL
    3  89912   Madrid        UL
    4  15673   Berlin        AR
    5  34467   Berlin        AR
    6  34777   Berlin     ST_UL
    7  34777   Berlin        UL
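
Regarding the string-based variant, the order caveat can be made concrete with a small sketch. The role string `'ST_AR_MH'` is a hypothetical example, not taken from the data above: it holds the same roles as `'AR_ST_MH'` in a different row order, which is enough to make the substring test miss while the subset test still matches.

```python
# Roles of a single ID joined in row order (hypothetical values):
# same roles as 'AR_ST_MH', but the rows came in a different order.
joined = 'ST_AR_MH'
roles = frozenset(joined.split('_'))

target = 'AR_ST_MH'

# The substring test depends on order and adjacency of the roles.
substring_match = target in joined

# The subset test only looks at membership.
subset_match = frozenset(target.split('_')) <= roles

print(substring_match, subset_match)
```

If the row order of roles is not guaranteed, the `frozenset` version is the safer of the two.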

NB. The search is O(n²); you can, however, improve this by keeping a dictionary of valid sets for each item. For example, for AR_ST_MH you don't have to search UL.
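
One possible reading of this suggestion, given as a sketch under assumptions rather than the answerer's own implementation: index the `tlist` entries by the roles they contain, so that each ID is only tested against entries sharing at least one of its roles. The names `by_role`, `role_sets`, `position` and `matches` are illustrative.

```python
from collections import defaultdict

tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

# Precompute the role set and the original position of each entry.
role_sets = {entry: frozenset(entry.split('_')) for entry in tlist}
position = {entry: i for i, entry in enumerate(tlist)}

# Map each role to the tlist entries that contain it.
by_role = defaultdict(list)
for entry, roles in role_sets.items():
    for role in roles:
        by_role[role].append(entry)


def matches(roles):
    """Return the tlist entries whose roles are all present in `roles`."""
    roles = frozenset(roles)
    # Candidates: only entries sharing at least one role with this ID.
    candidates = {e for r in roles for e in by_role.get(r, ())}
    # Keep the entries fully covered by the ID's roles, in tlist order.
    hits = [e for e in candidates if role_sets[e] <= roles]
    return sorted(hits, key=position.__getitem__)


print(matches(['AR', 'ST', 'MH']))
print(matches(['ST', 'UL']))
```

With this index, an ID holding AR, ST and MH is never compared against the 'UL' entry. A function like `matches` could replace the list comprehension inside `tmp.apply(...)`; whether it pays off depends on how many entries `tlist` has.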

huangapple
  • Published on August 11, 2023, 03:45:26
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76878879.html