August 11, 2023, 03:45:26

Concatenate strings from different rows in a dataframe matching the values of a list

Question

I have the following dataframe:

    import pandas as pd

    test = pd.DataFrame([{'ID': 13562, 'Role':'AR','Location': 'London'},
        {'ID': 13562, 'Role':'ST','Location': 'London'},
        {'ID': 13562, 'Role':'MH','Location': 'London'},
        {'ID': 89912, 'Role':'ST','Location': 'Madrid'},
        {'ID': 89912, 'Role':'UL','Location': 'Madrid'},
        {'ID': 15673, 'Role':'AR','Location': 'Berlin'},
        {'ID': 34467, 'Role':'AR','Location': 'Berlin'},
        {'ID': 34777, 'Role':'ST','Location': 'Berlin'},
        {'ID': 34777, 'Role':'UL','Location': 'Berlin'}])

And the following list:

    tlist = ['AR', 'AR_ST_MH','ST_UL','UL']

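I am trying to create a new dataframe that groups the IDs and roles as shown in the list. As you can see, each role in the dataframe sits in its own row, with the ID repeated for every role (a single ID can have one or many roles). The challenge for me is combining the roles that each ID has so that they match the entries in tlist, with one row per match.

So far I have tried this:

    test['grouproles'] = test.groupby(['ID'])['Role'].transform(lambda x:'_'.join(x))

But I got this:

[![.][1]][1]

There are two problems with this output:

- The roles are repeated. What I would really like is to keep only the rows that match the values in the list.
- It does not give me all the values I am asking for. For example, one of the values in the list is just 'AR'; ID 13562 has the role AR, but the output does not give me that option.

To explain myself a bit better, this is the result I am looking for:

[![..][2]][2]

A new dataframe with the new roles matching all the options in tlist, repeating the IDs only for the requested matches.
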
  [1]: https://i.stack.imgur.com/tQTHp.png
  [2]: https://i.stack.imgur.com/QXDU9.png

Answer 1

Score: 1

You can use set operations:

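    import pandas as pd

    # df is the question's data as a DataFrame
    df = pd.DataFrame(test)

    tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']
    # map the set of roles in each tlist entry to its label
    d = {frozenset(s.split('_')): s for s in tlist}
    # {frozenset({'AR'}): 'AR',
    #  frozenset({'AR', 'MH', 'ST'}): 'AR_ST_MH',
    #  frozenset({'ST', 'UL'}): 'ST_UL',
    #  frozenset({'UL'}): 'UL'}
    # set of roles held by each ID
    tmp = df.groupby('ID')['Role'].agg(frozenset)
    out = (df[['ID', 'Location']]
     .drop_duplicates()
     # keep each label whose roles are a subset of the ID's roles, one row per label
     .merge(tmp.apply(lambda s: [v for k, v in d.items() if k<=s]).explode(), on='ID')
    )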
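
To make the subset test (`k<=s`) in the code above easier to follow, here is a minimal plain-Python sketch of the same matching step without pandas. The helper name `matching_labels` and the variable `patterns` are hypothetical, introduced only for this illustration.

```python
tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']
# Same mapping as `d` in the answer: set of roles -> label.
patterns = {frozenset(s.split('_')): s for s in tlist}

def matching_labels(roles):
    """Return the tlist entries whose roles are all present in `roles`."""
    held = frozenset(roles)
    # `k <= held` is True when every role in k is also in held (subset test).
    return [label for k, label in patterns.items() if k <= held]

print(matching_labels(['AR', 'ST', 'MH']))  # ['AR', 'AR_ST_MH']
print(matching_labels(['UL', 'ST']))        # ['ST_UL', 'UL']
```

Because the comparison is done on sets, the order in which an ID's roles appear in the dataframe does not matter.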

The same logic works with strings, provided each ID's roles appear in the same order as in the tlist entries:

    tmp = df.groupby('ID')['Role'].agg('_'.join)
    out = (df[['ID', 'Location']]
     .drop_duplicates()
     .merge(tmp.apply(lambda s: [s2 for s2 in tlist if s2 in s]).explode(), on='ID')
    )
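
The order requirement for the string variant can be seen in a small plain-Python comparison. This is only an illustrative sketch: `string_matches` and `set_matches` are hypothetical helper names, and the shuffled role list is made-up input, not data from the question.

```python
tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']

def string_matches(roles):
    # Substring test on the joined roles, as in the string variant.
    joined = '_'.join(roles)
    return [s for s in tlist if s in joined]

def set_matches(roles):
    # Subset test on sets, as in the set variant.
    held = frozenset(roles)
    return [s for s in tlist if frozenset(s.split('_')) <= held]

same_order = ['AR', 'ST', 'MH']
shuffled = ['ST', 'AR', 'MH']   # same roles, different row order

print(string_matches(same_order))  # ['AR', 'AR_ST_MH']
print(string_matches(shuffled))    # ['AR']
print(set_matches(shuffled))       # ['AR', 'AR_ST_MH']
```

The substring test would also match if one role code happened to be contained in another, so the set variant is probably the safer default when the row order or the role codes are not under your control.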

Output:

          ID Location      Role
    0  13562   London        AR
    1  13562   London  AR_ST_MH
    2  89912   Madrid     ST_UL
    3  89912   Madrid        UL
    4  15673   Berlin        AR
    5  34467   Berlin        AR
    6  34777   Berlin     ST_UL
    7  34777   Berlin        UL

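NB. The search is O(n²), since each ID's set of roles is compared against every entry in tlist. You can improve this by keeping a dictionary of valid sets for each item. For example, for AR_ST_MH you don't have to search UL.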
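
One possible reading of that suggestion is an index from each role to the tlist entries that contain it, so that only entries sharing a role with the ID are tested at all. The sketch below is an assumption about what such a dictionary could look like, not the answer author's implementation; `by_role`, `order`, and `matching_labels` are hypothetical names.

```python
from collections import defaultdict

tlist = ['AR', 'AR_ST_MH', 'ST_UL', 'UL']
patterns = {s: frozenset(s.split('_')) for s in tlist}
order = {s: i for i, s in enumerate(tlist)}  # to report matches in tlist order

# Index: role -> tlist entries that contain that role.
by_role = defaultdict(list)
for label, roles in patterns.items():
    for role in roles:
        by_role[role].append(label)

def matching_labels(roles):
    held = frozenset(roles)
    # Only entries sharing at least one role with `held` can be subsets of it.
    candidates = {label for role in held for label in by_role.get(role, ())}
    matches = [label for label in candidates if patterns[label] <= held]
    return sorted(matches, key=order.__getitem__)

print(matching_labels(['ST', 'UL']))  # ['ST_UL', 'UL']
print(matching_labels(['AR']))        # ['AR']
```

With only four entries in tlist the gain is negligible; the index should matter mainly when tlist is long and each ID holds only a few roles.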
