June 5, 2023, 18:14:38

How to group by elements of a list

Question

I have a dataframe that resembles this:

81883       2011000011  ...  [South Sturgeon, Creek]
81884       2011000022  ...        [Meadowood]
81885       2011000016  ...   [South, Portage]
81886       2011000011  ...  [North Sturgeon, Creek]

I want to group rows that share a word in the last column, Locations (the words are the elements of each row's list). In the example above, I want to group the rows that contain Creek; rows that have no word in common with any other row should be kept as they are (or, better, have their words joined into a single string).

I tried using:

def get_grp(list_current_row, df, column_location):
    rows_index_to_groupby = []
    for string_element in list_current_row:
        for idx, row in enumerate(df[column_location].values):
            if row != list_current_row and string_element in row:
                rows_index_to_groupby.append(idx)
    return rows_index_to_groupby


grouped_dataframe = resulting_dataframe.groupby(
    lambda x: [
        resulting_dataframe[column_location][i]
        for i in get_grp(x, resulting_dataframe, column_location)
    ]
)

The desired output would be:

Locations
Creek             0  Creek       81886       2011000011  ...
                  1  Creek       81883       2011000011  ...
South, Portage    2  South, Portage      81885       2011000016  ...
Meadowood         3  Meadowood       81884    2011000022

Answer 1

Score: 0

<details>
<summary>English:</summary>

While this is not exactly what was asked, the following might get you what you want. It simply extracts the last element of each location list and assigns it to the index:

import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886],
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"], ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

df.index = df.location.str[-1]
print(df)


yields

           number        date                 location
location
Creek       81883  2011000011  [South Sturgeon, Creek]
Meadowood   81884  2011000022              [Meadowood]
Portage     81885  2011000016         [South, Portage]
Creek       81886  2011000011  [North Sturgeon, Creek]


Now you can simply get all Creek entries with e.g.

df.loc['Creek']
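
Note that this lookup only matches rows whose last list element is Creek. If the word may sit anywhere in the list, a boolean mask is one alternative. The following is a minimal sketch that rebuilds the same sample data; the names `has_creek` and `creek_rows` are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "number": [81883, 81884, 81885, 81886],
    "date": ["2011000011", "2011000022", "2011000016", "2011000011"],
    "location": [["South Sturgeon", "Creek"], ["Meadowood"],
                 ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

# Boolean mask: True where "Creek" occurs anywhere in the row's list,
# not only in the last position.
has_creek = df["location"].apply(lambda words: "Creek" in words)

creek_rows = df[has_creek]
print(creek_rows["number"].tolist())
```
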


Since the index has the same name as a column, "location", you may want to rename the index:

df.index.names = ['primary_location']


and for grouped operations, you can then do e.g.

df.groupby('primary_location')['number'].sum()

primary_location
Creek        163769
Meadowood     81884
Portage       81885
Name: number, dtype: int64
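
The approach above keys on the last list element only. For the grouping described in the question (group rows that share any word, otherwise keep the row's words joined as a string), a minimal sketch could look like the following. The helper `group_key`, the `group` column, and the choice to use the first shared word as the key are assumptions made for illustration; they are not part of the original answer.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "number": [81883, 81884, 81885, 81886],
    "date": ["2011000011", "2011000022", "2011000016", "2011000011"],
    "location": [["South Sturgeon", "Creek"], ["Meadowood"],
                 ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

# Count in how many rows each word appears. set() makes sure a word
# repeated within a single row is only counted once for that row.
word_counts = Counter(w for words in df["location"] for w in set(words))


def group_key(words):
    """Return the first word shared with another row, else the joined list."""
    shared = [w for w in words if word_counts[w] > 1]
    return shared[0] if shared else ", ".join(words)


# Hypothetical helper column holding the grouping key for each row.
df["group"] = df["location"].apply(group_key)

# Collect the numbers that fall into each group.
groups = df.groupby("group")["number"].apply(list).to_dict()
print(groups)
```

With this sample data, the two Creek rows end up in one group, while the remaining rows form their own groups keyed by `Meadowood` and `South, Portage`. A row that shares different words with different rows is assigned to the first shared word only, which may or may not be the desired behavior.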



</details>
