How to group by elements of a list

Question

I have a dataframe that resembles this:

81883       2011000011  ...  [South Sturgeon, Creek]
81884       2011000022  ...        [Meadowood]
81885       2011000016  ...   [South, Portage]
81886       2011000011  ...  [North Sturgeon, Creek]

I want to group rows that share a common word in the last column, named Locations (each value is a list of words obtained by splitting on ','). In the example above, I want to group by Creek. Rows that share no word with any other row should be kept as they are (or, better, keyed by their words joined into a string).

I tried using:

def get_grp(list_current_row, df, column_location):
    rows_index_to_groupby = []
    for string_element in list_current_row:
        for idx, row in enumerate(df[column_location].values):
            if row != list_current_row and string_element in row:
                rows_index_to_groupby.append(idx)
    return rows_index_to_groupby


grouped_dataframe = resulting_dataframe.groupby(lambda x: [resulting_dataframe[column_location][i] for i in get_grp(x, resulting_dataframe, column_location)])

The desired output would be:

Locations
Creek             0  Creek       81886       2011000011  ...
                  1  Creek       81883       2011000011  ...
South, Portage    2  South, Portage      81885       2011000016  ...
Meadowood         3  Meadowood       81884    2011000022

Answer 1

Score: 0

While not exactly what is asked, the following might get you what you want. It simply extracts the last element of each location list and assigns it to the index:

```python
import pandas as pd

df = pd.DataFrame({
    'number': [81883, 81884, 81885, 81886],
    'date': ["2011000011", "2011000022", "2011000016", "2011000011"],
    'location': [["South Sturgeon", "Creek"], ["Meadowood"], ["South", "Portage"], ["North Sturgeon", "Creek"]],
})

# .str[-1] works element-wise on lists too, taking the last item of each list
df.index = df.location.str[-1]
print(df)
```

This yields:

```
           number        date                 location
location                                              
Creek       81883  2011000011  [South Sturgeon, Creek]
Meadowood   81884  2011000022              [Meadowood]
Portage     81885  2011000016         [South, Portage]
Creek       81886  2011000011  [North Sturgeon, Creek]
```

Now you can get all Creek entries with, for example:

```python
df.loc['Creek']
```

Since the index has the same name as a column, "location", you may want to rename the index:

```python
df.index.names = ['primary_location']
```

For grouped operations, you can then do, for example:

```python
df.groupby('primary_location')['number'].sum()
```

```
primary_location
Creek        163769
Meadowood     81884
Portage       81885
Name: number, dtype: int64
```
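The answer above keys each row on the last list element, which only works when the shared word happens to come last. A closer match to the question might be to key each row on a word that actually occurs in more than one row, and to fall back to the joined words otherwise. The following is a minimal sketch of that idea in plain Python, not the answerer's method. The helper name `shared_word_keys` is hypothetical, and the tie-breaking rule (most frequent shared word wins, earliest in the row on ties) is an assumption, since the question does not say what should happen when a row shares several words.

```python
from collections import Counter


def shared_word_keys(rows):
    """Return one group key per row (hypothetical helper, not a pandas API).

    A row's key is a word it shares with at least one other row; rows that
    share nothing are keyed by their own words joined with ", ".
    """
    # Count in how many rows each word appears; set() avoids counting a
    # word twice when it is repeated inside a single row.
    counts = Counter(word for words in rows for word in set(words))

    keys = []
    for words in rows:
        shared = [w for w in words if counts[w] > 1]
        if shared:
            # Assumption: prefer the most frequent shared word; max() keeps
            # the earliest candidate when several have the same count.
            keys.append(max(shared, key=counts.__getitem__))
        else:
            keys.append(", ".join(words))
    return keys


locations = [
    ["South Sturgeon", "Creek"],
    ["Meadowood"],
    ["South", "Portage"],
    ["North Sturgeon", "Creek"],
]
keys = shared_word_keys(locations)
print(keys)  # ['Creek', 'Meadowood', 'South, Portage', 'Creek']
```

With a DataFrame like the one in the answer, the keys could be stored in a new column, for example `df['group'] = shared_word_keys(df['location'])`, and then used with `df.groupby('group')`. Note that this sketch does not merge groups transitively: two rows linked only through a third row may still end up under different keys.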



Posted by huangapple on June 5, 2023, 18:14:38
Original link: https://go.coder-hub.com/76405413.html