February 6, 2023, 18:36:48

aggregate pivot in pandas with multiple repeated fields

Question

I have a dataframe that looks like this:

id       Field_name  Field_value
1           consent          yes
1   _REACTION TIME_         5547
1              age            24
1           gender             X
1   _REACTION TIME_        45396
1         education          uni
1          language           EN
1   _REACTION TIME_       105187
2           consent          yes
2   _REACTION TIME_         3547
2              age            25
2           gender             F
2   _REACTION TIME_        42396
2         education          uni
2          language           EU
2   _REACTION TIME_       115427

and I would like to have it as a row per id, with every _REACTION TIME_ row being a different column, such as:

id  consent  _REACTION TIME_1  age gender  _REACTION TIME_2  education language _REACTION TIME_3
1       yes              5547   24      X             45396        uni       EN           105187
2       yes              3547   25      F             42396        uni       EU           115427

I have been looking all over SO for an answer, but I can't find one for this particular case, where only some of the entries are repeated, and those are repeated multiple times.

Thanks in advance!

Answer 1

Score: 2

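Number the repeated names with GroupBy.cumcount, but only on the rows that DataFrame.duplicated flags as repeated, so that every (id, Field_name) pair becomes unique and DataFrame.pivot can reshape the data. Finally, use DataFrame.reindex to restore the original column order: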

m = df.duplicated(['id', 'Field_name'], keep=False)
df.loc[m, 'Field_name'] += df[m].groupby(['id', 'Field_name']).cumcount().add(1).astype(str)
cols = df['Field_name'].unique()
df = df.pivot(index='id', columns='Field_name', values='Field_value').reindex(cols, axis=1)
print(df)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education  \
id                                                                          
1              yes             5547  24      X            45396       uni   
2              yes             3547  25      F            42396       uni   
Field_name language _REACTION TIME_3  
id                                    
1                EN           105187  
2                EU           115427
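
To make the intermediate steps visible, here is a minimal sketch that rebuilds the question's sample data and inspects the mask and the counter before any pivoting. The construction of `df` is an assumption (in particular, every value is stored as a string), and `counter` and `new_names` are illustrative names that do not appear in the answer.

```python
import pandas as pd

# Sample data rebuilt from the question; storing every value as a string is an assumption.
names = ['consent', '_REACTION TIME_', 'age', 'gender',
         '_REACTION TIME_', 'education', 'language', '_REACTION TIME_']
df = pd.DataFrame({
    'id': [1] * 8 + [2] * 8,
    'Field_name': names * 2,
    'Field_value': ['yes', '5547', '24', 'X', '45396', 'uni', 'EN', '105187',
                    'yes', '3547', '25', 'F', '42396', 'uni', 'EU', '115427'],
})

# True for every row whose (id, Field_name) pair occurs more than once.
m = df.duplicated(['id', 'Field_name'], keep=False)

# 1-based position of each row within its (id, Field_name) group.
counter = df.groupby(['id', 'Field_name']).cumcount().add(1)

# Append the counter only to the repeated names; unique names stay untouched.
new_names = df['Field_name'].where(~m, df['Field_name'] + counter.astype(str))

print(m.sum())                      # number of rows flagged as repeated
print(new_names.unique().tolist())  # names in order of first appearance
```

Once the names are unique within each id, the pivot has exactly one value per (id, column) cell, which is what it requires.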

A similar solution that avoids overwriting the original DataFrame:

m = df.duplicated(['id', 'Field_name'], keep=False)
s = df['Field_name'].add(df.groupby(['id', 'Field_name']).cumcount().add(1)
                           .astype(str)).where(m, df['Field_name'])
df1 = (df.assign(Field_name=s)
        .pivot(index='id', columns='Field_name', values='Field_value')
        .reindex(s.unique(), axis=1))
print(df1)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education  \
id                                                                          
1              yes             5547  24      X            45396       uni   
2              yes             3547  25      F            42396       uni   
Field_name language _REACTION TIME_3  
id                                    
1                EN           105187  
2                EU           115427
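
The non-overwriting variant can be wrapped in a small helper if it is needed in more than one place. The sketch below is one possible packaging, not part of the answer: the function name `pivot_with_repeats`, its parameters, and the shortened sample frame are all hypothetical. It also shows that a plain pivot on the unmodified data raises a ValueError, which is why the renaming step is needed.

```python
import pandas as pd

def pivot_with_repeats(df, index='id', names='Field_name', values='Field_value'):
    """Pivot long data to wide, numbering names that repeat within one index value."""
    keys = [index, names]
    repeated = df.duplicated(keys, keep=False)
    counter = df.groupby(keys).cumcount().add(1).astype(str)
    # Keep unique names as they are; add a 1-based suffix to repeated ones.
    labels = df[names].where(~repeated, df[names] + counter)
    return (df.assign(**{names: labels})
              .pivot(index=index, columns=names, values=values)
              .reindex(labels.unique(), axis=1))

# Shortened sample in the shape of the question's data (values as strings is an assumption).
df = pd.DataFrame({
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'Field_name': ['consent', '_REACTION TIME_', 'age', '_REACTION TIME_'] * 2,
    'Field_value': ['yes', '5547', '24', '45396', 'yes', '3547', '25', '42396'],
})

# Without renaming, pivot cannot place two values in the same cell.
failed = False
try:
    df.pivot(index='id', columns='Field_name', values='Field_value')
except ValueError as err:
    failed = True
    print('plain pivot failed:', err)

out = pivot_with_repeats(df)
print(out)
```

Note that the suffix is added only where a name repeats within the same id, so an id with a single occurrence of a name keeps it unnumbered and lands in a separate column from the numbered ones.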

Answer 2

Score: 1

If you want to keep _REACTION TIME_ as the column header instead of renaming it to _REACTION TIME_1 and so on, you can use groupby.apply:

out = (df.groupby('id').apply(lambda g: g.drop('id', axis=1).set_index('Field_name').T)
       .reset_index(level=0).reset_index(drop=True)
       .rename_axis('', axis=1))

print(out)
   id consent _REACTION TIME_ age gender _REACTION TIME_ education language _REACTION TIME_
0   1     yes            5547  24      X           45396       uni       EN          105187
1   2     yes            3547  25      F           42396       uni       EU          115427
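
One consequence of keeping repeated column labels is worth seeing before choosing this approach. The sketch below uses a small hand-built frame that only mirrors the shape of `out` (the frame and the names `sel` and `second` are illustrative assumptions) to show how selection behaves when a label is not unique.

```python
import pandas as pd

# A small frame with a repeated column label, mirroring the shape of `out` above.
out = pd.DataFrame([['yes', '5547', '45396'], ['yes', '3547', '42396']],
                   columns=['consent', '_REACTION TIME_', '_REACTION TIME_'])

# Selecting a repeated label returns a DataFrame holding every matching column,
# not a single Series.
sel = out['_REACTION TIME_']
print(type(sel).__name__, sel.shape)

# Positional access still addresses one specific column.
second = out.iloc[:, 2]
print(second.tolist())
```

If later code needs to address each reaction time by name, the numbered columns from Answer 1 are likely easier to work with.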

