Aggregate pivot in pandas with multiple repeated fields

Question

I have a dataframe that looks like this:

  id  Field_name       Field_value
  1   consent          yes
  1   _REACTION TIME_  5547
  1   age              24
  1   gender           X
  1   _REACTION TIME_  45396
  1   education        uni
  1   language         EN
  1   _REACTION TIME_  105187
  2   consent          yes
  2   _REACTION TIME_  3547
  2   age              25
  2   gender           F
  2   _REACTION TIME_  42396
  2   education        uni
  2   language         EU
  2   _REACTION TIME_  115427

I would like to have one row per id, with every _REACTION TIME_ row becoming a separate column, such as:

  id  consent  _REACTION TIME_1  age  gender  _REACTION TIME_2  education  language  _REACTION TIME_3
  1   yes      5547              24   X       45396             uni        EN        105187
  2   yes      3547              25   F       42396             uni        EU        115427

I have been looking all over SO for an answer, but I can't find one for this particular case, where only some of the entries are repeated and those are repeated multiple times.

Thanks in advance!

Answer 1

Score: 2

Use GroupBy.cumcount only for the rows that DataFrame.duplicated flags as duplicates, so that every Field_name becomes unique per id and pivoting with DataFrame.pivot is possible. Finally, add DataFrame.reindex to restore the original order of the columns:

  # Flag every (id, Field_name) pair that occurs more than once
  m = df.duplicated(['id', 'Field_name'], keep=False)
  # Append a 1-based counter to the repeated names only
  df.loc[m, 'Field_name'] += df[m].groupby(['id', 'Field_name']).cumcount().add(1).astype(str)
  # Remember the order in which the names first appear
  cols = df['Field_name'].unique()
  df = df.pivot(index='id', columns='Field_name', values='Field_value').reindex(cols, axis=1)
  print(df)

  Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education  \
  id
  1              yes             5547  24      X            45396       uni
  2              yes             3547  25      F            42396       uni

  Field_name language _REACTION TIME_3
  id
  1                EN           105187
  2                EU           115427
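To try the steps above end to end, here is a self-contained sketch. It assumes the sample data from the question is typed in by hand with all values stored as strings; the names `rows`, `counter` and `wide` are only illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data retyped from the question (assumption: values are strings).
rows = [
    (1, "consent", "yes"), (1, "_REACTION TIME_", "5547"), (1, "age", "24"),
    (1, "gender", "X"), (1, "_REACTION TIME_", "45396"), (1, "education", "uni"),
    (1, "language", "EN"), (1, "_REACTION TIME_", "105187"),
    (2, "consent", "yes"), (2, "_REACTION TIME_", "3547"), (2, "age", "25"),
    (2, "gender", "F"), (2, "_REACTION TIME_", "42396"), (2, "education", "uni"),
    (2, "language", "EU"), (2, "_REACTION TIME_", "115427"),
]
df = pd.DataFrame(rows, columns=["id", "Field_name", "Field_value"])

# True for every row whose (id, Field_name) pair occurs more than once.
m = df.duplicated(["id", "Field_name"], keep=False)

# 1-based occurrence counter within each repeated (id, Field_name) group.
counter = df[m].groupby(["id", "Field_name"]).cumcount().add(1).astype(str)

# Suffix only the repeated names; unique names stay untouched.
df.loc[m, "Field_name"] += counter

# unique() keeps first-appearance order, which reindex then restores,
# because pivot sorts the new column labels.
cols = df["Field_name"].unique()
wide = (df.pivot(index="id", columns="Field_name", values="Field_value")
          .reindex(cols, axis=1))
print(wide.columns.tolist())
```

The key point is that pivot needs each (id, Field_name) pair to be unique, and the counter suffix is what makes the repeated names unique.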

A solution that avoids overwriting the original DataFrame is similar:

  # Flag every (id, Field_name) pair that occurs more than once
  m = df.duplicated(['id', 'Field_name'], keep=False)
  # Build the suffixed names in a separate Series; unique names are kept as they are
  s = (df['Field_name']
         .add(df.groupby(['id', 'Field_name']).cumcount().add(1).astype(str))
         .where(m, df['Field_name']))
  df1 = (df.assign(Field_name=s)
           .pivot(index='id', columns='Field_name', values='Field_value')
           .reindex(s.unique(), axis=1))
  print(df1)

  Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education  \
  id
  1              yes             5547  24      X            45396       uni
  2              yes             3547  25      F            42396       uni

  Field_name language _REACTION TIME_3
  id
  1                EN           105187
  2                EU           115427
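If this is needed in more than one place, the non-mutating variant can be wrapped in a function. The sketch below is only an illustration: the helper name `widen_repeated_fields` is hypothetical, and the small frame with the short field name `rt` is made up to keep the example compact.

```python
import pandas as pd

def widen_repeated_fields(df):
    """Hypothetical helper: one row per id, repeated field names numbered 1, 2, ..."""
    keys = ["id", "Field_name"]
    m = df.duplicated(keys, keep=False)
    n = df.groupby(keys).cumcount().add(1).astype(str)
    # Suffix the counter where the name repeats, keep the plain name elsewhere.
    s = df["Field_name"].add(n).where(m, df["Field_name"])
    return (df.assign(Field_name=s)
              .pivot(index="id", columns="Field_name", values="Field_value")
              .reindex(s.unique(), axis=1))

# Small made-up frame, not the data from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "Field_name": ["consent", "rt", "rt", "consent", "rt", "rt"],
    "Field_value": ["yes", "5547", "45396", "yes", "3547", "42396"],
})

df1 = widen_repeated_fields(df)
print(df1)
# The input frame still carries the original, unsuffixed names.
print(df["Field_name"].tolist())
```

Because `assign` returns a new frame, `df` can still be used afterwards with its original field names.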

Answer 2

Score: 1

If you want to keep _REACTION TIME_ as the column header instead of renaming it to _REACTION TIME_1, you can use groupby.apply:

  # Turn each id's rows into a single wide row, keeping the repeated names as they are
  out = (df.groupby('id').apply(lambda g: g.drop('id', axis=1).set_index('Field_name').T)
           .reset_index(level=0).reset_index(drop=True)
           .rename_axis('', axis=1))

  print(out)

     id consent _REACTION_TIME_ age gender _REACTION_TIME_ education language _REACTION_TIME_
  0   1     yes            5547  24      X           45396       uni       EN          105187
  1   2     yes            3547  25      F           42396       uni       EU          115427
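A different way to keep the repeated header, without groupby.apply, is to pivot on each row's position within its id and then relabel the columns. This is a swapped-in technique rather than the method of the answer above, and it rests on a loud assumption: every id lists its fields in the same order. The frame and the short field name `rt` below are made up for illustration.

```python
import pandas as pd

# Small made-up frame; each id has the same field sequence (assumption).
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Field_name": ["consent", "rt", "age", "rt"] * 2,
    "Field_value": ["yes", "5547", "24", "45396", "yes", "3547", "25", "42396"],
})

# Position of each row within its id: 0, 1, 2, ...
pos = df.groupby("id").cumcount()

# Pivot on the position, which is unique per id, instead of the repeated names.
out = df.assign(pos=pos).pivot(index="id", columns="pos", values="Field_value")

# Relabel the positions with the field names of the first id.
first_id = df["id"].iloc[0]
out.columns = df.loc[df["id"] == first_id, "Field_name"].tolist()
out = out.reset_index()
print(out)
```

Note that with duplicate column labels, selecting `out["rt"]` returns a DataFrame holding both columns rather than a single Series, which is one reason the numbered names from Answer 1 are often easier to work with.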

huangapple
  • Posted on February 6, 2023, 18:36:48
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75360168.html