Pandas DataFrame.groupby().agg() issue

Question

I'm trying to use pandas groupby().agg() but I have some issues.

    Date                Year  Month  Week  Hour    A    B       C    D        E    F    G  ..
    mercoledì 5 aprile  2023      4    14     5    6    6  144,79    0   868,74    6   36
    mercoledì 5 aprile  2023      4    14     6  214  214  144,79    0  30985,0    6  214
    mercoledì 5 aprile  2023      4    14     6    6    6  144,79    0   868,74    6   36
    mercoledì 5 aprile  2023      4    14     7  220  220  180,26    0  39657,2  220   48
    mercoledì 5 aprile  2023      4    14     7  100  100  180,26  146    18026  100   10
    mercoledì 5 aprile  2023      4    14     8  220  220   225,2    0    49544  220   48
    mercoledì 5 aprile  2023      4    14     8   57   57     2,2  146   129,38    6   57

I have to sum some columns (that works with agg({'column': 'sum'})), compute a weighted average on others, with the weights stored in another column, and take the mean() of the remaining columns.

    df = df.groupby(['Date', 'Hour']).agg({'A': 'sum',
                                           'B': 'sum',
                                           'C': ...,  # weighted average?
                                           'D': 'sum',
                                           'E': 'mean'}).reset_index()

I want the weighted average of C, with the values in column B as the weights.

Then, for the columns not listed in .agg() (F, G, and so on; I have many of them), I want to apply .mean(), so that all columns are kept in the end.

Can you help me?
Thank you

I tried the code above, but I don't know how to write the weighted-average part properly.

Answer 1

Score: 1

Something like this:

    import pandas as pd

    data = {
        "Date": ["mercoledì 5 aprile 2023"] * 6,
        "Year": [2023] * 6,
        "Month": [4] * 6,
        "Week": [14] * 6,
        "Hour": [5, 6, 6, 7, 7, 8],
        "A": [6, 214, 6, 220, 100, 220],
        "B": [6, 214, 6, 220, 100, 57],
        "C": [144.79, 144.79, 144.79, 180.26, 180.26, 2.2],
        "D": [0, 0, 0, 0, 146, 146],
        "E": [868.74, 30985.0, 868.74, 39657.2, 18026.0, 129.38],
        "F": [6, 214, 36, 48, 10, 57],
        "G": [36, 214, 36, 48, 10, 57],
    }
    df = pd.DataFrame(data)
    print(df)

    # Sum A, B and D, average E, and take the mean of C weighted by B.
    # The lambda receives one group's C values at a time, so the matching
    # weights are looked up in B through that group's index.
    result = df.groupby(['Date', 'Hour']).agg({
        'A': 'sum',
        'B': 'sum',
        'C': lambda x: (x * df.loc[x.index, 'B']).sum() / df.loc[x.index, 'B'].sum(),
        'D': 'sum',
        'E': 'mean'
    })
    print(result.reset_index())

Output of the last print:

                          Date  Hour    A    B       C    D         E
    0  mercoledì 5 aprile 2023     5    6    6  144.79    0    868.74
    1  mercoledì 5 aprile 2023     6  220  220  144.79    0  15926.87
    2  mercoledì 5 aprile 2023     7  320  320  180.26  146  28841.60
    3  mercoledì 5 aprile 2023     8  220   57    2.20  146    129.38

Answer 2

Score: 0

You cannot compute a weighted average with agg directly, because agg hands each column to its function separately while a weighted average needs two columns (the values and the weights)*.

One way is to pre- and post-process the computation. The weighted average is equal to sum(C*B)/sum(B):

    out = (df.eval('C = C*B')             # pre-process: C now holds C*B
             .groupby(['Date', 'Hour'])
             .agg({'A': 'sum',
                   'B': 'sum',
                   'C': 'sum',            # sum(C*B) per group
                   'D': 'sum',
                   'E': 'mean'})
             .eval('C = C/B')             # post-process: sum(C*B) / sum(B)
             .reset_index()
          )

NB. If you were already computing a different aggregation with B/C you would need to use copies of them.
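
As a rough sketch of the "copies" idea, the product can go into a helper column so that B and C stay available for other aggregations. The sample values and the column names `BC`, `C_mean` and `C_wavg` below are made up for illustration; they are not from the question.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "Date": ["d1"] * 4,
    "Hour": [6, 6, 7, 7],
    "B": [214, 6, 220, 100],
    "C": [100.0, 200.0, 10.0, 20.0],
})

out = (
    df.assign(BC=df["B"] * df["C"])              # helper column, C itself is untouched
      .groupby(["Date", "Hour"], as_index=False)
      .agg(B=("B", "sum"),                       # sum of the weights
           C_mean=("C", "mean"),                 # plain mean of C, still possible
           BC=("BC", "sum"))                     # sum(B*C) per group
)
out["C_wavg"] = out["BC"] / out["B"]             # weighted mean = sum(B*C) / sum(B)
out = out.drop(columns="BC")                     # the helper is no longer needed
print(out)
```

Here `C_mean` is the plain mean of C and `C_wavg` the B-weighted one, side by side in the same result.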

To handle all columns you can use a dictionary:

    # default: 'mean' for every column except the grouping keys
    d = {c: 'mean' for c in df.columns.difference(['Date', 'Hour'])}
    # override: these columns are summed (C holds C*B at that point)
    for c in ['A', 'B', 'C', 'D']:
        d[c] = 'sum'

    out = (df.eval('C = C*B')
             .groupby(['Date', 'Hour'], as_index=False)
             .agg(d)
             .eval('C = C/B')
          )
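
One thing to keep in mind with the dictionary above: `Index.difference` returns the labels sorted, so the output columns may come back in a different order than in df. If the original order matters, a variation along these lines might help. The sample frame is made up, and the names `keys` and `sum_cols` are only for illustration.

```python
import pandas as pd

# Made-up sample data; the columns are deliberately not in alphabetical order.
df = pd.DataFrame({
    "Date": ["d1"] * 4,
    "Hour": [6, 6, 7, 7],
    "G": [1.0, 3.0, 5.0, 7.0],
    "A": [1, 2, 3, 4],
    "B": [214, 6, 220, 100],
    "C": [100.0, 200.0, 10.0, 20.0],
    "F": [2.0, 4.0, 6.0, 8.0],
})

keys = ["Date", "Hour"]
sum_cols = ["A", "B", "C"]

# Default every non-key column to 'mean', keeping the frame's own column order...
d = {c: "mean" for c in df.columns if c not in keys}
# ...then override the columns that should be summed (positions are unchanged).
d.update({c: "sum" for c in sum_cols})

out = (
    df.assign(C=df["C"] * df["B"])       # pre-process: C becomes C*B
      .groupby(keys, as_index=False)
      .agg(d)
)
out["C"] = out["C"] / out["B"]           # post-process: sum(C*B) / sum(B)
print(out)
```

The result keeps every column of df, with F and G averaged because they only ever received the default.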

* You can, however, compute the weighted average with groupby.apply, but this should be done as a separate operation.
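
For reference, the groupby.apply route mentioned in the footnote could look roughly like this. It is a sketch on made-up data: the helper name `weighted_mean_c` is hypothetical, and `numpy.average` is used here as one convenient way to compute the weighted mean (it raises an error if a group's weights sum to zero).

```python
import numpy as np
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "Date": ["d1"] * 4,
    "Hour": [6, 6, 7, 7],
    "A": [1, 2, 3, 4],
    "B": [214, 6, 220, 100],
    "C": [100.0, 200.0, 10.0, 20.0],
})
keys = ["Date", "Hour"]


def weighted_mean_c(group):
    """Mean of C weighted by B for one group (hypothetical helper)."""
    return np.average(group["C"], weights=group["B"])


# Separate operation: one weighted mean per group, returned as a Series.
wavg = (
    df.groupby(keys)[["B", "C"]]
      .apply(weighted_mean_c)
      .rename("C")
      .reset_index()
)

# The ordinary aggregations are computed on their own and merged back on the keys.
sums = df.groupby(keys, as_index=False).agg({"A": "sum"})
out = sums.merge(wavg, on=keys)
print(out)
```

This keeps C intact in df, at the cost of a second pass over the groups and a merge.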

huangapple
  • Posted on 2023-08-10 20:08:44
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76875612.html