Pandas DataFrame.groupby().agg() issue
Question
I'm trying to use pandas groupby().agg() but I have some issues.
Date                Year  Month  Week  Hour  A    B    C       D    E        F    G   ..
mercoledì 5 aprile  2023  4      14    5     6    6    144,79  0    868,74   6    36
mercoledì 5 aprile  2023  4      14    6     214  214  144,79  0    30985,0  6    214
mercoledì 5 aprile  2023  4      14    6     6    6    144,79  0    868,74   6    36
mercoledì 5 aprile  2023  4      14    7     220  220  180,26  0    39657,2  220  48
mercoledì 5 aprile  2023  4      14    7     100  100  180,26  146  18026    100  10
mercoledì 5 aprile  2023  4      14    8     220  220  225,2   0    49544    220  48
mercoledì 5 aprile  2023  4      14    8     57   57   2,2     146  129,38   6    57
I have to sum some columns (which works with agg({'column': 'sum'})), compute a weighted average on others, with the weights stored in another column, and take the mean() of the remaining columns.
df = df.groupby(['Date','Hour']).agg({'A':'sum',
                                      'B':'sum',
                                      'C': weighted average?,
                                      'D':'sum',
                                      'E':'mean'}).reset_index()
I want to compute the weighted average of C, with the values in column B as weights.
Then, for the columns not listed in .agg() (F, G, and so on; I have many of them), I want to apply .mean(), keeping all the columns in the end.
I don't know how to code this properly. Can you help me?
Thank you
Answer 1
Score: 1
Something like this:
import pandas as pd
data = {
"Date": ["mercoledì 5 aprile 2023", "mercoledì 5 aprile 2023", "mercoledì 5 aprile 2023", "mercoledì 5 aprile 2023", "mercoledì 5 aprile 2023", "mercoledì 5 aprile 2023"],
"Year": [2023, 2023, 2023, 2023, 2023, 2023],
"Month": [4, 4, 4, 4, 4, 4],
"Week": [14, 14, 14, 14, 14, 14],
"Hour": [5, 6, 6, 7, 7, 8],
"A": [6, 214, 6, 220, 100, 220],
"B": [6, 214, 6, 220, 100, 57],
"C": [144.79, 144.79, 144.79, 180.26, 180.26, 2.2],
"D": [0, 0, 0, 0, 146, 146],
"E": [868.74, 30985.0, 868.74, 39657.2, 18026.0, 129.38],
"F": [6, 214, 36, 48, 10, 57],
"G": [36, 214, 36, 48, 10, 57],
}
df = pd.DataFrame(data)
print(df)
# Sum A, B and D, take the mean of E, and compute the weighted mean of C using B as weights.
# The lambda receives one group's values of C; x.index selects the matching rows of B.
result = df.groupby(['Date','Hour']).agg({
    'A': 'sum',
    'B': 'sum',
    'C': lambda x: (x * df.loc[x.index, 'B']).sum() / df.loc[x.index, 'B'].sum(),
    'D': 'sum',
    'E': 'mean'
})
print(result.reset_index())
Output:
                      Date  Hour    A    B       C    D         E
0  mercoledì 5 aprile 2023     5    6    6  144.79    0    868.74
1  mercoledì 5 aprile 2023     6  220  220  144.79    0  15926.87
2  mercoledì 5 aprile 2023     7  320  320  180.26  146  28841.60
3  mercoledì 5 aprile 2023     8  220   57    2.20  146    129.38
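The question also asks for the mean of every column that is not listed explicitly. A minimal sketch of one way this might be done with a per-group function follows. The sample values and the helper name `weighted_c` are made up for illustration and are not from the original post, and the lookup by index assumes the DataFrame index is unique.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "Date": ["d1", "d1", "d1", "d1"],
    "Hour": [6, 6, 7, 7],
    "A": [214, 6, 220, 100],
    "B": [1, 3, 2, 2],
    "C": [10.0, 20.0, 30.0, 50.0],
    "F": [6, 36, 48, 10],
    "G": [36, 36, 48, 10],
})

keys = ["Date", "Hour"]

def weighted_c(c):
    # c holds one group's values of C; fetch the matching rows of B by index.
    w = df.loc[c.index, "B"]
    return (c * w).sum() / w.sum()

# Default every non-key column to 'mean', then override the special cases.
spec = {col: "mean" for col in df.columns.difference(keys)}
spec.update({"A": "sum", "B": "sum", "C": weighted_c})

result = df.groupby(keys, as_index=False).agg(spec)
print(result)
```

Building the dictionary from `df.columns` means new columns are averaged without further changes, which may help when there are many of them.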
Answer 2
Score: 0
You cannot compute a weighted average with agg directly, as this requires two columns*.
One way would be to pre-/post-process the computation. The weighted average is equal to sum(C*B)/sum(B):
out = (df.eval('C = C*B')
.groupby(['Date', 'Hour'])
.agg({'A': 'sum',
'B': 'sum',
'C': 'sum',
'D': 'sum',
'E': 'mean'})
.eval('C = C/B')
.reset_index()
)
NB. If you were already computing a different aggregation with B or C, you would need to use copies of them.
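As a rough sketch of what using a copy could look like, assume you also want the plain mean of C next to the weighted one. The helper column `CB`, the output names `C_mean` and `C_wavg`, and the sample values are all hypothetical:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "Date": ["d1"] * 4,
    "Hour": [6, 6, 7, 7],
    "B": [1, 3, 2, 2],
    "C": [10.0, 20.0, 30.0, 50.0],
})

out = (df.assign(CB=df["C"] * df["B"])   # the product lives in a copy, C stays intact
         .groupby(["Date", "Hour"], as_index=False)
         .agg(B=("B", "sum"),
              C_mean=("C", "mean"),      # ordinary mean of the untouched C
              CB=("CB", "sum"))
         .assign(C_wavg=lambda d: d["CB"] / d["B"])  # sum(C*B) / sum(B)
         .drop(columns="CB"))
print(out)
```

Keeping the product in its own column leaves C available for any other aggregation in the same pass.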
To handle all columns you can use a dictionary:
# Default every non-key column to 'mean', then override the ones to sum
d = {c: 'mean' for c in df.columns.difference(['Date', 'Hour'])}
for c in ['A', 'B', 'C', 'D']:
    d[c] = 'sum'
out = (df.eval('C = C*B')
         .groupby(['Date', 'Hour'], as_index=False)
         .agg(d)
         .eval('C = C/B')
      )
* You can, however, compute the weighted average with groupby.apply, but this should be done as a separate operation.
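A minimal sketch of that separate groupby.apply step, assuming made-up sample data and using numpy's `np.average` for the weighting (a choice made here, not something the answer specifies):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "Date": ["d1"] * 4,
    "Hour": [6, 6, 7, 7],
    "A": [214, 6, 220, 100],
    "B": [1, 3, 2, 2],
    "C": [10.0, 20.0, 30.0, 50.0],
})

keys = ["Date", "Hour"]

# Step 1: weighted average of C per group, with B as the weights.
wavg = (df.groupby(keys)[["B", "C"]]
          .apply(lambda g: np.average(g["C"], weights=g["B"]))
          .rename("C"))

# Step 2: the ordinary aggregations, computed on their own.
sums = df.groupby(keys).agg({"A": "sum", "B": "sum"})

# Step 3: combine both results on the shared group index.
out = sums.join(wavg).reset_index()
print(out)
```

The apply call runs a Python function per group, so on large data the pre-/post-processing approach above is likely to be faster.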