英文:
Calculation problem after pandas grouping
问题
在pandas中,要实现按日期分组,在每个日期组中,将列A和B的每一行相乘,然后相加,再除以日期组中所有B列的总和,可以尝试以下方式:
(df.groupby('date')
.apply(lambda group: (group['A'] * group['B']).sum() / group['B'].sum())
.reset_index(name='result'))
这将返回一个包含日期、结果的DataFrame。
英文:
In pandas, how to achieve, grouped by date, in each date group, each row of column A and B multiplied and then summed, and then divided by the sum of all B columns in the date group.
I have tried:
(df.groupby('date')['A','B']
.transform(lambda x: (x['A'] * x['B']).sum())
.div(df.groupby('date')['B'].agg('sum')))
and:
(df.groupby('date')
.transform(lambda x: (x['A'] * x['B']).sum())
.div(df.groupby('date')['B'].agg('sum')))
both showed:
> KeyError: 'A'
答案1
得分: 1
你应该使用 .apply()
而不是 .transform()
,如下所示:
df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
通过使用 .apply()
,lambda
函数中的 x
变量将被识别为代表每个分组的 DataFrame
对象,可以分别索引各个列。
.transform()
方法只将 x
视为代表一次仅一个列的 Series
对象。
来自这个答案:
`apply` 和 `transform` 之间的两个主要区别
transform
和apply
groupby 方法之间有两个主要区别。
- 输入:
apply
隐式地将每个组的所有列作为DataFrame传递给自定义函数。- 而
transform
将每个组的每列分别作为Series传递给自定义函数。- 输出:
- 传递给**
apply
的自定义函数可以返回标量,或Series或DataFrame(甚至是numpy数组或列表)。- 传递给**
transform
的自定义函数必须返回一个序列**(长度与组相同的一维Series、数组或列表)。因此,
transform
仅逐一处理一列Series,而apply
一次处理整个DataFrame。
示例
示例数据:
import pandas as pd
data = {
'date': ['01/01/1999', '02/01/1999', '03/01/1999', '03/01/1999'],
'A': [2, 4, 7, 4],
'B': [5, 7, 9, 6]
}
df = pd.DataFrame(data)
print(df)
date A B
0 01/01/1999 2 5
1 02/01/1999 4 7
2 03/01/1999 7 9
3 03/01/1999 4 6
对于 03/01/1999
的计算将是:
((7 * 9) + (4 * 6)) / (9 + 6) # = 5.8
使用 .apply()
计算每个日期组的结果:
df_ab_calc = df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
print(df_ab_calc)
date
01/01/1999 2.0
02/01/1999 4.0
03/01/1999 5.8
英文:
You should use .apply()
instead of .transform()
, as follows:
df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
By using .apply()
, the x
variables in the lambda
function will be recognised as DataFrame
objects representing each group, which can each be indexed for individual columns.
The .transform()
method only treats x
as a Series
object representing only one column at a time.
From this answer:
<blockquote>
<h3>Two major differences between apply
and transform
</h3>
There are two major differences between the transform
and apply
groupby methods.
- Input:
apply
implicitly passes all the columns for each group as a DataFrame to the custom function.- while
transform
passes each column for each group individually as a Series to the custom function.
- Output:
- The custom function passed to
apply
can return a scalar, or a Series or DataFrame (or numpy array or even list). - The custom function passed to
transform
must return a sequence (a one dimensional Series, array or list) the same length as the group.
- The custom function passed to
So, transform
works on just one Series at a time and apply
works on the entire DataFrame at once.
</blockquote>
Example
Sample data:
import pandas as pd
data = {
'date': ['01/01/1999', '02/01/1999', '03/01/1999', '03/01/1999'],
'A': [2, 4, 7, 4],
'B': [5, 7, 9, 6]
}
df = pd.DataFrame(data)
print(df)
date A B
0 01/01/1999 2 5
1 02/01/1999 4 7
2 03/01/1999 7 9
3 03/01/1999 4 6
The calculation for '03/01/1999'
would be:
((7 * 9) + (4 * 6)) / (9 + 6) # = 5.8
Calculation for each date group using .apply()
:
df_ab_calc = df.groupby('date').apply(lambda x: (x['A'] * x['B']).sum() / x['B'].sum())
print(df_ab_calc)
date
01/01/1999 2.0
02/01/1999 4.0
03/01/1999 5.8
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论