How to pass a parameter to an aggregation function in Dask
Question
I just discovered today that the sum of a column full of NaNs is 0 in pandas and Dask (why?!). I need the sum of an all-NaN column to be NaN, because NaNs mean those values are missing, so their sum should be NaN as well.
From the documentation it appears that you have to pass min_count=1 to get this behaviour. However, I'm doing the sum inside an aggregation that looks like this:
ddf.groupby("code").aggregate({'rain': 'sum'}).compute()
Adding the min_count argument to the aggregate function seems to have no effect, while using a lambda in place of 'sum' causes an error.
Answer 1
Score: 1
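I finally found out how to do this with a custom aggregation. dd.Aggregation takes a name, a function applied to the groups within each partition (chunk), and a function that combines the per-partition results (agg), so min_count can be passed at both stages.
import dask.dataframe as dd
# We define our own sum that returns NaN for groups with no valid values
custom_sum = dd.Aggregation(
    'custom_sum',
    lambda s: s.sum(min_count=1),    # chunk: per-partition group sums
    lambda s0: s0.sum(min_count=1),  # agg: combine the partial sums
)
ddf.groupby("code").aggregate({'rain': custom_sum}).compute()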