How to pass a parameter to an aggregation function in Dask

Question

I just discovered today that the sum of a column full of NaNs is 0 in both pandas and Dask (why?!). I need the sum of all NaNs to be NaN: NaNs mean those values are missing, so their sum should be NaN as well.

From the documentation, it appears that you have to pass min_count=1 to get this behaviour. However, I'm doing the sum inside an aggregation that looks like this:

ddf.groupby("code").aggregate({'rain':'sum'}).compute()

Adding the min_count argument to the aggregate call seems to have no effect, while using a lambda in place of 'sum' raises an error.
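For reference, the behaviour described above can be reproduced in plain pandas. This is only a minimal sketch on a made-up Series (the data is an assumption, not taken from the question); it relies on the documented `min_count` parameter of `Series.sum`, whose default is 0.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a column that contains only missing values
s = pd.Series([np.nan, np.nan, np.nan])

# Default: NaNs are skipped and min_count=0, so the empty sum is 0.0
default_total = s.sum()

# min_count=1 requires at least one non-NaN value, otherwise the result is NaN
strict_total = s.sum(min_count=1)

print(default_total)
print(strict_total)
```

The second call is the behaviour the question is after; the difficulty is getting the same argument through a Dask groupby aggregation.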

Answer 1

Score: 1

I finally found out how to do this with a custom aggregation, using dd.Aggregation:

import dask.dataframe as dd

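# Custom sum that returns NaN for groups containing no valid values.
# The first lambda (chunk) runs on the groups of each partition;
# the second (agg) combines the per-partition results.
# min_count=1 is needed in both steps so that an all-NaN group stays NaN.
custom_sum = dd.Aggregation('custom_sum',
                            lambda s: s.sum(min_count=1),
                            lambda s0: s0.sum(min_count=1))

ddf.groupby("code").aggregate({'rain': custom_sum}).compute()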

huangapple
  • Posted on May 17, 2023, 18:46:15
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76271255.html