Polars 中的动态聚合

huangapple go评论85阅读模式
英文:

Dynamic Aggregation in Polars

问题

我想要检查我的数据框中一组列的空值百分比,这些列根据另一组列进行分组。

数据框中的列可能会变化,所以我希望能够传递一个要按组分组的列列表以及一个要计算空值的列列表。

轻松分组可以使用列名列表,但我找不到一种方法来对计数进行类似的操作。我只能为要计算空值的每一列编写一行代码:

```python
df = df.groupby(by=list_of_grouping_fields)\
    .agg([
        pl.count().alias('Row_Count'),
        pl.col("INVOICE NUMBER").null_count().alias('INVOICENUMBER_nullcount'),
        pl.col("ORDER NUMBER").null_count().alias('ORDERNUMBER_nullcount')
        ])

这似乎效率低下,如果列发生变化,我需要回来编辑代码。

理想情况下,我希望能做类似这样的事情:

df = df.groupby(by=list_of_grouping_fields)\
    .agg([
        pl.count().alias('Row_Count'),
        pl.col(list_of_agg_fields).null_count().alias(list_of_aliases)
        ])

在Polars中是否有一种方法可以实现这个目标?



<details>
<summary>英文:</summary>

I&#39;d like to check the percentage of nulls for a set of columns in my dataframe, grouped by a different set of columns.


The columns in the dataframe may change, so I&#39;d like to be able to pass in a list of columns to group by and a list of columns to count nulls for.

Grouping easily takes a list of column names, but I can&#39;t find a way to do this for the counts. I can only do it by writing a line for each column I want to count nulls for:

df = df.groupby(by=list_of_grouping_fields)
.agg([
pl.count().alias('Row_Count'),
pl.col("INVOICE NUMBER").null_count().alias('INVOICENUMBER_nullcount'),
pl.col("ORDER NUMBER").null_count().alias('ORDERNUMBER_nullcount')
])


This seems inefficient and would require me to come back and edit the code if the columns changed.

Ideally I&#39;d like to do something like this:

df = df.groupby(by=list_of_grouping_fields)
.agg([
pl.count().alias('Row_Count'),
pl.col([list_of_agg_fields]).null_count().alias([list_of_aliases])
])


Is there a way to do this with Polars?

</details>


# 答案1
**得分**: 1

[`agg`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.dataframe.groupby.GroupBy.agg.html#polars.dataframe.groupby.GroupBy.agg) 接受可变的位置参数和关键字参数,因此您应该能够执行以下操作:

```python
.agg(
  pl.count().alias('Row_Count'),
  *[pl.col(af).null_count().alias(al) for af, al in zip(list_of_agg_fields, list_of_aliases)]
)

或者,由于关键字参数是别名的快捷方式,您也可以这样做:

.agg(
  pl.count().alias('Row_Count'), # 或者 Row_Count=pl.count()
  **{al: af.null_count() for af, al in zip(list_of_agg_fields, list_of_aliases)}
)

几个月前曾经进行了大力改进,使许多核心函数的签名更加灵活,就像这样。请注意,groupby 也接受可变的位置参数,第一个参数可以是单个表达式或表达式的可迭代对象,这有点是之前这个努力的遗留物。

英文:

agg takes variadic positional args and keyword args, so you should be able to do

.agg(
  pl.count().alias(&#39;Row_Count&#39;),
  *[pl.col(af).null_count().alias(al) for af,al in zip(list_of_agg_fields, list_of_aliases)]
)

or, since keyword args is a shortcut for alias:

.agg(
  pl.count().alias(&#39;Row_Count&#39;), # or Row_Count=pl.count()
  **{al : af.null_count() for af,al in zip(list_of_agg_fields, list_of_aliases}
)

There was a big effort a few months back to make a lot of core function signatures be more flexible like this. Note that groupby takes variadic positional as well, the first argument being either a single expression or iterable of expressions is a bit of a leftover from before this effort.

huangapple
  • 本文由 发表于 2023年7月27日 22:30:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/76780759.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定