英文:
Dynamic Aggregation in Polars
问题
我想要检查我的数据框中一组列的空值百分比,这些列根据另一组列进行分组。
数据框中的列可能会变化,所以我希望能够传递一个要按组分组的列列表以及一个要计算空值的列列表。
轻松分组可以使用列名列表,但我找不到一种方法来对计数进行类似的操作。我只能为要计算空值的每一列编写一行代码:
```python
df = df.groupby(by=list_of_grouping_fields)\
.agg([
pl.count().alias('Row_Count'),
pl.col("INVOICE NUMBER").null_count().alias('INVOICENUMBER_nullcount'),
pl.col("ORDER NUMBER").null_count().alias('ORDERNUMBER_nullcount')
])
这似乎效率低下,如果列发生变化,我需要回来编辑代码。
理想情况下,我希望能做类似这样的事情:
df = df.groupby(by=list_of_grouping_fields)\
.agg([
pl.count().alias('Row_Count'),
pl.col(list_of_agg_fields).null_count().alias(list_of_aliases)
])
在Polars中是否有一种方法可以实现这个目标?
<details>
<summary>英文:</summary>
I'd like to check the percentage of nulls for a set of columns in my dataframe, grouped by a different set of columns.
The columns in the dataframe may change, so I'd like to be able to pass in a list of columns to group by and a list of columns to count nulls for.
Grouping easily takes a list of column names, but I can't find a way to do this for the counts. I can only do it by writing a line for each column I want to count nulls for:
df = df.groupby(by=list_of_grouping_fields)
.agg([
pl.count().alias('Row_Count'),
pl.col("INVOICE NUMBER").null_count().alias('INVOICENUMBER_nullcount'),
pl.col("ORDER NUMBER").null_count().alias('ORDERNUMBER_nullcount')
])
This seems inefficient and would require me to come back and edit the code if the columns changed.
Ideally I'd like to do something like this:
df = df.groupby(by=list_of_grouping_fields)
.agg([
pl.count().alias('Row_Count'),
pl.col([list_of_agg_fields]).null_count().alias([list_of_aliases])
])
Is there a way to do this with Polars?
</details>
# 答案1
**得分**: 1
[`agg`](https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.dataframe.groupby.GroupBy.agg.html#polars.dataframe.groupby.GroupBy.agg) 接受可变的位置参数和关键字参数,因此您应该能够执行以下操作:
```python
.agg(
pl.count().alias('Row_Count'),
*[pl.col(af).null_count().alias(al) for af, al in zip(list_of_agg_fields, list_of_aliases)]
)
或者,由于关键字参数是别名的快捷方式,您也可以这样做:
.agg(
pl.count().alias('Row_Count'), # 或者 Row_Count=pl.count()
**{al: af.null_count() for af, al in zip(list_of_agg_fields, list_of_aliases)}
)
几个月前曾经进行了大力改进,使许多核心函数的签名更加灵活,就像这样。请注意,groupby
也接受可变的位置参数,第一个参数可以是单个表达式或表达式的可迭代对象,这有点是之前这个努力的遗留物。
英文:
agg
takes variadic positional args and keyword args, so you should be able to do
.agg(
pl.count().alias('Row_Count'),
*[pl.col(af).null_count().alias(al) for af,al in zip(list_of_agg_fields, list_of_aliases)]
)
or, since keyword args is a shortcut for alias:
.agg(
pl.count().alias('Row_Count'), # or Row_Count=pl.count()
**{al : af.null_count() for af,al in zip(list_of_agg_fields, list_of_aliases}
)
There was a big effort a few months back to make a lot of core function signatures be more flexible like this. Note that groupby
takes variadic positional as well, the first argument being either a single expression or iterable of expressions is a bit of a leftover from before this effort.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论