How to find the number of nulls in every column of a polars DataFrame?
Question
In pandas, one can do:
import pandas as pd
d = {"foo":[1,2,3, None], "bar":[4,None, None, 6]}
df_pandas = pd.DataFrame.from_dict(d)
dict(df_pandas.isnull().sum())
[out]:
{'foo': 1, 'bar': 2}
In polars it's possible to do the same by looping through the columns:
import polars as pl
d = {"foo":[1,2,3, None], "bar":[4,None, None, 6]}
df_polars = pl.from_dict(d)
{col: df_polars[col].is_null().sum() for col in df_polars.columns}
Looping through the columns in polars is particularly painful with a LazyFrame, because the .collect() then has to be done in chunks to do the aggregation.
Is there a way to find the number of nulls in every column of a polars DataFrame without looping through each column?
Answer 1
Score: 3
Assuming you're not married to the output format, the idiomatic way to do it is:
df.select(pl.all().is_null().sum())
However, if you really like the dict output, you can easily get it:
df.select(pl.all().is_null().sum()).to_dicts()[0]
This works because, inside the select, we start with pl.all(), which means all of the columns. Then, much like in the pandas version, we apply is_null, which returns True/False for each value. From that we chain sum, which counts each True as 1 and gives the number of nulls in each column.
There's also the dedicated null_count(), so you don't have to chain is_null().sum(); thanks to @jqurious for that tip.
Answer 2
Score: 0
If you want row-wise counts, use this instead:
df.hstack(df.transpose().select(pl.all().is_null().sum()).transpose().rename({"column_0": "null_count"}))