Polars and the Lazy API: How to drop columns that contain only null values?
Question
I am working with Polars and need to drop columns that contain only null values during my data preprocessing. However, I am having trouble using the Lazy API to accomplish this.
For instance, given the table below, how can I drop column "a" using Polars' Lazy API?
import polars as pl

df = pl.DataFrame(
    {
        "a": [None, None, None, None],
        "b": [1, 2, None, 1],
        "c": [1, None, None, 1],
    }
)
df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ null ┆ 1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1 ┆ 1 │
└──────┴──────┴──────┘
I am aware of Issue #1613 and its solution of filtering out columns where all values are null, but that solution does not use the Lazy API.
For reference:
# filter columns where all values are null
df[:, [not (s.null_count() == df.height) for s in df]]
I am also aware of the drop_nulls function in Polars, but it can only drop rows that contain null values, unlike the dropna function in Pandas, which takes the two arguments axis and how.
Can someone provide an example of how to drop columns with all null values in Polars using the Lazy API?
Answer 1
Score: 0
You can't, at least not in the way you want. Polars doesn't know enough about a LazyFrame to tell which columns contain only nulls until you collect. That means you need one collect to find the columns you want to keep, and then another one to materialize the result with only those columns.
First, make your frame lazy with df = df.lazy(), then proceed in two steps.
Step 1:
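# First collect: find the columns that are not entirely null
(df.select(pl.all().is_null().all())  # one boolean per column: True if every value is null
   .melt()                            # reshape to (variable, value) rows; newer Polars releases name this unpivot()
   .filter(~pl.col('value'))          # keep the columns that are not all null
   .select('variable')
   .collect()
   .to_series()
   .to_list())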
Those are the columns that are not entirely null (they may still contain some nulls), so now you wrap that expression in its own select.
Step 2:
# Second collect: select only those columns from the lazy frame
(df.select(
    df.select(pl.all().is_null().all())
      .melt()
      .filter(~pl.col('value'))
      .select('variable')
      .collect()
      .to_series()
      .to_list())
   .collect())