Polars and the Lazy API: How to drop columns that contain only null values?

Question

I am working with Polars and need to drop columns that contain only null values during my data preprocessing. However, I am having trouble using the Lazy API to accomplish this.

For instance, given the table below, how can I drop column "a" using Polars' Lazy API?

import polars as pl

df = pl.DataFrame(
    {
        "a": [None, None, None, None],
        "b": [1, 2, None, 1],
        "c": [1, None, None, 1],
    }
)
df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘

I am aware of Issue #1613 and its solution of filtering out columns where all values are null, but that solution does not use the Lazy API.

FYI,

# filter columns where all values are null
df[:, [not (s.null_count() == df.height) for s in df]]

I am also aware of the drop_nulls function in Polars, but it can only drop rows that contain null values, unlike the dropna function in Pandas, which takes the arguments axis and how.
Can someone provide an example of how to drop columns with all null values in Polars using the Lazy API?

Answer 1

Score: 0

You can't, at least not in the way you want. Polars doesn't know enough about the LazyFrame to tell which columns contain only nulls until you collect. That means you need one collect to find out which columns to keep, and then another to materialize those columns.

Let's first make your df lazy with df = df.lazy()

Step 1:

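# One boolean per column: True if every value in that column is null
(df.select(pl.all().is_null().all())
    # Long form: one row per column, with its name in 'variable' and flag in 'value'
    .melt()
    # Keep only the columns that are not entirely null
    .filter(~pl.col('value'))
    .select('variable')
    .collect()
    .to_series()
    .to_list())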

Those are the columns that are not entirely null, so now you wrap that query in its own select.

Step 2:

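(df.select(
    # Inner query (Step 1): names of the columns that are not entirely null
    df.select(pl.all().is_null().all())
        .melt()
        .filter(~pl.col('value'))
        .select('variable')
        .collect()
        .to_series()
        .to_list())
# The outer collect materializes only the selected columns
.collect())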

huangapple
  • Published on 2023-05-26 14:46:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76338261.html