遮蔽 polars 数据框以进行复杂操作

huangapple go评论62阅读模式
英文:

Masking a polars dataframe for complex operations

问题

如果我有一个 Polars 数据框架并希望执行遮罩操作,我目前看到两种选项:

选项1:创建过滤的数据框架,执行操作,然后与原始数据框架合并

masked_df = df.filter(mask)
masked_df = masked_df.with_columns(  # 计算一些列
    [
        pl.col("a").sin().alias("new_1"),
        pl.col("a").cos().alias("new_2"),
        (pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # 添加联接
    df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())

选项2:单独对每个操作进行遮罩

res = df.with_columns(  # 计算一些列 - 现在我们必须为每列添加`pl.when(mask).then()`
    [
        pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
        pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
        pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # 我们必须构建一个复杂的来回联接以应用遮罩到联接
    df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)

print(res.collect())

输出:

shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a   ┆ b   ┆ new_1    ┆ new_2     ┆ new_3    ┆ d    │
│ --- ┆ --- ┆ ---      ┆ ---       ┆ ---      ┆ ---  │
│ i64 ┆ i64 ┆ f64      ┆ f64       ┆ f64      ┆ i64  │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1   ┆ 5   ┆ null     ┆ null      ┆ null     ┆ null │
│ 2   ┆ 6   ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16   │
│ 3   ┆ 7   ┆ 0.14112  ┆ -0.989992 ┆ 0.428571 ┆ 17   │
│ 4   ┆ 8   ┆ null     ┆ null      ┆ null     ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘

大多数情况下,选项2会更快,但在涉及复杂性时,与选项1相比,代码会变得更加冗长,通常更难阅读。是否有一种更通用的方法来应用遮罩以覆盖多个连续的操作?

英文:

If I have a polars Dataframe and want to perform masked operations, I currently see two options:

# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema = ['a', 'b']).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()

# define mask
mask = pl.col('a').is_between(2, 3)

Option 1: create filtered dataframe, perform operations and join back to the original dataframe

masked_df = df.filter(mask)
masked_df = masked_df.with_columns(  # calculate some columns
    [
        pl.col("a").sin().alias("new_1"),
        pl.col("a").cos().alias("new_2"),
        (pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # throw a join into the mix
    df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())

Option 2: mask each operation individually

res = df.with_columns(  # calculate some columns - we have to add `pl.when(mask).then()` to each column now
    [
        pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
        pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
        pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
    ]
).join(  # we have to construct a convoluted back-and-forth join to apply the mask to the join
    df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)

print(res.collect())

Output:

shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a   ┆ b   ┆ new_1    ┆ new_2     ┆ new_3    ┆ d    │
│ --- ┆ --- ┆ ---      ┆ ---       ┆ ---      ┆ ---  │
│ i64 ┆ i64 ┆ f64      ┆ f64       ┆ f64      ┆ i64  │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1   ┆ 5   ┆ null     ┆ null      ┆ null     ┆ null │
│ 2   ┆ 6   ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16   │
│ 3   ┆ 7   ┆ 0.14112  ┆ -0.989992 ┆ 0.428571 ┆ 17   │
│ 4   ┆ 8   ┆ null     ┆ null      ┆ null     ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘

Most of the time, option 2 will be faster, but it gets pretty verbose and is generally harder to read than option 1 when any sort of complexity is involved.

Is there a way to apply a mask more generically to cover multiple subsequent operations?

答案1

得分: 6

以下是您要翻译的代码部分:

def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
    return [
        pl.when(mask).then(operation)
        for operation in operations
    ]

res = df.with_columns(
    with_mask(
        [
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            pl.col("a") / pl.col("b").alias("new_3"),
        ],
        mask,
    )
)
英文:

You can avoid the boiler plate by applying your mask to your operations in a helper function.


def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
    return [
        pl.when(mask).then(operation)
        for operation in operations
    ]

res = df.with_columns(
    with_mask(
        [
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            pl.col("a") / pl.col("b").alias("new_3"),
        ],
        mask,
    )
)

答案2

得分: 1

你可以使用一个struct结构和unnest来实现。

你的深度优先搜索在懒惰和急切之间不一致,所以我会将它们都变成懒惰。

df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ]).alias('allcols'))).unnest('allcols') \
    .with_columns([pl.when(mask).then(pl.col(x)).otherwise(None) 
                for x in df2.columns if x not in df]) \
    .collect()

我认为你问题的核心是如何编写具有多个列输出的 whenthen,这由第一个 with_columns 处理,然后第二个 with_columns 处理准拟半连接值替换行为。

另一种方法是首先创建一个你希望受到掩码的df2列的列表,然后将它们放入struct中。唯一不太美观的地方是你必须在执行unnest之前排除这些列。

df2_mask_cols = [x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ] + df2_mask_cols).alias('allcols'))) \
    .select(pl.exclude(df2_mask_cols)) \
    .unnest('allcols') \
    .collect()

令人惊讶的是,这种方法是最快的:

df.join(df2, on='b') \
    .with_columns([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ]) \
    .with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()
英文:

You can use a struct with an unnest

Your dfs weren't consistent between being lazy and eager so I'm going to make them both lazy

df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ]).alias('allcols'))).unnest('allcols') \
    .with_columns([pl.when(mask).then(pl.col(x)).otherwise(None) 
                for x in df2.columns if x not in df]) \
    .collect()

I think that's the heart of your question is how to write when then with multiple column outputs which is covered by the first with_columns and then the second with_columns covers the quasi-semi join value replacement behavior.

Another way you can write it is to first create a list of the columns in df2 that you want to be subject to the mask and then put those in the struct. The unsightly thing is that you have to then exclude those columns before you do the unnest

df2_mask_cols=[x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
    .with_columns(pl.when(mask).then(
        pl.struct([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
        ] + df2_mask_cols).alias('allcols'))) \
    .select(pl.exclude(df2_mask_cols)) \
    .unnest('allcols') \
    .collect()

Surprisingly, this approach was fastest:

df.join(df2, on='b') \
    .with_columns([
            pl.col("a").sin().alias("new_1"),
            pl.col("a").cos().alias("new_2"),
            (pl.col("a") /pl.col("b").cast(pl.Float64())).alias("new_3")
        ]) \
    .with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()

huangapple
  • 本文由 发表于 2023年2月16日 19:37:01
  • 转载请务必保留本文链接:https://go.coder-hub.com/75471704.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定