Masking a polars dataframe for complex operations
Question
If I have a polars Dataframe and want to perform masked operations, I currently see two options:
# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema = ['a', 'b']).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()
# define mask
mask = pl.col('a').is_between(2, 3)
Option 1: create filtered dataframe, perform operations and join back to the original dataframe
masked_df = df.filter(mask)
masked_df = masked_df.with_columns( # calculate some columns
[
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # throw a join into the mix
df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())
Option 2: mask each operation individually
res = df.with_columns( # calculate some columns - we have to add `pl.when(mask).then()` to each column now
[
pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # we have to construct a convoluted back-and-forth join to apply the mask to the join
df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)
print(res.collect())
Output:
shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a ┆ b ┆ new_1 ┆ new_2 ┆ new_3 ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1 ┆ 5 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 6 ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16 │
│ 3 ┆ 7 ┆ 0.14112 ┆ -0.989992 ┆ 0.428571 ┆ 17 │
│ 4 ┆ 8 ┆ null ┆ null ┆ null ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘
Most of the time, option 2 will be faster, but it gets pretty verbose and is generally harder to read than option 1 when any sort of complexity is involved.
Is there a way to apply a mask more generically to cover multiple subsequent operations?
Answer 1
Score: 6
You can avoid the boiler plate by applying your mask to your operations in a helper function.
def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
return [
pl.when(mask).then(operation)
for operation in operations
]
res = df.with_columns(
with_mask(
[
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b")).alias("new_3"),
],
mask,
)
)
Answer 2
Score: 1
You can use a struct with an unnest.
Your dfs weren't consistent between being lazy and eager, so I'm going to make them both lazy:
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
]).alias('allcols'))).unnest('allcols') \
    .with_columns([pl.when(mask).then(pl.col(x)).otherwise(None)
                   for x in df2.columns if x not in df.columns]) \
.collect()
I think the heart of your question is how to write a when/then with multiple column outputs. That is covered by the first with_columns; the second with_columns then covers the quasi-semi-join value-replacement behavior.
Another way to write it is to first build a list of the columns in df2 that you want subject to the mask, and put those in the struct as well. The unsightly part is that you then have to exclude those columns before doing the unnest:
df2_mask_cols = [x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
] + df2_mask_cols).alias('allcols'))) \
.select(pl.exclude(df2_mask_cols)) \
.unnest('allcols') \
.collect()
Surprisingly, this approach was fastest:
df.join(df2, on='b') \
.with_columns([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
]) \
.with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()