Masking a polars dataframe for complex operations
Question
If I have a polars Dataframe and want to perform masked operations, I currently see two options:
# create data
df = pl.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], schema = ['a', 'b']).lazy()
# create a second dataframe for added fun
df2 = pl.DataFrame([[8, 6, 7, 5], [15, 16, 17, 18]], schema=["b", "d"]).lazy()
# define mask
mask = pl.col('a').is_between(2, 3)
Option 1: create filtered dataframe, perform operations and join back to the original dataframe
masked_df = df.filter(mask)
masked_df = masked_df.with_columns( # calculate some columns
[
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # throw a join into the mix
df2, on="b", how="left"
)
res = df.join(masked_df, how="left", on=["a", "b"])
print(res.collect())
Option 2: mask each operation individually
res = df.with_columns( # calculate some columns - we have to add `pl.when(mask).then()` to each column now
[
pl.when(mask).then(pl.col("a").sin()).alias("new_1"),
pl.when(mask).then(pl.col("a").cos()).alias("new_2"),
pl.when(mask).then(pl.col("a") / pl.col("b")).alias("new_3"),
]
).join( # we have to construct a convoluted back-and-forth join to apply the mask to the join
df2.join(df.filter(mask), on="b", how="semi"), on="b", how="left"
)
print(res.collect())
Output:
shape: (4, 6)
┌─────┬─────┬──────────┬───────────┬──────────┬──────┐
│ a ┆ b ┆ new_1 ┆ new_2 ┆ new_3 ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 │
╞═════╪═════╪══════════╪═══════════╪══════════╪══════╡
│ 1 ┆ 5 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 6 ┆ 0.909297 ┆ -0.416147 ┆ 0.333333 ┆ 16 │
│ 3 ┆ 7 ┆ 0.14112 ┆ -0.989992 ┆ 0.428571 ┆ 17 │
│ 4 ┆ 8 ┆ null ┆ null ┆ null ┆ null │
└─────┴─────┴──────────┴───────────┴──────────┴──────┘
Most of the time, option 2 will be faster, but it gets pretty verbose and is generally harder to read than option 1 when any sort of complexity is involved.
Is there a way to apply a mask more generically to cover multiple subsequent operations?
Answer 1
Score: 6
You can avoid the boiler plate by applying your mask to your operations in a helper function.
def with_mask(operations: list[pl.Expr], mask) -> list[pl.Expr]:
return [
pl.when(mask).then(operation)
for operation in operations
]
res = df.with_columns(
with_mask(
[
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
            (pl.col("a") / pl.col("b")).alias("new_3"),
],
mask,
)
)
Answer 2
Score: 1
You can use a struct with an unnest.
Your dfs weren't consistent between being lazy and eager, so I'm going to make them both lazy:
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
]).alias('allcols'))).unnest('allcols') \
    .with_columns([pl.when(mask).then(pl.col(x)).otherwise(None)
                   for x in df2.columns if x not in df.columns]) \
.collect()
I think the heart of your question is how to write a when/then with multiple column outputs. That is covered by the first with_columns; the second with_columns then covers the quasi-semi-join value-replacement behavior.
Another way to write it is to first build a list of the columns in df2 that you want subject to the mask, and put those in the struct as well. The unsightly part is that you then have to exclude those columns before doing the unnest:
df2_mask_cols = [x for x in df2.columns if x not in df.columns]
df.join(df2, on='b') \
.with_columns(pl.when(mask).then(
pl.struct([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
] + df2_mask_cols).alias('allcols'))) \
.select(pl.exclude(df2_mask_cols)) \
.unnest('allcols') \
.collect()
Surprisingly, this approach was fastest:
df.join(df2, on='b') \
.with_columns([
pl.col("a").sin().alias("new_1"),
pl.col("a").cos().alias("new_2"),
(pl.col("a") / pl.col("b").cast(pl.Float64())).alias("new_3")
]) \
.with_columns(pl.when(mask).then(pl.exclude(df.columns))).collect()