英文:
python polars: df partition with pivot and concat
问题
这是您提供的代码部分的翻译:
# 我的目标是按一个列(a列)进行分组/分区,创建一个字符串标识符(b和c列),然后将此b_c标识符用作透视数据帧中的列名。
# 根据我所知,下面的代码可以正常工作,但是获得结果的路径有点复杂。所以我的问题是:是否可以以更简单的方式完成这个任务?
# 顺便说一句,对于这个很小的规模(最多1k行),我不追求更快的速度。
data = {
"a": [1, 1, 1, 2, 2, 3],
"b": [11, 12, 13, 11, 12, 11],
"c": ["x1", "x2", "x3", "x1", "x2", "x1"],
"val": [101, 102, 102, 201, 202, 301],
}
df = pl.DataFrame(data)
print(df)
counter = 0
for tmp_df in df.partition_by("a"):
grp_df = (
tmp_df.with_columns((pl.col("b") + "_" + pl.col("c")).alias("col_id"))
.drop(["b", "c"])
.pivot(values="val", index="a", columns="col_id")
)
if counter == 0:
result_df = grp_df.select(pl.all())
else:
result_df = pl.concat([result_df, grp_df], how="diagonal")
counter += 1
print(result_df)
注意:这是您提供的代码的中文翻译,不包括其他类型的回答。
英文:
my goal was to groupby/partition by one column (a below), create a string identifier (b and c columns) then use this b_c identifier as a name for a column in a pivoted data frame.
Code below works OK as far as I can tell, but the path to get the result is a bit twisted. So my question is: can this be done in a simpler way?
BTW, at this tiny scale (max 1k of rows so far) I am not obsessed to make it faster.
data = {
"a": [1, 1, 1, 2, 2, 3],
"b": [11, 12, 13, 11, 12, 11],
"c": ["x1", "x2", "x3", "x1", "x2", "x1"],
"val": [101, 102, 102, 201, 202, 301],
}
df = pl.DataFrame(data)
print(df)
counter = 0
for tmp_df in df.partition_by("a"):
grp_df = (
tmp_df.with_columns((pl.col("b") + "_" + pl.col("c")).alias("col_id"))
.drop(["b", "c"])
.pivot(values="val", index="a", columns="col_id")
)
if counter == 0:
result_df = grp_df.select(pl.all())
else:
result_df = pl.concat([result_df, grp_df], how="diagonal")
counter += 1
print(result_df)
答案1
得分: 2
你可以分为两步完成:首先是选择步骤,创建新的 id
列,然后进行数据透视。
示例 1:
(
df.select(
'a', 'val',
id=pl.col('b').cast(pl.Utf8) + '_' + pl.col('c'))
.pivot(values='val', index='a', columns='id')
)
结果:
shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ a ┆ 11_x1 ┆ 12_x2 ┆ 13_x3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╡
│ 1 ┆ 101 ┆ 102 ┆ 102 │
│ 2 ┆ 201 ┆ 202 ┆ null │
│ 3 ┆ 301 ┆ null ┆ null │
└─────┴───────┴───────┴───────┘
示例 2:
(由 @jqurious 提出),使用 pl.format
(
df.select(
'a', 'val',
id=pl.format("{}_{}", "b", "c"))
.pivot(values='val', index='a', columns='id')
)
英文:
you can do this in 2 step: first a select step to create the new id
column, then the pivot.
Example 1:
(
df.select(
'a','val',
id = pl.col('b').cast(pl.Utf8) + '_' + pl.col('c'))
.pivot(values='val',index='a', columns='id')
)
# Result
shape: (3, 4)
┌─────┬───────┬───────┬───────┐
│ a ┆ 11_x1 ┆ 12_x2 ┆ 13_x3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╡
│ 1 ┆ 101 ┆ 102 ┆ 102 │
│ 2 ┆ 201 ┆ 202 ┆ null │
│ 3 ┆ 301 ┆ null ┆ null │
└─────┴───────┴───────┴───────┘
Example 2:
(suggested by @jqurious), using pl.format
(
df.select(
'a','val',
id = pl.format("{}_{}", "b", "c"))
.pivot(values='val',index='a', columns='id')
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论