英文:
How to add multiple DataFrames with different shapes in polars?
问题
I would like to add multiple DataFrames
with different shapes together.
Before adding the DataFrames
, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).
Here is an example of the inputs:
import polars as pl
a = pl.DataFrame(
data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)
b = pl.DataFrame(
data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)
c = pl.DataFrame(
data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)
The expected result would be:
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ 0 │
│ 3 ┆ 1 ┆ 2 ┆ 0 │
│ 4 ┆ 10 ┆ 0 ┆ 1 │
│ 5 ┆ 10 ┆ 0 ┆ 1 │
└───────┴───────┴───────┴───────┘
The order of the columns is not a concern.
Here is a solution but it seems a little bit clunky:
from functools import reduce
columns = set()
for df in [a, b, c]:
for column in df.columns:
columns.add(column)
reshaped_df = []
for df in [a, b, c]:
for column in columns:
if column not in df.columns:
df = df.with_columns(pl.lit(0).alias(column))
reshaped_df.append(df)
reshaped_df = pl.align_frames(*reshaped_df, on="index", select=columns)
index = reshaped_df[0].select("index").to_series()
result = reduce(
lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
reshaped_df).hstack([index])
英文:
I would like to add multiple DataFrames
with different shapes together.
Before adding the DataFrames
, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).
Here is an example of the inputs:
import polars as pl
a = pl.DataFrame(
data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)
b = pl.DataFrame(
data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)
c = pl.DataFrame(
data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)
The expected result would be:
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ 0 │
│ 3 ┆ 1 ┆ 2 ┆ 0 │
│ 4 ┆ 10 ┆ 0 ┆ 1 │
│ 5 ┆ 10 ┆ 0 ┆ 1 │
└───────┴───────┴───────┴───────┘
The order of the columns is not a concern.
Here is a solution but it seems a little bit clunky:
from functools import reduce
columns = set()
for df in [a, b, c]:
for column in df.columns:
columns.add(column)
reshaped_df = []
for df in [a, b, c]:
for column in columns:
if column not in df.columns:
df = df.with_columns(pl.lit(0).alias(column))
reshaped_df.append(df)
reshaped_df = pl.align_frames(*reshaped_df, on="index", select=columns)
index = reshaped_df[0].select("index").to_series()
result = reduce(
lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
reshaped_df).hstack([index])
答案1
得分: 4
以下是翻译好的部分:
pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ null │
│ 3 ┆ 1 ┆ 2 ┆ null │
│ 4 ┆ 10 ┆ null ┆ 1 │
│ 5 ┆ 10 ┆ null ┆ 1 │
└───────┴───────┴───────┴───────┘
英文:
There's also pl.concat(how="diagonal")
pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ null │
│ 3 ┆ 1 ┆ 2 ┆ null │
│ 4 ┆ 10 ┆ null ┆ 1 │
│ 5 ┆ 10 ┆ null ┆ 1 │
└───────┴───────┴───────┴───────┘
答案2
得分: 2
以下是翻译好的代码部分:
我不确定这是否是最优解,但使用你的数据框 `a`、`b` 和 `c`,你可以这样做:
```python
for i, df in enumerate((b, c)):
mapping = {c: f"{c}_{i}" for c in df.columns if c != "index"}
a = a.join(df.rename(mapping), on="index", how="outer")
a = a.fill_null(0).select([pl.col("index")] + [
pl.sum(pl.col(f"^col_{i}.*$")).alias(f"col_{i}") for i in (1, 2, 3)
]).sort(by="index")
以获得如下结果:
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ 0 │
│ 3 ┆ 1 ┆ 2 ┆ 0 │
│ 4 ┆ 10 ┆ 0 ┆ 1 │
│ 5 ┆ 10 ┆ 0 ┆ 1 │
└───────┴───────┴───────┴───────┘
首先在 index
上进行外连接,并使用修改后的列名。然后对应列名的列进行求和,这些列名以特定前缀开始。
<details>
<summary>英文:</summary>
I'm not sure this is optimal, but with your dataframes `a`, `b` and `c` you could do
```python
for i, df in enumerate((b, c)):
mapping = {c: f"{c}_{i}" for c in df.columns if c != "index"}
a = a.join(df.rename(mapping), on="index", how="outer")
a = a.fill_null(0).select([pl.col("index")] + [
pl.sum(pl.col(f"^col_{i}.*$")).alias(f"col_{i}") for i in (1, 2, 3)
]).sort(by="index")
to get
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ 1 ┆ 12 ┆ 2 ┆ 1 │
│ 2 ┆ 1 ┆ 2 ┆ 0 │
│ 3 ┆ 1 ┆ 2 ┆ 0 │
│ 4 ┆ 10 ┆ 0 ┆ 1 │
│ 5 ┆ 10 ┆ 0 ┆ 1 │
└───────┴───────┴───────┴───────┘
So first outer-join the dataframes on index
with modified column names. Then sum over the corresponding columns, identified by the start of the column names.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论