如何在polars中添加具有不同形状的多个DataFrame?

huangapple go评论66阅读模式
英文:

How to add multiple DataFrames with different shapes in polars?

问题

I would like to add multiple DataFrames with different shapes together.

Before adding the DataFrames, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).

Here is an example of the inputs:

import polars as pl

a = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)

b = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)

c = pl.DataFrame(
    data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)

The expected result would be:

shape: (5, 4)
┌───────┬───────┬───────┬───────┐
 index  col_1  col_2  col_3 
 ---    ---    ---    ---   
 i64    i64    i64    i64   
╞═══════╪═══════╪═══════╪═══════╡
 1      12     2      1     
 2      1      2      0     
 3      1      2      0     
 4      10     0      1     
 5      10     0      1     
└───────┴───────┴───────┴───────┘

The order of the columns is not a concern.

Here is a solution but it seems a little bit clunky:

from functools import reduce

columns = set()

for df in [a, b, c]:
    for column in df.columns:
        columns.add(column)

reshaped_df = []

for df in [a, b, c]:
    for column in columns:
        if column not in df.columns:
            df = df.with_columns(pl.lit(0).alias(column))
            reshaped_df.append(df)

reshaped_df = pl.align_frames(*reshaped_df, on="index", select=columns)

index = reshaped_df[0].select("index").to_series()

result = reduce(
    lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
    reshaped_df).hstack([index])
英文:

I would like to add multiple DataFrames with different shapes together.

Before adding the DataFrames, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).

Here is an example of the inputs:

import polars as pl

a = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)

b = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)

c = pl.DataFrame(
    data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)

The expected result would be:

shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

The order of the columns is not a concern.

Here is a solution but it seems a little bit clunky:

from functools import reduce

columns = set()

for df in [a, b, c]:
    for column in df.columns:
        columns.add(column)

reshaped_df = []

for df in [a, b, c]:
    for column in columns:
        if column not in df.columns:
            df = df.with_columns(pl.lit(0).alias(column))
            reshaped_df.append(df)

reshaped_df = pl.align_frames(*reshaped_df, on="index", select=columns)

index = reshaped_df[0].select("index").to_series()

result = reduce(
    lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
    reshaped_df).hstack([index])

答案1

得分: 4

以下是翻译好的部分:

还有 pl.concat(how="diagonal")

pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
 index  col_1  col_2  col_3 
 ---    ---    ---    ---   
 i64    i64    i64    i64   
╞═══════╪═══════╪═══════╪═══════╡
 1      12     2      1     
 2      1      2      null  
 3      1      2      null  
 4      10     null   1     
 5      10     null   1     
└───────┴───────┴───────┴───────┘
英文:

There's also pl.concat(how="diagonal")

pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()
shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ null  │
│ 3     ┆ 1     ┆ 2     ┆ null  │
│ 4     ┆ 10    ┆ null  ┆ 1     │
│ 5     ┆ 10    ┆ null  ┆ 1     │
└───────┴───────┴───────┴───────┘

答案2

得分: 2

以下是翻译好的代码部分:

我不确定这是否是最优解但使用你的数据框 `a`、`b``c`,你可以这样做
```python
for i, df in enumerate((b, c)):
    mapping = {c: f"{c}_{i}" for c in df.columns if c != "index"}
    a = a.join(df.rename(mapping), on="index", how="outer")
a = a.fill_null(0).select([pl.col("index")] + [
    pl.sum(pl.col(f"^col_{i}.*$")).alias(f"col_{i}") for i in (1, 2, 3)
]).sort(by="index")

以获得如下结果:

┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

首先在 index 上进行外连接,并使用修改后的列名。然后对应列名的列进行求和,这些列名以特定前缀开始。


<details>
<summary>英文:</summary>

I&#39;m not sure this is optimal, but with your dataframes `a`, `b` and `c` you could do
```python
for i, df in enumerate((b, c)):
    mapping = {c: f&quot;{c}_{i}&quot; for c in df.columns if c != &quot;index&quot;}
    a = a.join(df.rename(mapping), on=&quot;index&quot;, how=&quot;outer&quot;)
a = a.fill_null(0).select([pl.col(&quot;index&quot;)] + [
    pl.sum(pl.col(f&quot;^col_{i}.*$&quot;)).alias(f&quot;col_{i}&quot;) for i in (1, 2, 3)
]).sort(by=&quot;index&quot;)

to get

┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

So first outer-join the dataframes on index with modified column names. Then sum over the corresponding columns, identified by the start of the column names.

huangapple
  • 本文由 发表于 2023年5月14日 17:13:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/76246692.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定