May 14, 2023, 17:13:47

How to add multiple DataFrames with different shapes in polars?

Question

I would like to add multiple DataFrames with different shapes together.

Before adding the DataFrames, the idea would be to reshape them by adding the missing rows (using an "index" column as the reference) and the missing columns (filled with 0).

Here is an example of the inputs:

import polars as pl

a = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 0, 0], "col_2": [1, 1, 1]}
)

b = pl.DataFrame(
    data={"index": [1, 2, 3], "col_1": [1, 1, 1], "col_2": [1, 1, 1]}
)

c = pl.DataFrame(
    data={"index": [1, 4, 5], "col_1": [10, 10, 10], "col_3": [1, 1, 1]}
)

The expected result would be:

shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

The order of the columns is not a concern.

Here is a solution but it seems a little bit clunky:

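from functools import reduce

# Collect the union of column names across all frames.
columns = set()

for df in [a, b, c]:
    for column in df.columns:
        columns.add(column)

reshaped_df = []

for df in [a, b, c]:
    # Add every missing column, filled with 0.
    for column in columns:
        if column not in df.columns:
            df = df.with_columns(pl.lit(0).alias(column))
    # Append once per frame, after all missing columns have been added.
    reshaped_df.append(df)

# Align the frames on "index" so they share the same rows and column order.
reshaped_df = pl.align_frames(*reshaped_df, on="index", select=list(columns))

index = reshaped_df[0].select("index").to_series()

# Rows introduced by the alignment are null, so fill them with 0 before adding.
result = reduce(
    lambda a, b: a.select(pl.exclude("index").fill_null(value=0)) + b.select(pl.exclude("index").fill_null(value=0)),
    reshaped_df).hstack([index])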

Answer 1

Score: 4

There's also pl.concat(how="diagonal"):

pl.concat([a, b, c], how="diagonal").groupby("index", maintain_order=True).sum()

shape: (5, 4)
┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ null  │
│ 3     ┆ 1     ┆ 2     ┆ null  │
│ 4     ┆ 10    ┆ null  ┆ 1     │
│ 5     ┆ 10    ┆ null  ┆ 1     │
└───────┴───────┴───────┴───────┘

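Answer 2

Score: 2

I'm not sure this is optimal, but with your dataframes `a`, `b` and `c` you could do
```python
# Suffix every non-index column of b and c so names stay unique after joining.
for i, df in enumerate((b, c)):
    mapping = {c: f"{c}_{i}" for c in df.columns if c != "index"}
    a = a.join(df.rename(mapping), on="index", how="outer")
# Replace nulls with 0, then sum each group of columns sharing a "col_N" prefix.
a = a.fill_null(0).select([pl.col("index")] + [
    pl.sum(pl.col(f"^col_{i}.*$")).alias(f"col_{i}") for i in (1, 2, 3)
]).sort(by="index")
```

to get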

┌───────┬───────┬───────┬───────┐
│ index ┆ col_1 ┆ col_2 ┆ col_3 │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 12    ┆ 2     ┆ 1     │
│ 2     ┆ 1     ┆ 2     ┆ 0     │
│ 3     ┆ 1     ┆ 2     ┆ 0     │
│ 4     ┆ 10    ┆ 0     ┆ 1     │
│ 5     ┆ 10    ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┘

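So first outer-join the dataframes on index with modified column names. Then sum over the corresponding columns, identified by the start of the column names.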

