2023-06-26 23:03:16

How to explode dataframe after groupby aggregation?

Question

A typical dataframe after a groupby aggregation might look like this:
import polars as pl
# Bound to `df` so that later snippets can refer to it.
df = pl.DataFrame(
    [
        pl.Series("chromosome", ['chr1', 'chr1', 'chr1', 'chr1'], dtype=pl.Utf8),
        pl.Series("starts", [10, 1, 4, 7], dtype=pl.Int64),
        pl.Series("ends", [11, 4, 5, 8], dtype=pl.Int64),
        pl.Series("starts_right", [[6, 8], [0, 5], [5, 0], [6, 5]], dtype=pl.List(pl.Int64)),
        pl.Series("ends_right", [[10, 9], [2, 7], [7, 2], [10, 7]], dtype=pl.List(pl.Int64)),
    ]
)
print(df)
shape: (4, 5)
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ list[i64]    ┆ list[i64]  │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ [6, 8]       ┆ [10, 9]    │
│ chr1       ┆ 1      ┆ 4    ┆ [0, 5]       ┆ [2, 7]     │
│ chr1       ┆ 4      ┆ 5    ┆ [5, 0]       ┆ [7, 2]     │
│ chr1       ┆ 7      ┆ 8    ┆ [6, 5]       ┆ [10, 7]    │
└────────────┴────────┴──────┴──────────────┴────────────┘
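How do I explode the dataframe in the least expensive way, so that each scalar entry is repeated once per list item (twice here) and each list item ends up as a scalar value on its own row? I assume this operation is common enough that it should be built in somehow.
For example: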
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ i64          ┆ i64        │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ 6            ┆ 10         │
│ chr1       ┆ 10     ┆ 11   ┆ 8            ┆ 9          │
│ chr1       ┆ 1      ┆ 4    ┆ 0            ┆ 2          │
│ chr1       ┆ 1      ┆ 4    ┆ 5            ┆ 7          │
...

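Answer 1

Score: 2

Polars already has this built in. Call .explode on the dataframe and pass the list columns you want to explode; the scalar columns are repeated for each list item.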

df.explode(['starts_right','ends_right'])

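┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ i64          ┆ i64        │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ 6            ┆ 10         │
│ chr1       ┆ 10     ┆ 11   ┆ 8            ┆ 9          │
│ chr1       ┆ 1      ┆ 4    ┆ 0            ┆ 2          │
│ chr1       ┆ 1      ┆ 4    ┆ 5            ┆ 7          │
│ chr1       ┆ 4      ┆ 5    ┆ 5            ┆ 7          │
│ chr1       ┆ 4      ┆ 5    ┆ 0            ┆ 2          │
│ chr1       ┆ 7      ┆ 8    ┆ 6            ┆ 10         │
│ chr1       ┆ 7      ┆ 8    ┆ 5            ┆ 7          │
└────────────┴────────┴──────┴──────────────┴────────────┘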
