How to explode a dataframe after a groupby aggregation?
Question
A typical dataframe after a groupby might look like:
import polars as pl

# Bind the frame to `df` so it can be referenced later.
df = pl.DataFrame(
    [
        pl.Series("chromosome", ['chr1', 'chr1', 'chr1', 'chr1'], dtype=pl.Utf8),
        pl.Series("starts", [10, 1, 4, 7], dtype=pl.Int64),
        pl.Series("ends", [11, 4, 5, 8], dtype=pl.Int64),
        pl.Series("starts_right", [[6, 8], [0, 5], [5, 0], [6, 5]], dtype=pl.List(pl.Int64)),
        pl.Series("ends_right", [[10, 9], [2, 7], [7, 2], [10, 7]], dtype=pl.List(pl.Int64)),
    ]
)
shape: (4, 5)
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ list[i64] ┆ list[i64] │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1 ┆ 10 ┆ 11 ┆ [6, 8] ┆ [10, 9] │
│ chr1 ┆ 1 ┆ 4 ┆ [0, 5] ┆ [2, 7] │
│ chr1 ┆ 4 ┆ 5 ┆ [5, 0] ┆ [7, 2] │
│ chr1 ┆ 7 ┆ 8 ┆ [6, 5] ┆ [10, 7] │
└────────────┴────────┴──────┴──────────────┴────────────┘
How do I explode this dataframe in the least expensive way, so that each item in each list ends up as a scalar value on its own row and each scalar entry is repeated once per list item (twice here)? I would guess this operation is common enough to be built in.
I.e.:
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1 ┆ 10 ┆ 11 ┆ 6 ┆ 10 │
│ chr1 ┆ 10 ┆ 11 ┆ 8 ┆ 9 │
│ chr1 ┆ 1 ┆ 4 ┆ 0 ┆ 2 │
│ chr1 ┆ 1 ┆ 4 ┆ 5 ┆ 7 │
...
Answer 1
Score: 2
Polars already has this built in. Call .explode on the dataframe and pass the list columns you want to explode.
df.explode(['starts_right','ends_right'])
chromosome | starts | ends | starts_right | ends_right |
---|---|---|---|---|
"chr1" | 10 | 11 | 6 | 10 |
"chr1" | 10 | 11 | 8 | 9 |
"chr1" | 1 | 4 | 0 | 2 |
"chr1" | 1 | 4 | 5 | 7 |
"chr1" | 4 | 5 | 5 | 7 |
"chr1" | 4 | 5 | 0 | 2 |
"chr1" | 7 | 8 | 6 | 10 |
"chr1" | 7 | 8 | 5 | 7 |