How to explode a dataframe after a groupby aggregation?

Question

A typical dataframe after a groupby might look like:

import polars as pl

df = pl.DataFrame(
    [
        pl.Series("chromosome", ['chr1', 'chr1', 'chr1', 'chr1'], dtype=pl.Utf8),
        pl.Series("starts", [10, 1, 4, 7], dtype=pl.Int64),
        pl.Series("ends", [11, 4, 5, 8], dtype=pl.Int64),
        pl.Series("starts_right", [[6, 8], [0, 5], [5, 0], [6, 5]], dtype=pl.List(pl.Int64)),
        pl.Series("ends_right", [[10, 9], [2, 7], [7, 2], [10, 7]], dtype=pl.List(pl.Int64)),
    ]
)
print(df)

shape: (4, 5)
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ list[i64]    ┆ list[i64]  │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ [6, 8]       ┆ [10, 9]    │
│ chr1       ┆ 1      ┆ 4    ┆ [0, 5]       ┆ [2, 7]     │
│ chr1       ┆ 4      ┆ 5    ┆ [5, 0]       ┆ [7, 2]     │
│ chr1       ┆ 7      ┆ 8    ┆ [6, 5]       ┆ [10, 7]    │
└────────────┴────────┴──────┴──────────────┴────────────┘

How do I explode the data frame in the least expensive way, so that each scalar entry is repeated twice (once per list element) and each item in each list appears as a scalar value on its own row? I would guess this operation is common enough to be built in somehow.

I.e.

┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ i64          ┆ i64        │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ 6            ┆ 10         │
│ chr1       ┆ 10     ┆ 11   ┆ 8            ┆ 9          │
│ chr1       ┆ 1      ┆ 4    ┆ 0            ┆ 2          │
│ chr1       ┆ 1      ┆ 4    ┆ 5            ┆ 7          │
...

Answer 1

Score: 2

Polars already has this built in. Call .explode and pass the list columns you want to explode.

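df.explode(['starts_right', 'ends_right'])

shape: (8, 5)
┌────────────┬────────┬──────┬──────────────┬────────────┐
│ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
│ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
│ str        ┆ i64    ┆ i64  ┆ i64          ┆ i64        │
╞════════════╪════════╪══════╪══════════════╪════════════╡
│ chr1       ┆ 10     ┆ 11   ┆ 6            ┆ 10         │
│ chr1       ┆ 10     ┆ 11   ┆ 8            ┆ 9          │
│ chr1       ┆ 1      ┆ 4    ┆ 0            ┆ 2          │
│ chr1       ┆ 1      ┆ 4    ┆ 5            ┆ 7          │
│ chr1       ┆ 4      ┆ 5    ┆ 5            ┆ 7          │
│ chr1       ┆ 4      ┆ 5    ┆ 0            ┆ 2          │
│ chr1       ┆ 7      ┆ 8    ┆ 6            ┆ 10         │
│ chr1       ┆ 7      ┆ 8    ┆ 5            ┆ 7          │
└────────────┴────────┴──────┴──────────────┴────────────┘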

Posted by huangapple on 2023-06-26 23:03:16
Original link: https://go.coder-hub.com/76557910.html