How to explode a dataframe after groupby aggregation?

Question

A typical dataframe after a groupby might look like:

    import polars as pl

    # Build the example frame: three scalar columns and two list columns.
    df = pl.DataFrame(
        [
            pl.Series("chromosome", ['chr1', 'chr1', 'chr1', 'chr1'], dtype=pl.Utf8),
            pl.Series("starts", [10, 1, 4, 7], dtype=pl.Int64),
            pl.Series("ends", [11, 4, 5, 8], dtype=pl.Int64),
            pl.Series("starts_right", [[6, 8], [0, 5], [5, 0], [6, 5]], dtype=pl.List(pl.Int64)),
            pl.Series("ends_right", [[10, 9], [2, 7], [7, 2], [10, 7]], dtype=pl.List(pl.Int64)),
        ]
    )

    shape: (4, 5)
    ┌────────────┬────────┬──────┬──────────────┬────────────┐
    │ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
    │ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
    │ str        ┆ i64    ┆ i64  ┆ list[i64]    ┆ list[i64]  │
    ╞════════════╪════════╪══════╪══════════════╪════════════╡
    │ chr1       ┆ 10     ┆ 11   ┆ [6, 8]       ┆ [10, 9]    │
    │ chr1       ┆ 1      ┆ 4    ┆ [0, 5]       ┆ [2, 7]     │
    │ chr1       ┆ 4      ┆ 5    ┆ [5, 0]       ┆ [7, 2]     │
    │ chr1       ┆ 7      ┆ 8    ┆ [6, 5]       ┆ [10, 7]    │
    └────────────┴────────┴──────┴──────────────┴────────────┘

How do I explode the data frame in the least expensive way, so that each scalar entry is repeated twice (once per list element) and each item in each list appears once as a scalar value on its own row? I guess this operation is common enough that it should be built in somehow.

I.e.

    ┌────────────┬────────┬──────┬──────────────┬────────────┐
    │ chromosome ┆ starts ┆ ends ┆ starts_right ┆ ends_right │
    │ ---        ┆ ---    ┆ ---  ┆ ---          ┆ ---        │
    │ str        ┆ i64    ┆ i64  ┆ i64          ┆ i64        │
    ╞════════════╪════════╪══════╪══════════════╪════════════╡
    │ chr1       ┆ 10     ┆ 11   ┆ 6            ┆ 10         │
    │ chr1       ┆ 10     ┆ 11   ┆ 8            ┆ 9          │
    │ chr1       ┆ 1      ┆ 4    ┆ 0            ┆ 2          │
    │ chr1       ┆ 1      ┆ 4    ┆ 5            ┆ 7          │
    ...

Answer 1

Score: 2

Polars already has this built in. Just call .explode on the data frame and pass the list columns you want to explode.

    df.explode(['starts_right', 'ends_right'])

    chromosome  starts  ends  starts_right  ends_right
    "chr1"      10      11    6             10
    "chr1"      10      11    8             9
    "chr1"      1       4     0             2
    "chr1"      1       4     5             7
    "chr1"      4       5     5             7
    "chr1"      4       5     0             2
    "chr1"      7       8     6             10
    "chr1"      7       8     5             7

huangapple
  • Posted on June 26, 2023, 23:03:16
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76557910.html