使用Polars计算布尔(或数值)列中连续True(或1)值的数量?

huangapple go评论76阅读模式
英文:

Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?

问题

  1. import polars
  2. df = pl.DataFrame(
  3. {"values": [True,True,True,False,False,True,False,False,True,True]}
  4. )
  5. (
  6. df.lazy()
  7. .with_column(
  8. pl.when(pl.col("values") == True).then(
  9. pl.col("row_nr")
  10. ).fill_null(
  11. strategy = "forward"
  12. ).alias("id_consecutive_Trues")
  13. )
  14. .with_column(
  15. pl.col("id_consecutive_Trues").value_counts(sort = True)
  16. )
  17. .with_column(
  18. (
  19. pl.col("id_consecutive_Trues").arr.eval(
  20. pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
  21. ).arr.max()
  22. - pl.lit(1)
  23. ).alias("max_consecutive_true_values")
  24. )
  25. .collect()
  26. )
英文:

I am hoping to count consecutive values in a column, preferably using Polars expressions.

  1. import polars
  2. df = pl.DataFrame(
  3. {"values": [True,True,True,False,False,True,False,False,True,True]}
  4. )

With the example data frame above, I would like to count the number of consecutive True values.

Below is example output using R's Data.Table package.

  1. library(data.table)
  2. dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
  3. dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
  4. dt
value value2
TRUE NA
TRUE NA
TRUE 3
FALSE NA
FALSE NA
TRUE 1
FALSE NA
FALSE NA
TRUE NA
TRUE 2

Any ideas who this would be done efficiently using Polars?

[EDIT with a new approach]

I got it working with the code below, but hoping there is a more efficient way. Anyone know the default struct/dictionary field names from value_counts?

  1. (
  2. df.lazy()
  3. .with_row_count()
  4. .with_column(
  5. pl.when(pl.col("value") == False).then(
  6. pl.col("row_nr")
  7. ).fill_null(
  8. strategy = "forward"
  9. ).alias("id_consecutive_Trues")
  10. )
  11. .with_column(
  12. pl.col("id_consecutive_Trues").value_counts(sort = True)
  13. )
  14. .with_column(
  15. (
  16. pl.col("id_consecutive_Trues").arr.eval(
  17. pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
  18. ).arr.max()
  19. - pl.lit(1)
  20. ).alias("max_consecutive_true_values")
  21. )
  22. .collect()
  23. )

答案1

得分: 5

One possible definition of the problem is:

  • On the last row of each true group, give me the group length.
  1. df.with_columns(
  2. pl.when(pl.col("values") & pl.col("values").is_last())
  3. .then(pl.count())
  4. .over(pl.col("values").rle_id())
  5. )
  1. shape: (10, 2)
  2. ┌────────┬───────┐
  3. values count
  4. --- ---
  5. bool u32
  6. ╞════════╪═══════╡
  7. true null
  8. true null
  9. true 3
  10. false null
  11. false null
  12. true 1
  13. false null
  14. false null
  15. true null
  16. true 2
  17. └────────┴───────┘

.rle_id() 给出了连续值的“组 ID”。

  1. df.with_columns(group = pl.col("values").rle_id())
  1. shape: (10, 2)
  2. ┌────────┬───────┐
  3. values group
  4. --- ---
  5. bool u32
  6. ╞════════╪═══════╡
  7. true 0
  8. true 0
  9. true 0
  10. false 1
  11. false 1
  12. true 2
  13. false 3
  14. false 3
  15. true 4
  16. true 4
  17. └────────┴───────┘

.is_last().over() 使我们能够检测每个组的最后一行。

pl.count().over() 给出了组中的行数。

英文:

One possible definition of the problem is:

  • On the last row of each true group, give me the group length.
  1. df.with_columns(
  2. pl.when(pl.col("values") & pl.col("values").is_last())
  3. .then(pl.count())
  4. .over(pl.col("values").rle_id())
  5. )
  1. shape: (10, 2)
  2. ┌────────┬───────┐
  3. values count
  4. --- ---
  5. bool u32
  6. ╞════════╪═══════╡
  7. true null
  8. true null
  9. true 3
  10. false null
  11. false null
  12. true 1
  13. false null
  14. false null
  15. true null
  16. true 2
  17. └────────┴───────┘

.rle_id() gives us "group ids" for the consecutive values.

  1. df.with_columns(group = pl.col("values").rle_id())
  1. shape: (10, 2)
  2. ┌────────┬───────┐
  3. values group
  4. --- ---
  5. bool u32
  6. ╞════════╪═══════╡
  7. true 0
  8. true 0
  9. true 0
  10. false 1
  11. false 1
  12. true 2
  13. false 3
  14. false 3
  15. true 4
  16. true 4
  17. └────────┴───────┘

.is_last() with the .over() allows us to detect the last row of each group.

pl.count() with .over() gives us the number of rows in the group.

huangapple
  • 本文由 发表于 2023年2月10日 04:55:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404335.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定