使用Polars计算布尔(或数值)列中连续True(或1)值的数量?

huangapple go评论51阅读模式
英文:

Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?

问题

import polars
df = pl.DataFrame(
   {"values": [True,True,True,False,False,True,False,False,True,True]}
)

(
    df.lazy()
    .with_column(
        pl.when(pl.col("values") == True).then(
            pl.col("row_nr")
        ).fill_null(
            strategy = "forward"
        ).alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort = True)
    )
    .with_column(
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)
英文:

I am hoping to count consecutive values in a column, preferably using Polars expressions.

import polars
df = pl.DataFrame(
   {"values": [True,True,True,False,False,True,False,False,True,True]}
)

With the example data frame above, I would like to count the number of consecutive True values.

Below is example output using R's Data.Table package.

library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
value value2
TRUE NA
TRUE NA
TRUE 3
FALSE NA
FALSE NA
TRUE 1
FALSE NA
FALSE NA
TRUE NA
TRUE 2

Any ideas who this would be done efficiently using Polars?

[EDIT with a new approach]

I got it working with the code below, but hoping there is a more efficient way. Anyone know the default struct/dictionary field names from value_counts?

(
    df.lazy()
    .with_row_count()
    .with_column(
        pl.when(pl.col("value") == False).then(
            pl.col("row_nr")
            
        ).fill_null(
            strategy = "forward"
        ).alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort = True)
    )
    .with_column(
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)

答案1

得分: 5

One possible definition of the problem is:

  • On the last row of each true group, give me the group length.
df.with_columns(
   pl.when(pl.col("values") & pl.col("values").is_last())
     .then(pl.count())
     .over(pl.col("values").rle_id())
)
shape: (10, 2)
┌────────┬───────┐
 values  count 
 ---     ---   
 bool    u32   
╞════════╪═══════╡
 true    null  
 true    null  
 true    3     
 false   null  
 false   null  
 true    1     
 false   null  
 false   null  
 true    null  
 true    2     
└────────┴───────┘

.rle_id() 给出了连续值的“组 ID”。

df.with_columns(group = pl.col("values").rle_id())
shape: (10, 2)
┌────────┬───────┐
 values  group 
 ---     ---   
 bool    u32   
╞════════╪═══════╡
 true    0     
 true    0     
 true    0     
 false   1     
 false   1     
 true    2     
 false   3     
 false   3     
 true    4     
 true    4     
└────────┴───────┘

.is_last().over() 使我们能够检测每个组的最后一行。

pl.count().over() 给出了组中的行数。

英文:

One possible definition of the problem is:

  • On the last row of each true group, give me the group length.
df.with_columns(
   pl.when(pl.col("values") & pl.col("values").is_last())
     .then(pl.count())
     .over(pl.col("values").rle_id())
)
shape: (10, 2)
┌────────┬───────┐
 values  count 
 ---     ---   
 bool    u32   
╞════════╪═══════╡
 true    null  
 true    null  
 true    3     
 false   null  
 false   null  
 true    1     
 false   null  
 false   null  
 true    null  
 true    2     
└────────┴───────┘

.rle_id() gives us "group ids" for the consecutive values.

df.with_columns(group = pl.col("values").rle_id())
shape: (10, 2)
┌────────┬───────┐
 values  group 
 ---     ---   
 bool    u32   
╞════════╪═══════╡
 true    0     
 true    0     
 true    0     
 false   1     
 false   1     
 true    2     
 false   3     
 false   3     
 true    4     
 true    4     
└────────┴───────┘

.is_last() with the .over() allows us to detect the last row of each group.

pl.count() with .over() gives us the number of rows in the group.

huangapple
  • 本文由 发表于 2023年2月10日 04:55:54
  • 转载请务必保留本文链接:https://go.coder-hub.com/75404335.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定