英文:
Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?
问题
import polars
df = pl.DataFrame(
{"values": [True,True,True,False,False,True,False,False,True,True]}
)
(
df.lazy()
.with_column(
pl.when(pl.col("values") == True).then(
pl.col("row_nr")
).fill_null(
strategy = "forward"
).alias("id_consecutive_Trues")
)
.with_column(
pl.col("id_consecutive_Trues").value_counts(sort = True)
)
.with_column(
(
pl.col("id_consecutive_Trues").arr.eval(
pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
).arr.max()
- pl.lit(1)
).alias("max_consecutive_true_values")
)
.collect()
)
英文:
I am hoping to count consecutive values in a column, preferably using Polars expressions.
import polars
df = pl.DataFrame(
{"values": [True,True,True,False,False,True,False,False,True,True]}
)
With the example data frame above, I would like to count the number of consecutive True values.
Below is example output using R's Data.Table package.
library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
value | value2 |
---|---|
TRUE | NA |
TRUE | NA |
TRUE | 3 |
FALSE | NA |
FALSE | NA |
TRUE | 1 |
FALSE | NA |
FALSE | NA |
TRUE | NA |
TRUE | 2 |
Any ideas who this would be done efficiently using Polars?
[EDIT with a new approach]
I got it working with the code below, but hoping there is a more efficient way. Anyone know the default struct/dictionary field names from value_counts?
(
df.lazy()
.with_row_count()
.with_column(
pl.when(pl.col("value") == False).then(
pl.col("row_nr")
).fill_null(
strategy = "forward"
).alias("id_consecutive_Trues")
)
.with_column(
pl.col("id_consecutive_Trues").value_counts(sort = True)
)
.with_column(
(
pl.col("id_consecutive_Trues").arr.eval(
pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
).arr.max()
- pl.lit(1)
).alias("max_consecutive_true_values")
)
.collect()
)
答案1
得分: 5
One possible definition of the problem is:
- On the last row of each
true
group, give me the group length.
df.with_columns(
pl.when(pl.col("values") & pl.col("values").is_last())
.then(pl.count())
.over(pl.col("values").rle_id())
)
shape: (10, 2)
┌────────┬───────┐
│ values ┆ count │
│ --- ┆ --- │
│ bool ┆ u32 │
╞════════╪═══════╡
│ true ┆ null │
│ true ┆ null │
│ true ┆ 3 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ 1 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ null │
│ true ┆ 2 │
└────────┴───────┘
.rle_id()
给出了连续值的“组 ID”。
df.with_columns(group = pl.col("values").rle_id())
shape: (10, 2)
┌────────┬───────┐
│ values ┆ group │
│ --- ┆ --- │
│ bool ┆ u32 │
╞════════╪═══════╡
│ true ┆ 0 │
│ true ┆ 0 │
│ true ┆ 0 │
│ false ┆ 1 │
│ false ┆ 1 │
│ true ┆ 2 │
│ false ┆ 3 │
│ false ┆ 3 │
│ true ┆ 4 │
│ true ┆ 4 │
└────────┴───────┘
.is_last()
与 .over()
使我们能够检测每个组的最后一行。
pl.count()
与 .over()
给出了组中的行数。
英文:
One possible definition of the problem is:
- On the last row of each
true
group, give me the group length.
df.with_columns(
pl.when(pl.col("values") & pl.col("values").is_last())
.then(pl.count())
.over(pl.col("values").rle_id())
)
shape: (10, 2)
┌────────┬───────┐
│ values ┆ count │
│ --- ┆ --- │
│ bool ┆ u32 │
╞════════╪═══════╡
│ true ┆ null │
│ true ┆ null │
│ true ┆ 3 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ 1 │
│ false ┆ null │
│ false ┆ null │
│ true ┆ null │
│ true ┆ 2 │
└────────┴───────┘
.rle_id()
gives us "group ids" for the consecutive values.
df.with_columns(group = pl.col("values").rle_id())
shape: (10, 2)
┌────────┬───────┐
│ values ┆ group │
│ --- ┆ --- │
│ bool ┆ u32 │
╞════════╪═══════╡
│ true ┆ 0 │
│ true ┆ 0 │
│ true ┆ 0 │
│ false ┆ 1 │
│ false ┆ 1 │
│ true ┆ 2 │
│ false ┆ 3 │
│ false ┆ 3 │
│ true ┆ 4 │
│ true ┆ 4 │
└────────┴───────┘
.is_last()
with the .over()
allows us to detect the last row of each group.
pl.count()
with .over()
gives us the number of rows in the group.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论