2023-02-10 04:55:54

Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?

Question

I am hoping to count consecutive values in a column, preferably using Polars expressions.

import polars as pl
df = pl.DataFrame(
   {"values": [True,True,True,False,False,True,False,False,True,True]}
)

With the example data frame above, I would like to count the number of consecutive True values.

Below is example output using R's data.table package.

library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt

value	value2
TRUE	NA
TRUE	NA
TRUE	3
FALSE	NA
FALSE	NA
TRUE	1
FALSE	NA
FALSE	NA
TRUE	NA
TRUE	2

Any ideas how this would be done efficiently using Polars?

[EDIT with a new approach]

I got it working with the code below, but I am hoping there is a more efficient way. Does anyone know the default struct/dictionary field names from value_counts?

(
    df.lazy()
    .with_row_count()
    .with_column(
        pl.when(pl.col("values") == False).then(
            pl.col("row_nr")
        ).fill_null(
            strategy = "forward"
        ).alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort = True)
    )
    .with_column(
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct().rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)

Answer 1

Score: 5

One possible definition of the problem is:

On the last row of each true group, give me the group length.

df.with_columns(
   pl.when(pl.col("values") & pl.col("values").is_last())
     .then(pl.count())
     .over(pl.col("values").rle_id())
)

shape: (10, 2)
┌────────┬───────┐
│ values ┆ count │
│ ---    ┆ ---   │
│ bool   ┆ u32   │
╞════════╪═══════╡
│ true   ┆ null  │
│ true   ┆ null  │
│ true   ┆ 3     │
│ false  ┆ null  │
│ false  ┆ null  │
│ true   ┆ 1     │
│ false  ┆ null  │
│ false  ┆ null  │
│ true   ┆ null  │
│ true   ┆ 2     │
└────────┴───────┘

.rle_id() gives us "group ids" for the consecutive values.

df.with_columns(group = pl.col("values").rle_id())

shape: (10, 2)
┌────────┬───────┐
│ values ┆ group │
│ ---    ┆ ---   │
│ bool   ┆ u32   │
╞════════╪═══════╡
│ true   ┆ 0     │
│ true   ┆ 0     │
│ true   ┆ 0     │
│ false  ┆ 1     │
│ false  ┆ 1     │
│ true   ┆ 2     │
│ false  ┆ 3     │
│ false  ┆ 3     │
│ true   ┆ 4     │
│ true   ┆ 4     │
└────────┴───────┘
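
To make the grouping concrete, here is a minimal pure-Python sketch of what a run-length id amounts to, using only the standard library. The helper name run_ids is hypothetical and is not part of Polars; it is only meant to reproduce the ids in the group column shown above.

```python
from itertools import groupby


def run_ids(values):
    """Assign one id per run of equal consecutive values (hypothetical helper)."""
    ids = []
    for run_id, (_, run) in enumerate(groupby(values)):
        # Every element of the same run shares the same id.
        ids.extend(run_id for _ in run)
    return ids


values = [True, True, True, False, False, True, False, False, True, True]
print(run_ids(values))  # [0, 0, 0, 1, 1, 2, 3, 3, 4, 4]
```

The id changes whenever the value changes, so two separate runs of True never share an id. That is what lets a per-group count measure the length of a single run.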

.is_last() with the .over() allows us to detect the last row of each group.

pl.count() with .over() gives us the number of rows in the group.

