Python Polars Expression For Cumulative Row Count by Day by Category

Question

Given a Polars dataframe:

import datetime
import polars as pl


df = pl.DataFrame(
    {
        "time": [
            datetime.datetime(2023, 1, 1, 9),
            datetime.datetime(2023, 1, 1, 10),
            datetime.datetime(2023, 1, 1, 12),
            datetime.datetime(2023, 1, 2, 9),
            datetime.datetime(2023, 1, 2, 10),
            datetime.datetime(2023, 1, 3, 12),
        ],
        "category": [1,1,2,1,2,1],
    }
)

I am seeking an expression row_count_by_day such that

expr = ...alias("row_count_by_day")
df = df.with_columns(expr)
print(df)

yields

shape: (6, 3)
┌─────────────────────┬──────────┬──────────────────┐
│ time                ┆ category ┆ row_count_by_day │
│ ---                 ┆ ---      ┆ ---              │
│ datetime[μs]        ┆ i64      ┆ i64              │
╞═════════════════════╪══════════╪══════════════════╡
│ 2023-01-01 09:00:00 ┆ 1        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 10:00:00 ┆ 1        ┆ 2                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 12:00:00 ┆ 2        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 09:00:00 ┆ 1        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 10:00:00 ┆ 2        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-03 12:00:00 ┆ 1        ┆ 1                │
└─────────────────────┴──────────┴──────────────────┘

That is, the cumulative row count for a particular category in a single day.

Ideally, I'd also like to extend this to multiple category columns: e.g. if category 1 has 5 rows with the same value in another column, they'd only be counted once.

I tried several combinations of window functions over category but, being a novice in Polars, I didn't get close. My best guess was some sort of rolling_sum of a constant column of ones over the category with a window size of 1d, but I couldn't get this to work.

Answer 1

Score: 1

It sounds like .dt.truncate is the piece you're missing.

df.with_columns(
   pl.first().cumcount()
     .over(pl.col("time").dt.truncate("1d"), "category")
     .alias("count") + 1
)
shape: (6, 3)
┌─────────────────────┬──────────┬───────┐
│ time                ┆ category ┆ count │
│ ---                 ┆ ---      ┆ ---   │
│ datetime[μs]        ┆ i64      ┆ u32   │
╞═════════════════════╪══════════╪═══════╡
│ 2023-01-01 09:00:00 ┆ 1        ┆ 1     │
│ 2023-01-01 10:00:00 ┆ 1        ┆ 2     │
│ 2023-01-01 12:00:00 ┆ 2        ┆ 1     │
│ 2023-01-02 09:00:00 ┆ 1        ┆ 1     │
│ 2023-01-02 10:00:00 ┆ 2        ┆ 1     │
│ 2023-01-03 12:00:00 ┆ 1        ┆ 1     │
└─────────────────────┴──────────┴───────┘
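
If you are on a newer Polars release, note that cumcount and pl.count() have since been renamed (to cum_count and pl.len()). The following is a rough sketch of an equivalent spelling under that assumption, reusing the df from the question; exact method names depend on your version:

import polars as pl

# Sketch for newer Polars versions (assumes the `df` from the question is in scope).
# pl.int_range(pl.len()) numbers the rows 0..n-1 within each window,
# so adding 1 gives the 1-based cumulative row count per day and category.
df.with_columns(
    (pl.int_range(pl.len()) + 1)
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("row_count_by_day")
)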

Answer 2

Score: 1

You can do it like this:

df.with_columns(
    row_count_by_day=pl.repeat(1, pl.count())
    .cumsum()
    .over(
        pl.col('time')
        .dt.truncate("1d"),
        'category'))

The weird thing here is pl.repeat(1, pl.count()), which seems like it should be the same as pl.lit(1), but it isn't, because of the way Polars broadcasts the scalar 1. If you instead use two contexts like this:

df.with_columns(one=pl.lit(1)).with_columns(
    row_count_by_day=pl.col('one')
    .cumsum()
    .over(
        pl.col('time')
        .dt.truncate("1d"),
        'category')).drop('one')

then the first with_columns broadcasts the 1 to all rows, so the second context works when it refers to the one column.
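
For the extension mentioned in the question, where rows of a category that repeat a value in another column should only be counted once, one possible sketch is to cumulatively sum an "is first occurrence" flag inside the same window. The other column below is hypothetical, and depending on your Polars version the method is called is_first or is_first_distinct:

import polars as pl

# Hypothetical extra column `other`; within a (day, category) window,
# rows that repeat an `other` value should only be counted once.
df2 = df.with_columns(pl.Series("other", [10, 10, 20, 30, 20, 40]))

df2.with_columns(
    pl.col("other")
    .is_first()       # True for the first occurrence of each `other` value in the window
    .cast(pl.UInt32)  # 1 for new values, 0 for repeats
    .cumsum()         # running count of distinct `other` values seen so far
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("distinct_count_by_day")
)

Because the flag is computed inside the window, a value that reappears on a different day (or for a different category) is counted again.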
