Polars – Groupby_rolling – 如何每天重置计数?

huangapple go评论75阅读模式
英文:

Polars - Groupby_rolling - How to reset count each day?

问题

我已经从Pandas-Python切换到Polars,我正在尝试弄清楚如何对每一天进行滚动求和。我不只想要每一天的总和,我想要每一天每个时间段(5分钟段)的总和,以"rolling_sum"的方式呈现。以下是我目前的代码:

from datetime import datetime, timedelta

df = pl.DataFrame(
    {
    "date": pl.date_range(
            datetime(1985, 1, 1),
            datetime(1985, 1, 6),
            timedelta(minutes=5),
            time_unit="ns",
            eager=True).alias("date"),
    }
)
df = df.with_columns(
  A=pl.lit(2, dtype=pl.Int32)
)

df = df.lazy().groupby_rolling(
    "date",
    period="1d",
).agg(
    pl.col("A").sum().alias("A_daily"),
  ).collect()

print(df[280:])

这会产生以下结果:

shape: (1_161, 2)
┌─────────────────────┬─────────┐
│ date                ┆ A_daily │
│ ---                 ┆ ---     │
│ datetime[ns]        ┆ i32     │
╞═════════════════════╪═════════╡
│ 1985-01-01 23:20:00 ┆ 562     │
│ 1985-01-01 23:25:00 ┆ 564     │
│ 1985-01-01 23:30:00 ┆ 566     │
│ 1985-01-01 23:35:00 ┆ 568     │
│ 1985-01-01 23:40:00 ┆ 570     │
│ 1985-01-01 23:45:00 ┆ 572     │
│ 1985-01-01 23:50:00 ┆ 574     │
│ 1985-01-01 23:55:00 ┆ 576     │
│ 1985-01-02 00:00:00 ┆ 576     │
│ 1985-01-02 00:05:00 ┆ 576     │
│ 1985-01-02 00:10:00 ┆ 576     │
│ 1985-01-02 00:15:00 ┆ 576     │
│ 1985-01-02 00:20:00 ┆ 576     │
│ 1985-01-02 00:25:00 ┆ 576     │
│ 1985-01-02 00:30:00 ┆ 576     │
│ 1985-01-02 00:35:00 ┆ 576     │
│ 1985-01-02 00:40:00 ┆ 576     │
│ 1985-01-02 00:45:00 ┆ 576     │
│ 1985-01-02 00:50:00 ┆ 576     │
...
│ 1985-01-05 23:50:00 ┆ 576     │
│ 1985-01-05 23:55:00 ┆ 576     │
│ 1985-01-06 00:00:00 ┆ 576     │
└─────────────────────┴─────────┘

我猜我实际上想要的是每一天,A_Daily重新开始,即如下所示:

shape: (1_161, 2)
┌─────────────────────┬─────────┐
│ date                ┆ A_daily │
│ ---                 ┆ ---     │
│ datetime[ns]        ┆ i32     │
╞═════════════════════╪═════════╡
│ 1985-01-01 23:20:00 ┆ 562     │
│ 1985-01-01 23:25:00 ┆ 564     │
│ 1985-01-01 23:30:00 ┆ 566     │
│ 1985-01-01 23:35:00 ┆ 568     │
│ 1985-01-01 23:40:00 ┆ 570     │
│ 1985-01-01 23:45:00 ┆ 572     │
│ 1985-01-01 23:50:00 ┆ 574     │
│ 1985-01-01 23:55:00 ┆ 576     │
**│ 1985-01-02 00:00:00 ┆   2     │**
│ 1985-01-02 00:05:00 ┆   4     │
│ 1985-01-02 00:10:00 ┆   6     │
│ 1985-01-02 00:15:00 ┆   8     │
│ 1985-01-02 00:20:00 ┆  10     │
│ 1985-01-02 00:25:00 ┆  12     │
│ 1985-01-02 00:30:00 ┆  14     │
│ 1985-01-02 00:35:00 ┆  18     │
│ 1985-01-02 00:40:00 ┆  20     │
│ 1985-01-02 00:45:00 ┆  22     │
│ 1985-01-02 00:50:00 ┆  24     │

基本上,我只想在每天重新开始滚动总和。在Pandas中,我会像这样做:

tn = outdata.groupby(pd.Grouper(freq='D', key='_daily_reset'))[
    #     '_vp_volume'].transform('cumsum')

谢谢!

英文:

I've switched from Pandas-Python to Polars and I'm trying to figure out how to essentially do a rolling sum for each day. I don't want just the total for each day, I want the totals for each time period (5m period) each day, in a "rolling_sum" fashion. Here is the code I have so far:

from datetime import datetime, timedelta


df = pl.DataFrame(
    {
    "date": pl.date_range(
            datetime(1985, 1, 1),
            datetime(1985, 1, 6),
            timedelta(minutes=5),
            time_unit="ns",
            eager=True).alias("date"),
    }
)
df = df.with_columns(
  A=pl.lit(2, dtype=pl.Int32)
)


df = df.lazy().groupby_rolling(
    "date",
    period="1d",
).agg(
    pl.col("A").sum().alias("A_daily"),
  ).collect()


print(df[280:])

This produces:

shape: (1_161, 2)
┌─────────────────────┬─────────┐
│ date                ┆ A_daily │
│ ---                 ┆ ---     │
│ datetime[ns]        ┆ i32     │
╞═════════════════════╪═════════╡
│ 1985-01-01 23:20:00 ┆ 562     │
│ 1985-01-01 23:25:00 ┆ 564     │
│ 1985-01-01 23:30:00 ┆ 566     │
│ 1985-01-01 23:35:00 ┆ 568     │
│ 1985-01-01 23:40:00 ┆ 570     │
│ 1985-01-01 23:45:00 ┆ 572     │
│ 1985-01-01 23:50:00 ┆ 574     │
│ 1985-01-01 23:55:00 ┆ 576     │
│ 1985-01-02 00:00:00 ┆ 576     │
│ 1985-01-02 00:05:00 ┆ 576     │
│ 1985-01-02 00:10:00 ┆ 576     │
│ 1985-01-02 00:15:00 ┆ 576     │
│ 1985-01-02 00:20:00 ┆ 576     │
│ 1985-01-02 00:25:00 ┆ 576     │
│ 1985-01-02 00:30:00 ┆ 576     │
│ 1985-01-02 00:35:00 ┆ 576     │
│ 1985-01-02 00:40:00 ┆ 576     │
│ 1985-01-02 00:45:00 ┆ 576     │
│ 1985-01-02 00:50:00 ┆ 576     │
...
│ 1985-01-05 23:50:00 ┆ 576     │
│ 1985-01-05 23:55:00 ┆ 576     │
│ 1985-01-06 00:00:00 ┆ 576     │
└─────────────────────┴─────────┘

What I guess what I'm actually after is for each day, I want the A_Daily to reset, i.e. to look like:

shape: (1_161, 2)
┌─────────────────────┬─────────┐
│ date                ┆ A_daily │
│ ---                 ┆ ---     │
│ datetime[ns]        ┆ i32     │
╞═════════════════════╪═════════╡
│ 1985-01-01 23:20:00 ┆ 562     │
│ 1985-01-01 23:25:00 ┆ 564     │
│ 1985-01-01 23:30:00 ┆ 566     │
│ 1985-01-01 23:35:00 ┆ 568     │
│ 1985-01-01 23:40:00 ┆ 570     │
│ 1985-01-01 23:45:00 ┆ 572     │
│ 1985-01-01 23:50:00 ┆ 574     │
│ 1985-01-01 23:55:00 ┆ 576     │
**│ 1985-01-02 00:00:00 ┆   2     │**
│ 1985-01-02 00:05:00 ┆   4     │
│ 1985-01-02 00:10:00 ┆   6     │
│ 1985-01-02 00:15:00 ┆   8     │
│ 1985-01-02 00:20:00 ┆  10     │
│ 1985-01-02 00:25:00 ┆  12     │
│ 1985-01-02 00:30:00 ┆  14     │
│ 1985-01-02 00:35:00 ┆  18     │
│ 1985-01-02 00:40:00 ┆  20     │
│ 1985-01-02 00:45:00 ┆  22     │
│ 1985-01-02 00:50:00 ┆  24     │

Basically, I want to just reset the rolling sum each day. In Pandas, I'd have done it sort of like this:

tn = outdata.groupby(pd.Grouper(freq='D', key='_daily_reset'))[
    #     '_vp_volume'].transform('cumsum')

Thanks!

答案1

得分: 1

.dt.truncate() 可以对日期进行分组/桶化,然后您可以使用 .cumsum().over() 进行操作。

df.with_columns(
   pl.col("A").cumsum().over(pl.col("date").dt.truncate("1d"))
)
shape: (1_441, 2)
┌─────────────────────┬─────┐
│ date                ┆ A   │
│ ---                 ┆ --- │
│ datetime[ns]        ┆ i32 │
╞═════════════════════╪═════╡
│ 1985-01-01 00:00:00 ┆ 2   │
│ 1985-01-01 00:05:00 ┆ 4   │
│ 1985-01-01 00:10:00 ┆ 6   │
│ 1985-01-01 00:15:00 ┆ 8   │
│ 1985-01-01 00:20:00 ┆ 10  │
│ 1985-01-01 00:25:00 ┆ 12  │
│ …                   ┆ …   │
│ 1985-01-05 23:35:00 ┆ 568 │
│ 1985-01-05 23:40:00 ┆ 570 │
│ 1985-01-05 23:45:00 ┆ 572 │
│ 1985-01-05 23:50:00 ┆ 574 │
│ 1985-01-05 23:55:00 ┆ 576 │
│ 1985-01-06 00:00:00 ┆ 2   │
└─────────────────────┴─────┘
英文:

.dt.truncate() can group/bucket the dates which you can .cumsum().over()

df.with_columns(
   pl.col("A").cumsum().over(pl.col("date").dt.truncate("1d"))
)
shape: (1_441, 2)
┌─────────────────────┬─────┐
│ date                ┆ A   │
│ ---                 ┆ --- │
│ datetime[ns]        ┆ i32 │
╞═════════════════════╪═════╡
│ 1985-01-01 00:00:00 ┆ 2   │
│ 1985-01-01 00:05:00 ┆ 4   │
│ 1985-01-01 00:10:00 ┆ 6   │
│ 1985-01-01 00:15:00 ┆ 8   │
│ 1985-01-01 00:20:00 ┆ 10  │
│ 1985-01-01 00:25:00 ┆ 12  │
│ …                   ┆ …   │
│ 1985-01-05 23:35:00 ┆ 568 │
│ 1985-01-05 23:40:00 ┆ 570 │
│ 1985-01-05 23:45:00 ┆ 572 │
│ 1985-01-05 23:50:00 ┆ 574 │
│ 1985-01-05 23:55:00 ┆ 576 │
│ 1985-01-06 00:00:00 ┆ 2   │
└─────────────────────┴─────┘

huangapple
  • 本文由 发表于 2023年6月9日 03:31:34
  • 转载请务必保留本文链接:https://go.coder-hub.com/76435153.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定