Python Polars Expression For Cumulative Row Count by Day by Category
Question
Given a Polars DataFrame:
import datetime
import polars as pl

df = pl.DataFrame(
    {
        "time": [
            datetime.datetime(2023, 1, 1, 9),
            datetime.datetime(2023, 1, 1, 10),
            datetime.datetime(2023, 1, 1, 12),
            datetime.datetime(2023, 1, 2, 9),
            datetime.datetime(2023, 1, 2, 10),
            datetime.datetime(2023, 1, 3, 12),
        ],
        "category": [1, 1, 2, 1, 2, 1],
    }
)
I am seeking an expression row_count_by_day such that
expr = ...alias("row_count_by_day")
df = df.with_columns(expr)
print(df)
yields
shape: (6, 3)
┌─────────────────────┬──────────┬──────────────────┐
│ time ┆ category ┆ row_count_by_day │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪══════════╪══════════════════╡
│ 2023-01-01 09:00:00 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 10:00:00 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 12:00:00 ┆ 2 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 09:00:00 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 10:00:00 ┆ 2 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-03 12:00:00 ┆ 1 ┆ 1 │
└─────────────────────┴──────────┴──────────────────┘
That is, the cumulative row count for a particular category within a single day.
Ideally I'd also like to extend this to multiple grouping columns. For example, if category 1 has 5 rows with the same value in another column, they should only be counted once.
I tried several combinations of window functions over category, but being a novice in Polars, I didn't get close. My best guess was some sort of rolling_sum of a constant column of ones over the category with a window size of 1d, but I couldn't get this to work.
Answer 1
Score: 1
It sounds like .dt.truncate is the piece you're missing.
df.with_columns(
    # cumcount() is 0-based, hence the + 1; truncating the timestamp to the
    # day makes the window restart for each (day, category) pair
    pl.first().cumcount()
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("count") + 1
)
shape: (6, 3)
┌─────────────────────┬──────────┬───────┐
│ time ┆ category ┆ count │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ u32 │
╞═════════════════════╪══════════╪═══════╡
│ 2023-01-01 09:00:00 ┆ 1 ┆ 1 │
│ 2023-01-01 10:00:00 ┆ 1 ┆ 2 │
│ 2023-01-01 12:00:00 ┆ 2 ┆ 1 │
│ 2023-01-02 09:00:00 ┆ 1 ┆ 1 │
│ 2023-01-02 10:00:00 ┆ 2 ┆ 1 │
│ 2023-01-03 12:00:00 ┆ 1 ┆ 1 │
└─────────────────────┴──────────┴───────┘
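For reference, on newer Polars versions several of these names have changed (cumcount is now cum_count, with_column is with_columns). A minimal sketch of the same idea using pl.int_range, which can count from 1 directly so the + 1 disappears, assuming a recent release that has pl.len():

import polars as pl

df = df.with_columns(
    # pl.len() is the group length inside over(), so int_range(1, len + 1)
    # yields 1..n within each (day, category) window
    pl.int_range(1, pl.len() + 1)
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("row_count_by_day")
)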
Answer 2
Score: 1
You can do it like this:
df.with_columns(
    # a column of ones, cumulatively summed within each (day, category) window
    row_count_by_day=pl.repeat(1, pl.count())
    .cumsum()
    .over(
        pl.col('time').dt.truncate("1d"),
        'category'
    )
)
The weird thing here is the pl.repeat(1, pl.count()), which seems like it's the same thing as pl.lit(1), but it's not, because of the way Polars broadcasts the scalar 1. If you do two contexts like this:
df.with_columns(one=pl.lit(1)).with_columns(
    row_count_by_day=pl.col('one')
    .cumsum()
    .over(
        pl.col('time').dt.truncate("1d"),
        'category'
    )
).drop('one')
then the first with_columns broadcasts the 1 to all rows, so the second context works when referring to the one column.
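Neither answer covers the asker's follow-up about duplicates (rows of category 1 that share the same value in another column should only be counted once). A minimal sketch of one way to get there, assuming a hypothetical extra column named value and a recent Polars version that has is_first_distinct and cum_sum: flag the first occurrence of each value inside the (day, category) window, then cumulatively sum the flags.

import datetime
import polars as pl

# Hypothetical "value" column: the two category-1 rows on 2023-01-01 share
# value 7, so they should count as one row.
df = pl.DataFrame(
    {
        "time": [
            datetime.datetime(2023, 1, 1, 9),
            datetime.datetime(2023, 1, 1, 10),
            datetime.datetime(2023, 1, 1, 12),
        ],
        "category": [1, 1, 2],
        "value": [7, 7, 8],
    }
)

df = df.with_columns(
    # is_first_distinct() is True only at the first occurrence of each value
    # inside the window; cum_sum() then counts the distinct values seen so far
    pl.col("value")
    .is_first_distinct()
    .cum_sum()
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("row_count_by_day")
)
print(df)  # the 09:00 and 10:00 category-1 rows both get row_count_by_day == 1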