Python Polars Expression For Cumulative Row Count by Day by Category

Question

Given a Polars dataframe:

import datetime
import polars as pl


df = pl.DataFrame(
    {
        "time": [
            datetime.datetime(2023, 1, 1, 9),
            datetime.datetime(2023, 1, 1, 10),
            datetime.datetime(2023, 1, 1, 12),
            datetime.datetime(2023, 1, 2, 9),
            datetime.datetime(2023, 1, 2, 10),
            datetime.datetime(2023, 1, 3, 12),
        ],
        "category": [1,1,2,1,2,1],
    }
)

I am seeking an expression row_count_by_day such that

expr = ...alias("row_count_by_day")
df = df.with_columns(expr)
print(df)

yields

shape: (6, 3)
┌─────────────────────┬──────────┬──────────────────┐
│ time                ┆ category ┆ row_count_by_day │
│ ---                 ┆ ---      ┆ ---              │
│ datetime[μs]        ┆ i64      ┆ i64              │
╞═════════════════════╪══════════╪══════════════════╡
│ 2023-01-01 09:00:00 ┆ 1        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 10:00:00 ┆ 1        ┆ 2                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-01 12:00:00 ┆ 2        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 09:00:00 ┆ 1        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-02 10:00:00 ┆ 2        ┆ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2023-01-03 12:00:00 ┆ 1        ┆ 1                │
└─────────────────────┴──────────┴──────────────────┘

That is, the cumulative row count for a particular category in a single day.

Ideally, I'd also like to extend this to multiple category columns: e.g. if category 1 has 5 rows with the same value in another column, they'd only be counted once.

I tried several combinations of window functions over category but, being a novice in Polars, I didn't get close. My best guess was some sort of rolling_sum of a constant column of ones over the category with a window size of 1d, but I couldn't get this to work.

Answer 1

Score: 1

It sounds like .dt.truncate is the piece you're missing.

df.with_columns(
   pl.first().cumcount()
     .over(pl.col("time").dt.truncate("1d"), "category")
     .alias("count") + 1
)
shape: (6, 3)
┌─────────────────────┬──────────┬───────┐
│ time                ┆ category ┆ count │
│ ---                 ┆ ---      ┆ ---   │
│ datetime[μs]        ┆ i64      ┆ u32   │
╞═════════════════════╪══════════╪═══════╡
│ 2023-01-01 09:00:00 ┆ 1        ┆ 1     │
│ 2023-01-01 10:00:00 ┆ 1        ┆ 2     │
│ 2023-01-01 12:00:00 ┆ 2        ┆ 1     │
│ 2023-01-02 09:00:00 ┆ 1        ┆ 1     │
│ 2023-01-02 10:00:00 ┆ 2        ┆ 1     │
│ 2023-01-03 12:00:00 ┆ 1        ┆ 1     │
└─────────────────────┴──────────┴───────┘
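
If you are on a newer Polars release, note that cumcount and pl.count() have since been renamed (to cum_count and pl.len()). The following is a rough sketch of an equivalent spelling under that assumption, reusing the df from the question; exact method names depend on your version:

import polars as pl

# Sketch for newer Polars versions (assumes the `df` from the question is in scope).
# pl.int_range(pl.len()) numbers the rows 0..n-1 within each window,
# so adding 1 gives the 1-based cumulative row count per day and category.
df.with_columns(
    (pl.int_range(pl.len()) + 1)
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("row_count_by_day")
)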

Answer 2

Score: 1

You can do it like this:

df.with_columns(
    row_count_by_day=pl.repeat(1, pl.count())
    .cumsum()
    .over(
        pl.col('time')
        .dt.truncate("1d"),
        'category'))

The weird thing here is pl.repeat(1, pl.count()), which seems like it should be the same as pl.lit(1), but it isn't, because of the way Polars broadcasts the scalar 1. If you instead use two contexts like this:

df.with_columns(one=pl.lit(1)).with_columns(
    row_count_by_day=pl.col('one')
    .cumsum()
    .over(
        pl.col('time')
        .dt.truncate("1d"),
        'category')).drop('one')

then the first with_columns broadcasts the 1 to all rows, so the second context works when it refers to the one column.
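
For the extension mentioned in the question, where rows of a category that repeat a value in another column should only be counted once, one possible sketch is to cumulatively sum an "is first occurrence" flag inside the same window. The other column below is hypothetical, and depending on your Polars version the method is called is_first or is_first_distinct:

import polars as pl

# Hypothetical extra column `other`; within a (day, category) window,
# rows that repeat an `other` value should only be counted once.
df2 = df.with_columns(pl.Series("other", [10, 10, 20, 30, 20, 40]))

df2.with_columns(
    pl.col("other")
    .is_first()       # True for the first occurrence of each `other` value in the window
    .cast(pl.UInt32)  # 1 for new values, 0 for repeats
    .cumsum()         # running count of distinct `other` values seen so far
    .over(pl.col("time").dt.truncate("1d"), "category")
    .alias("distinct_count_by_day")
)

Because the flag is computed inside the window, a value that reappears on a different day (or for a different category) is counted again.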
