2023-02-16 16:26:49

Python-polars: rolling_sum where the window_size from another column

Question

Consider the following dataframe:

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "date": pl.date_range(
            low=datetime(2023, 2, 1),
            high=datetime(2023, 2, 5),
            interval="1d"),
        "periods": [2, 2, 2, 1, 1],
        "quantity": [10, 12, 14, 16, 18],
        "calculate": [22, 26, 30, 16, 18]
    }
)

The calculate column is what I want. It is a rolling_sum of quantity where the window_size for each row is taken from the periods column rather than being a fixed value.

I can do the following (window_size=2):

df.select(pl.col("quantity").rolling_sum(window_size=2))

However, I get an error when I try to do this:

df.select(pl.col("quantity").rolling_sum(window_size=pl.col("periods")))

This is the error:

TypeError: argument 'window_size': 'Expr' object cannot be converted to 'PyString'

How do I pass the value of window_size based on another column? I also looked at using groupby_rolling but could not figure that out either.
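To pin down what the calculate column should contain, here is a minimal plain-Python sketch (not Polars, and not part of the original post) of a rolling sum whose window size varies per row. The function name variable_window_sum is hypothetical. It assumes each window starts at the current row and extends forward, which is what the example values imply (10 + 12 = 22 for the first row).

```python
def variable_window_sum(values, window_sizes):
    """Sum values[i : i + window_sizes[i]] for every row i.

    Assumption: windows look forward from the current row and are
    truncated if they run past the end of the data.
    """
    return [sum(values[i:i + size]) for i, size in enumerate(window_sizes)]


quantity = [10, 12, 14, 16, 18]
periods = [2, 2, 2, 1, 1]
calculate = variable_window_sum(quantity, periods)
print(calculate)  # [22, 26, 30, 16, 18]
```

A reference like this is handy for checking any Polars solution against small inputs.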

Answer 1

Score: 1

It seems like this should be easier to do, which suggests I may be missing something obvious.

As a workaround, you could use the row count to generate row indexes for the windows.

(
   df
   .with_row_count()
   .with_columns(
      window =
         pl.arange(
            pl.col("row_nr"),
            pl.col("row_nr") + pl.col("periods")))
)

shape: (5, 6)
┌────────┬─────────────────────┬─────────┬──────────┬───────────┬───────────┐
│ row_nr ┆ date                ┆ periods ┆ quantity ┆ calculate ┆ window    │
│ ---    ┆ ---                 ┆ ---     ┆ ---      ┆ ---       ┆ ---       │
│ u32    ┆ datetime[μs]        ┆ i64     ┆ i64      ┆ i64       ┆ list[i64] │
╞════════╪═════════════════════╪═════════╪══════════╪═══════════╪═══════════╡
│ 0      ┆ 2023-02-01 00:00:00 ┆ 2       ┆ 10       ┆ 22        ┆ [0, 1]    │
│ 1      ┆ 2023-02-02 00:00:00 ┆ 2       ┆ 12       ┆ 26        ┆ [1, 2]    │
│ 2      ┆ 2023-02-03 00:00:00 ┆ 2       ┆ 14       ┆ 30        ┆ [2, 3]    │
│ 3      ┆ 2023-02-04 00:00:00 ┆ 1       ┆ 16       ┆ 16        ┆ [3]       │
│ 4      ┆ 2023-02-05 00:00:00 ┆ 1       ┆ 18       ┆ 18        ┆ [4]       │
└────────┴─────────────────────┴─────────┴──────────┴───────────┴───────────┘

You could .explode() the window and use .take() + .search_sorted() to find the corresponding values.

.groupby() can be used to combine the window values again.

(
   df
   .with_row_count()
   .with_columns(
      window =
         pl.arange(
            pl.col("row_nr"),
            pl.col("row_nr") + pl.col("periods")))
   .explode("window")
   .with_columns(
      rolling =
         pl.col("quantity")
           .take(pl.col("row_nr").search_sorted("window")))
   .groupby("row_nr", maintain_order=True)
   .agg([
      pl.exclude("rolling").first(),
      pl.col("rolling").sum()
   ])
)

shape: (5, 7)
┌────────┬─────────────────────┬─────────┬──────────┬───────────┬────────┬─────────┐
│ row_nr ┆ date                ┆ periods ┆ quantity ┆ calculate ┆ window ┆ rolling │
│ ---    ┆ ---                 ┆ ---     ┆ ---      ┆ ---       ┆ ---    ┆ ---     │
│ u32    ┆ datetime[μs]        ┆ i64     ┆ i64      ┆ i64       ┆ i64    ┆ i64     │
╞════════╪═════════════════════╪═════════╪══════════╪═══════════╪════════╪═════════╡
│ 0      ┆ 2023-02-01 00:00:00 ┆ 2       ┆ 10       ┆ 22        ┆ 0      ┆ 22      │
│ 1      ┆ 2023-02-02 00:00:00 ┆ 2       ┆ 12       ┆ 26        ┆ 1      ┆ 26      │
│ 2      ┆ 2023-02-03 00:00:00 ┆ 2       ┆ 14       ┆ 30        ┆ 2      ┆ 30      │
│ 3      ┆ 2023-02-04 00:00:00 ┆ 1       ┆ 16       ┆ 16        ┆ 3      ┆ 16      │
│ 4      ┆ 2023-02-05 00:00:00 ┆ 1       ┆ 18       ┆ 18        ┆ 4      ┆ 18      │
└────────┴─────────────────────┴─────────┴──────────┴───────────┴────────┴─────────┘
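The window, explode, lookup, and group steps can be mirrored in plain Python to see what each stage produces. This is only an illustrative sketch, assuming no window runs past the last row. The variable names are made up for the illustration and are not Polars API.

```python
from collections import defaultdict

quantity = [10, 12, 14, 16, 18]
periods = [2, 2, 2, 1, 1]

# Step 1: one list of row indexes per row (the "window" column).
windows = [list(range(i, i + p)) for i, p in enumerate(periods)]

# Step 2: "explode" into (row_nr, window_index) pairs, one per window element.
exploded = [(row_nr, idx) for row_nr, window in enumerate(windows) for idx in window]

# Step 3: look up the quantity at each window index, then group by row_nr and sum.
rolling = defaultdict(int)
for row_nr, idx in exploded:
    rolling[row_nr] += quantity[idx]

result = [rolling[row_nr] for row_nr in range(len(quantity))]
print(windows)  # [[0, 1], [1, 2], [2, 3], [3], [4]]
print(result)   # [22, 26, 30, 16, 18]
```

The exploded intermediate has one entry per window element (8 here), which is why this approach uses more memory as the windows get larger.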

Answer 2

Score: 1

Very similar to @jqurious's, but (I think) a bit simplified:

(
    df.lazy()
    .with_row_count('i')
    .with_columns(
        window=pl.arange(
            pl.col("i"),
            pl.col("i") + pl.col("periods")),
        qty=pl.col('quantity').list()
    )
    .with_columns(
        rollsum=pl.col('qty').arr.take(pl.col('window')).arr.sum()
    )
    .select(pl.exclude(['window', 'qty', 'i']))
    .collect()
)

It works on the same concept, but it essentially recreates the whole quantity column as a list on every row, then uses the window column to pick the corresponding values from that list and sums them.

Another method is to use a loop, which will be more memory efficient.

First, get all the unique values of periods, then initialize a column in the df for the rolling sum, sort the rows in reverse order, and then replace the column with a calculation for each period. The reversal is needed because rolling_sum looks backward from the current row, while the desired window covers the current row and the rows after it. At the end, put the rows back in the original order.

periods = df.get_column('periods').unique()
# Sort descending so the backward-looking rolling_sum covers the current and following rows.
df = df.with_columns(pl.lit(None).cast(pl.Float64()).alias("rollsum")).sort('date', reverse=True)
for period in periods:
    # One fixed-window pass per distinct period; keep it only where periods matches.
    df = df.with_columns(
        pl.when(pl.col('periods') == period)
        .then(pl.col('quantity').rolling_sum(window_size=period))
        .otherwise(pl.col('rollsum'))
        .alias('rollsum'))
df = df.sort('date')
df
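The loop approach can be traced in plain Python as well. This is a rough sketch rather than the Polars implementation: the helper rolling_sum below is a hypothetical stand-in that returns None until the window is full, which is assumed here to mirror the default behaviour of the Polars expression.

```python
def rolling_sum(values, window_size):
    """Backward-looking fixed-window sum; None until the window is full."""
    return [
        sum(values[i - window_size + 1:i + 1]) if i + 1 >= window_size else None
        for i in range(len(values))
    ]


quantity = [10, 12, 14, 16, 18]
periods = [2, 2, 2, 1, 1]

# Reverse so that a backward-looking window covers the current and following rows.
rev_quantity = quantity[::-1]
rev_periods = periods[::-1]
rollsum = [None] * len(quantity)

for period in sorted(set(periods)):
    fixed = rolling_sum(rev_quantity, period)  # one pass per distinct window size
    # Keep the new value only on rows whose own period matches this pass.
    rollsum = [f if p == period else r for f, p, r in zip(fixed, rev_periods, rollsum)]

rollsum = rollsum[::-1]  # restore the original row order
print(rollsum)  # [22, 26, 30, 16, 18]
```

The number of passes equals the number of distinct window sizes, and each pass only needs one extra column, so this trades a little looping for a smaller memory footprint than the exploded or per-row list intermediates.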
