如何实现groupby_dynamic,然后进行unstack操作?

huangapple go评论83阅读模式
英文:

How to achieve groupby_dynamic followed by unstack?

问题

以下是您提供的内容的翻译:

Context

我有一个 Polars DataFrame("df"),包含一个时间列("date"),一个可能的 ID 列("id")和一些数字列(即特征)。形状是(14852,431)。

df 表示有关财务交易的数据。ID 表示客户;日期是进行这些交易的月份的起始日期。特征是一些“平均值”(例如,平均花费的金额,交易数量等)。

经过适当的处理,此 df 将被馈送到机器学习模型进行培训。

Aim

从定性上讲,我正在尝试做以下事情:

  1. 为每个唯一的 ID(客户)创建一个为期 6 个月的滑动窗口;
  2. 限制 df 到该期间内该 ID 的交易;
  3. “展开”其行。也就是说,如果在给定时间窗口内某个 ID 有 6 笔交易,即我们有一个形状为(6,431)的“restricted_df”,那么我需要用一个形状为(1,431 * 6)的“unstacked_df”来替换它。实际上,如果我有一个特征名称为“money_spent”在“restricted_df”中,那么“unstacked_df”应该包含类似“money_spent_0”,“money_spent_1”,...,“money_spent_5”这样的东西。

Approach

我大致了解这个难题的各个部分,尽管我承认我对 Polars 还很陌生。在我看来,它们是:

  1. groupby_dynamic("date", every="180d", period="180d", by="id")
  2. unstack(step=1, columns=features)

然而,我无法真正让它们有效地工作,至少不是以一种高效的方式。请参见下面的低效解决方案。

Issue

我认为主要问题是,据我理解,经过 groupby_dynamic 后,Polars 期望 .agg 应用于单列,例如通过 pl.col("foo").some_function()。然而,Series 不带有 unstack 方法,因此它不太适用。

Attempted Solution

一个高度低效的方法是将上述 Series 转换为 DataFrame,然后对其进行展开。但是,这本身并不能完全解决问题。实际上,我们最终只会得到一个具有相同 431 列的 DataFrame,其中每一列都包含一个 DataFrame(我们展开的那个)的副本。

这是通过以下方式获得的:

df.groupby_dynamic("date", every="180d", period="180d", by="id").agg(pl.col(features).apply(lambda x: pl.DataFrame(x).unstack(step=1)))

对于特征“foo”的示意图如下:

|       foo           |
|col_0 1, col_1 2, ...| 

而不是期望的:

|foo_0|foo_1|...|
|  1  |  2  |...|

为了纠正这个问题,我们可以插入 to_dict(),然后在最后使用 unnest 函数。这是通过以下方式获得的:

df.groupby_dynamic("date", every="180d", period="180d", by="id").agg(pl.col(features).apply(lambda x: pl.DataFrame(x).unstack(step=1).to_dict())).unnest()

Question

这可能会起作用,但显然非常低效,我认为这有点过火。有没有办法可以完成这项任务?

Minimal Example

import numpy as np
import polars as pl
from datetime import date

# 生成虚假数据
ids = [1]*6 + [2]*6
start = date(2023, 1, 1)
end =  date(2023, 12, 1)
dates = pl.date_range(start, end, "1mo", name="date", eager=True)
foos = np.arange(0, 12)
bars = np.arange(12, 24)

# 生成 df
df = pl.DataFrame({"id":ids, "date":dates, "foo":foos, "bar":bars})

# 打印 df
print(df)
...

请注意,上述示例是用 Python 编写的,其中包含了一些代码部分,您可以根据需要将其集成到您的环境中。

英文:

Context

I have a Polars DataFrame ("df") consisting of a temporal column ("date"), a would-be ID column ("id") and a number of numerical columns (i.e. features). The shape is (14852, 431).

The df represents data about financial transactions. The ID indicates the client; the date is the starting date of the month those transactions were carried out. The features are some "averages" (e.g. average money spent, number of transactions, etc.).

After appropriate manipulation, this df is to be fed into a Machine Learning model for training purposes.

Aim

Qualitatively I am trying to do the following:

  1. For each unique ID (client), create a 6-months sliding window;
  2. Restrict the df to the transactions in that period for that ID;
  3. "Unstack" its rows. That is: if in a given time window there are 6 transactions for a given ID, i.e. we have a restricted_df of shape (6, 431), I would need to replace it with a unstacked_df of shape (1, 431 * 6). In practice, if I have a feature name "money_spent" in restricted_df, then unstacked_df should contain something like "money_spent_0", "money_spent_1", ..., "money_spent_5".

Approach

I sort of know the various pieces of the puzzle, even though I am admittedly new to Polars. In my opinion they are:

  1. groupby_dynamic("date", every="180d", period="180d", by="id")
  2. unstack(step=1, columns=features)

However, I can't really get them to work, at least not in an efficient manner. See below for an inefficient solution.

Issue

I reckon the main issue is that, as far as my understanding goes, after groupby_dynamic Polars expects .agg that is meant to be applied on single columns, e.g. via pl.col("foo").some_function(). However, Series does not come with the unstack method, so it does not quite work.

Attempted Solution

One highly inefficient approach is to convert the aforementioned Series to a DataFrame, and then unstack this instead. However, this alone does not quite solve the problem. In fact, we merely end up with a df with the same 431 columns, where each one of these contains a DataFrame (the one we unstacked) for each row.

This is obtained via

df.groupby_dynamic("date", every="180d", period="180d", by="id").agg(pl.col(features).apply(lambda x: pl.DataFrame(x).unstack(step=1)))

Schematically for, a feature "foo", we end up with something of the form

|       foo           |
|col_0 1, col_1 2, ...| 

Instead of the desired

|foo_0|foo_1|...|
|  1  |  2  |...|

To remedy this we could insert a to_dict() and, at the end, use the unnest function. This is obtained via

df.groupby_dynamic("date", every="180d", period="180d", by="id").agg(pl.col(features).apply(lambda x: pl.DataFrame(x).unstack(step=1).to_dict())).unnest()

Question

This could work, but is obviously very inefficient, and seems overkill to me. Is there any way we can get this thing done?

Minimal Example

import numpy as np
import polars as pl
from datetime import date

# Generate fake data
ids = [1]*6 + [2]*6
start = date(2023, 1, 1)
end =  date(2023, 12, 1)
dates = pl.date_range(start, end, "1mo", name="date", eager=True)
foos = np.arange(0, 12)
bars = np.arange(12, 24)

# Generate df
df = pl.DataFrame({"id":ids, "date":dates, "foo":foos, "bar":bars})

# Print df
print(df)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ …   ┆ …          ┆ …   ┆ …   │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │

# Group df as required
grouped_df = df.groupby_dynamic("date", every="180d", period="180d", by="id")

# Check group content
for _name, group in grouped_df:

  print(group)

shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 1   ┆ 2023-01-01 ┆ 0   ┆ 12  │
│ 1   ┆ 2023-02-01 ┆ 1   ┆ 13  │
│ 1   ┆ 2023-03-01 ┆ 2   ┆ 14  │
│ 1   ┆ 2023-04-01 ┆ 3   ┆ 15  │
│ 1   ┆ 2023-05-01 ┆ 4   ┆ 16  │
│ 1   ┆ 2023-06-01 ┆ 5   ┆ 17  │
└─────┴────────────┴─────┴─────┘
shape: (6, 4)
┌─────┬────────────┬─────┬─────┐
│ id  ┆ date       ┆ foo ┆ bar │
│ --- ┆ ---        ┆ --- ┆ --- │
│ i64 ┆ date       ┆ i64 ┆ i64 │
╞═════╪════════════╪═════╪═════╡
│ 2   ┆ 2023-07-01 ┆ 6   ┆ 18  │
│ 2   ┆ 2023-08-01 ┆ 7   ┆ 19  │
│ 2   ┆ 2023-09-01 ┆ 8   ┆ 20  │
│ 2   ┆ 2023-10-01 ┆ 9   ┆ 21  │
│ 2   ┆ 2023-11-01 ┆ 10  ┆ 22  │
│ 2   ┆ 2023-12-01 ┆ 11  ┆ 23  │
└─────┴────────────┴─────┴─────┘

# Manipulation
result = ...

# Expected output after correct manipulation
print(result)

shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘

答案1

得分: 2

以下是您要翻译的内容:

  • 似乎 to_struct 可能是你需要的缺失部分?
  • 我们可以使用 n_field_strategy="max_width" 来确保所有结果具有相同的 "长度"。
  • fields= 可以接受可调用对象,在这种情况下,你想将列名作为前缀添加。
  • 然后,你可以展开生成的结构列。
features = "foo", "bar"

(df.groupby_dynamic(index_column="date", by="id", every="6mo")
   .agg(pl.col(features))
   .with_columns(
      pl.col(feature)
        .arr.to_struct(
           fields = lambda idx, feature=feature: f"{feature}_{idx}", 
           n_field_strategy = "max_width"
        )
      for feature in features
   )
   .unnest(*features)
)
shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘

lambda idx, feature=feature: 中的 feature=feature 是由于 Python 中的 lambda 在循环/推导式内部存在延迟绑定问题。

详细信息请参阅:Python官方文档

英文:

It looks like .to_struct may be the missing part you need?

We can use n_field_strategy="max_width" to ensure all results have the same "length".

fields= can take a callable, in this case you want to add the column name as a prefix.

You can then unnest the resulting struct columns:

features = "foo", "bar"

(df.groupby_dynamic(index_column="date", by="id", every="6mo")
   .agg(pl.col(features))
   .with_columns(
      pl.col(feature)
        .arr.to_struct(
           fields = lambda idx, feature=feature: f"{feature}_{idx}", 
           n_field_strategy = "max_width"
        )
      for feature in features
   )
   .unnest(*features)
)
shape: (2, 14)
┌─────┬────────────┬───────┬───────┬───┬───────┬───────┬───────┬───────┐
│ id  ┆ date       ┆ foo_0 ┆ foo_1 ┆ … ┆ bar_2 ┆ bar_3 ┆ bar_4 ┆ bar_5 │
│ --- ┆ ---        ┆ ---   ┆ ---   ┆   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ date       ┆ i64   ┆ i64   ┆   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
╞═════╪════════════╪═══════╪═══════╪═══╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ 2023-01-01 ┆ 0     ┆ 1     ┆ … ┆ 14    ┆ 15    ┆ 16    ┆ 17    │
│ 2   ┆ 2023-07-01 ┆ 6     ┆ 7     ┆ … ┆ 20    ┆ 21    ┆ 22    ┆ 23    │
└─────┴────────────┴───────┴───────┴───┴───────┴───────┴───────┴───────┘

The reason for feature=feature in lambda idx, feature=feature: is due to the late binding issue with lambdas inside loops/comprehensions in Python.

https://docs.python.org/3/faq/programming.html#why-do-lambdas-defined-in-a-loop-with-different-values-all-return-the-same-result

huangapple
  • 本文由 发表于 2023年5月26日 16:46:02
  • 转载请务必保留本文链接:https://go.coder-hub.com/76339152.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定