2023-06-27 20:07:07

Polars Shuffle And Split DataFrame With Grouping

Question

I am using polars for all preprocessing and feature engineering. I want to shuffle the data before performing a train/valid/test split.

A training 'example' consists of multiple rows. The number of rows per example varies. Here is a simple contrived example (Note I am actually using a LazyFrame in my code):

import polars as pl

example_df = pl.DataFrame({
  "example_id": [1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
  "other_col": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

┌────────────┬───────────┐
│ example_id ┆ other_col │
│ ---        ┆ ---       │
│ i64        ┆ i64       │
╞════════════╪═══════════╡
│ 1          ┆ 1         │
│ 1          ┆ 2         │
│ 2          ┆ 3         │
│ 2          ┆ 4         │
│ 2          ┆ 5         │
│ 3          ┆ 6         │
│ 3          ┆ 7         │
│ 3          ┆ 8         │
│ 4          ┆ 9         │
│ 4          ┆ 10        │
└────────────┴───────────┘

I want to shuffle 'over' the example_id column, while keeping the examples grouped together. Producing a result something like this:

┌────────────┬───────────┐
│ example_id ┆ other_col │
│ ---        ┆ ---       │
│ i64        ┆ i64       │
╞════════════╪═══════════╡
│ 2          ┆ 3         │
│ 2          ┆ 4         │
│ 2          ┆ 5         │
│ 1          ┆ 1         │
│ 1          ┆ 2         │
│ 4          ┆ 9         │
│ 4          ┆ 10        │
│ 3          ┆ 6         │
│ 3          ┆ 7         │
│ 3          ┆ 8         │
└────────────┴───────────┘

I then want to split the data fractionally, for example 0.6, 0.2, 0.2 for training, validation and testing respectively, but based on 'whole examples' rather than on individual rows.

Is there a clean way to do this in polars without having to convert the example_id to an array, shuffling it, splitting into sublists, then reselecting from the original dataframe?

Answer 1

Score: 1

There must be a far cleaner way of achieving this; hopefully someone can improve on it. It also requires collecting the dataframe, which is not ideal. Either way, it seems to work for now. Thanks @jqurious for the pointer.

Grab the unique example_ids, shuffle them and add a row count.

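example_ids = (
  example_df
  .select("example_id")
  .unique()
  # fraction=1 with shuffle=True returns every id, in random order
  .sample(fraction=1, shuffle=True)
  # adds a "row_nr" column (newer Polars releases deprecate this in favor of with_row_index)
  .with_row_count()
)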

Split the unique ids into subsets using the row count.

# assume we'll test on remaining data
train_frac = 0.6
valid_frac = 0.2
# row_nr starts at 0, so max() is (number of ids - 1)
train_ids = example_ids.filter(
    pl.col("row_nr") < pl.col("row_nr").max() * train_frac
)
# is_between includes both bounds by default
valid_ids = example_ids.filter(
    pl.col("row_nr").is_between(
        pl.col("row_nr").max() * train_frac,
        pl.col("row_nr").max() * (train_frac + valid_frac),
    )
)
test_ids = example_ids.filter(
    pl.col("row_nr") > pl.col("row_nr").max() * (train_frac + valid_frac)
)
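
To make the cut-offs above concrete, here is the same threshold arithmetic in plain Python for the 4 unique ids in the example. The helper name assign_split is made up for illustration; it only mirrors the three filters (strict less-than for train, inclusive range for valid, strict greater-than for test).

```python
def assign_split(row_nr, n_ids, train_frac=0.6, valid_frac=0.2):
    """Mirror the three row_nr filters for a single shuffled id position."""
    max_row_nr = n_ids - 1  # row_nr is 0-based, so max() is n_ids - 1
    train_cut = max_row_nr * train_frac
    valid_cut = max_row_nr * (train_frac + valid_frac)
    if row_nr < train_cut:
        return "train"
    if row_nr <= valid_cut:  # inclusive upper bound, like is_between
        return "valid"
    return "test"


# 4 unique ids -> cut-offs at roughly 1.8 and 2.4
splits = [assign_split(i, 4) for i in range(4)]
print(splits)  # ['train', 'train', 'valid', 'test']
```

So with 4 ids the first two shuffled ids go to train, the third to valid and the last to test. Every position lands in exactly one subset.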

Join each subset back to example_df and drop the row_nr column.

train_df = example_df.join(train_ids, on="example_id").drop("row_nr")
valid_df = example_df.join(valid_ids, on="example_id").drop("row_nr")
test_df = example_df.join(test_ids, on="example_id").drop("row_nr")

This will produce three dataframes, something like this:

┌────────────┬───────────┐
│ example_id ┆ other_col │
│ ---        ┆ ---       │
│ i64        ┆ i64       │
╞════════════╪═══════════╡
│ 1          ┆ 1         │
│ 1          ┆ 2         │
│ 3          ┆ 6         │
│ 3          ┆ 7         │
│ 3          ┆ 8         │
└────────────┴───────────┘

┌────────────┬───────────┐
│ example_id ┆ other_col │
│ ---        ┆ ---       │
│ i64        ┆ i64       │
╞════════════╪═══════════╡
│ 2          ┆ 3         │
│ 2          ┆ 4         │
│ 2          ┆ 5         │
└────────────┴───────────┘

┌────────────┬───────────┐
│ example_id ┆ other_col │
│ ---        ┆ ---       │
│ i64        ┆ i64       │
╞════════════╪═══════════╡
│ 4          ┆ 9         │
│ 4          ┆ 10        │
└────────────┴───────────┘
