Splitting a lazyframe into two frames by fraction of rows to make a train-test split

Question


I have a train_test_split function in Polars that can handle an eager DataFrame. I wish to write an equivalent function that can take a LazyFrame as input and return two LazyFrames without evaluating them.

My function is as follows. It shuffles all rows, and then splits it using row-indexing based on the height of the full frame.

import polars as pl


def train_test_split(
    df: pl.DataFrame, train_fraction: float = 0.75
) -> tuple[pl.DataFrame, pl.DataFrame]:
    """Split polars dataframe into two sets.
    Args:
        df (pl.DataFrame): Dataframe to split
        train_fraction (float, optional): Fraction that goes to train. Defaults to 0.75.
    Returns:
        Tuple[pl.DataFrame, pl.DataFrame]: Tuple of train and test dataframes
    """
    df = df.with_columns(pl.all().shuffle(seed=1))
    split_index = int(train_fraction * df.height)
    df_train = df[:split_index]
    df_test = df[split_index:]
    return df_train, df_test


df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
train, test = train_test_split(df)

# this is what the above looks like:
train = pl.DataFrame({'a': [2, 3, 4], 'b': [3, 2, 1]})
test = pl.DataFrame({'a': [1], 'b': [4]})

LazyFrames, however, have an unknown height, so we have to do this another way. I have three ideas, but run into issues with each:

  1. Use df.sample(frac=train_fraction, with_replacement=False, shuffle=False). This way I could get the train part, but wouldn't be able to get the test part.
  2. Add a "random" column, where each row gets assigned a random value between 0 and 1. Then I can filter on values below my train_fraction and above train_fraction, and assign these to my train and test datasets respectively. But since I don't know the length of the dataframe beforehand, and (afaik) Polars doesn't have a native way of creating such a column, I would need to .apply an equivalent of np.random.uniform on each row, which would be very time consuming.
  3. Add a .with_row_count() and filter on rows larger than some fraction of the total, but here I also need the height, and creating the row count might be expensive.

Finally, I might be going about this the wrong way: I could count the total number of rows beforehand, but I don't know how expensive this is considered.

Here's a big dataframe to test on (running my function on it eagerly takes ~1 sec):

N = 50_000_000
df_big = pl.DataFrame(
    [
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
    ],
    schema=["a", "b", "c", "d", "e"],
)

Answer 1

Score: 2


Here is one way to do it in lazy mode, using Polars' with_row_count:

def train_test_split_lazy(
    df: pl.LazyFrame, train_fraction: float = 0.75
) -> tuple[pl.LazyFrame, pl.LazyFrame]:
    """Split a polars LazyFrame into two sets.
    Args:
        df (pl.LazyFrame): LazyFrame to split
        train_fraction (float, optional): Fraction that goes to train. Defaults to 0.75.
    Returns:
        Tuple[pl.LazyFrame, pl.LazyFrame]: Tuple of train and test LazyFrames
    """
    df = df.with_columns(pl.all().shuffle(seed=1)).with_row_count()
    df_train = df.filter(pl.col("row_nr") < pl.col("row_nr").max() * train_fraction)
    df_test = df.filter(pl.col("row_nr") >= pl.col("row_nr").max() * train_fraction)
    return df_train, df_test

Then:

N = 50_000_000
df_big = pl.DataFrame(
    [
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
    ],
    schema=["a", "b", "c", "d", "e"],
).lazy()
train, test = train_test_split_lazy(df_big)
print(train.collect())
print(test.collect())

shape: (37_500_000, 6)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ row_nr   ┆ a        ┆ b        ┆ c        ┆ d        ┆ e        │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 27454110 ┆ 27454110 ┆ 27454110 ┆ 27454110 ┆ 27454110 │
│ 1        ┆ 2309916  ┆ 2309916  ┆ 2309916  ┆ 2309916  ┆ 2309916  │
│ 2        ┆ 15065100 ┆ 15065100 ┆ 15065100 ┆ 15065100 ┆ 15065100 │
│ 3        ┆ 12766444 ┆ 12766444 ┆ 12766444 ┆ 12766444 ┆ 12766444 │
│ …        ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ 37499996 ┆ 40732880 ┆ 40732880 ┆ 40732880 ┆ 40732880 ┆ 40732880 │
│ 37499997 ┆ 32447037 ┆ 32447037 ┆ 32447037 ┆ 32447037 ┆ 32447037 │
│ 37499998 ┆ 41754221 ┆ 41754221 ┆ 41754221 ┆ 41754221 ┆ 41754221 │
│ 37499999 ┆ 7019133  ┆ 7019133  ┆ 7019133  ┆ 7019133  ┆ 7019133  │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
shape: (12_500_000, 6)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ row_nr   ┆ a        ┆ b        ┆ c        ┆ d        ┆ e        │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 37500000 ┆ 29107559 ┆ 29107559 ┆ 29107559 ┆ 29107559 ┆ 29107559 │
│ 37500001 ┆ 26750366 ┆ 26750366 ┆ 26750366 ┆ 26750366 ┆ 26750366 │
│ 37500002 ┆ 17450938 ┆ 17450938 ┆ 17450938 ┆ 17450938 ┆ 17450938 │
│ 37500003 ┆ 30333846 ┆ 30333846 ┆ 30333846 ┆ 30333846 ┆ 30333846 │
│ …        ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ 49999996 ┆ 17167194 ┆ 17167194 ┆ 17167194 ┆ 17167194 ┆ 17167194 │
│ 49999997 ┆ 9092583  ┆ 9092583  ┆ 9092583  ┆ 9092583  ┆ 9092583  │
│ 49999998 ┆ 1929693  ┆ 1929693  ┆ 1929693  ┆ 1929693  ┆ 1929693  │
│ 49999999 ┆ 35668469 ┆ 35668469 ┆ 35668469 ┆ 35668469 ┆ 35668469 │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

On my machine, I get this output in 0.455 seconds on average for 100 runs.

If I cheat and replace df.height with 50_000_000 in your version of train_test_split, and then run it in lazy mode, I get the same output in 0.446 seconds on average for 100 runs, which is equivalent in terms of performance.

huangapple
  • Published June 18, 2023 at 17:30:34
  • Please retain the original link when reposting: https://go.coder-hub.com/76499865.html