Splitting a lazyframe into two frames by fraction of rows to make a train-test split

Question


I have a train_test_split function in Polars that can handle an eager DataFrame. I wish to write an equivalent function that can take a LazyFrame as input and return two LazyFrames without evaluating them.

My function is as follows. It shuffles all rows, and then splits it using row-indexing based on the height of the full frame.

import polars as pl


def train_test_split(
    df: pl.DataFrame, train_fraction: float = 0.75
) -> tuple[pl.DataFrame, pl.DataFrame]:
    """Split polars dataframe into two sets.
    Args:
        df (pl.DataFrame): Dataframe to split
        train_fraction (float, optional): Fraction that goes to train. Defaults to 0.75.
    Returns:
        Tuple[pl.DataFrame, pl.DataFrame]: Tuple of train and test dataframes
    """
    df = df.with_columns(pl.all().shuffle(seed=1))
    split_index = int(train_fraction * df.height)
    df_train = df[:split_index]
    df_test = df[split_index:]
    return df_train, df_test


df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]})
train, test = train_test_split(df)

# this is what the above looks like:
train = pl.DataFrame({'a': [2, 3, 4], 'b': [3, 2, 1]})
test = pl.DataFrame({'a': [1], 'b': [4]})

LazyFrames, however, have an unknown height, so we have to do this another way. I have three ideas, but run into issues with each:

  1. Use df.sample(frac=train_fraction, with_replacement=False, shuffle=False). This way I could get the train part, but wouldn't be able to get the test part.
  2. Add a "random" column, where each row gets assigned a random value between 0 and 1. Then I can filter on values below my train_fraction and above train_fraction, and assign these to my train and test datasets respectively. But since I don't know the length of the dataframe beforehand, and (afaik) Polars doesn't have a native way of creating such a column, I would need to .apply an equivalent of np.random.uniform on each row, which would be very time consuming.
  3. Add a .with_row_count() and filter on rows larger than some fraction of the total, but here I also need the height, and creating the row count might be expensive.

Finally, I might be going about this the wrong way: I could count the total number of rows beforehand, but I don't know how expensive this is considered.

Here's a big dataframe to test on (running my function on it eagerly takes ~1 sec):

N = 50_000_000
df_big = pl.DataFrame(
    [
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
    ],
    schema=["a", "b", "c", "d", "e"],
)

Answer 1

Score: 2


Here is one way to do it in lazy mode, using Polars' with_row_count:

def train_test_split_lazy(
    df: pl.LazyFrame, train_fraction: float = 0.75
) -> tuple[pl.LazyFrame, pl.LazyFrame]:
    """Split a polars LazyFrame into two sets.
    Args:
        df (pl.LazyFrame): LazyFrame to split
        train_fraction (float, optional): Fraction that goes to train. Defaults to 0.75.
    Returns:
        Tuple[pl.LazyFrame, pl.LazyFrame]: Tuple of train and test LazyFrames
    """
    df = df.with_columns(pl.all().shuffle(seed=1)).with_row_count()
    df_train = df.filter(pl.col("row_nr") < pl.col("row_nr").max() * train_fraction)
    df_test = df.filter(pl.col("row_nr") >= pl.col("row_nr").max() * train_fraction)
    return df_train, df_test

Then:

N = 50_000_000
df_big = pl.DataFrame(
    [
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
        pl.arange(0, N, eager=True),
    ],
    schema=["a", "b", "c", "d", "e"],
).lazy()
train, test = train_test_split_lazy(df_big)
print(train.collect())
print(test.collect())

shape: (37_500_000, 6)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ row_nr   ┆ a        ┆ b        ┆ c        ┆ d        ┆ e        │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 0        ┆ 27454110 ┆ 27454110 ┆ 27454110 ┆ 27454110 ┆ 27454110 │
│ 1        ┆ 2309916  ┆ 2309916  ┆ 2309916  ┆ 2309916  ┆ 2309916  │
│ 2        ┆ 15065100 ┆ 15065100 ┆ 15065100 ┆ 15065100 ┆ 15065100 │
│ 3        ┆ 12766444 ┆ 12766444 ┆ 12766444 ┆ 12766444 ┆ 12766444 │
│ …        ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ 37499996 ┆ 40732880 ┆ 40732880 ┆ 40732880 ┆ 40732880 ┆ 40732880 │
│ 37499997 ┆ 32447037 ┆ 32447037 ┆ 32447037 ┆ 32447037 ┆ 32447037 │
│ 37499998 ┆ 41754221 ┆ 41754221 ┆ 41754221 ┆ 41754221 ┆ 41754221 │
│ 37499999 ┆ 7019133  ┆ 7019133  ┆ 7019133  ┆ 7019133  ┆ 7019133  │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
shape: (12_500_000, 6)
┌──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ row_nr   ┆ a        ┆ b        ┆ c        ┆ d        ┆ e        │
│ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32      ┆ i64      ┆ i64      ┆ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 37500000 ┆ 29107559 ┆ 29107559 ┆ 29107559 ┆ 29107559 ┆ 29107559 │
│ 37500001 ┆ 26750366 ┆ 26750366 ┆ 26750366 ┆ 26750366 ┆ 26750366 │
│ 37500002 ┆ 17450938 ┆ 17450938 ┆ 17450938 ┆ 17450938 ┆ 17450938 │
│ 37500003 ┆ 30333846 ┆ 30333846 ┆ 30333846 ┆ 30333846 ┆ 30333846 │
│ …        ┆ …        ┆ …        ┆ …        ┆ …        ┆ …        │
│ 49999996 ┆ 17167194 ┆ 17167194 ┆ 17167194 ┆ 17167194 ┆ 17167194 │
│ 49999997 ┆ 9092583  ┆ 9092583  ┆ 9092583  ┆ 9092583  ┆ 9092583  │
│ 49999998 ┆ 1929693  ┆ 1929693  ┆ 1929693  ┆ 1929693  ┆ 1929693  │
│ 49999999 ┆ 35668469 ┆ 35668469 ┆ 35668469 ┆ 35668469 ┆ 35668469 │
└──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘

On my machine, I get this output in 0.455 seconds on average for 100 runs.

If I cheat and replace df.height with 50_000_000 in your version of train_test_split, and then run it in lazy mode, I get the same output in 0.446 seconds on average for 100 runs, which is equivalent in terms of performance.

huangapple
  • Published June 18, 2023 at 17:30:34
  • Please retain the original link when reposting: https://go.coder-hub.com/76499865.html