Given a row with a list too big to explode(), how do I do a preparatory explode() to bring list size down to fit into memory?


Question



I'm trying to `.explode` a column and stream or sink the result to a file, but one of the lists has 300k items (6.7 million characters if joined into a single string).

```python
import polars as pl

test = pl.LazyFrame({'col1': 'string ' * 1_000_000})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .explode(pl.col('explode_me'))
 .collect(streaming=True)
 .write_parquet('file.parquet')
)
```

This issue was created, but the reply was: "a single row explodes to more than fits into memory. There is not much what we can do with the current architecture. At absolute minimum, the explosion of a single row should fit."

How do I best split the oversized lists into lists with fewer items, so that my later `.explode` fits into memory? (Possibly using `pl.when()`.)

Basically, I want to split the list every 50k words, so that a first explode yields 6 rows of 50k items each and a later explode handles those 6 rows, instead of 1 row of 300k items (which overloads memory).

EDIT:
My current solution

```python
import polars as pl

test = pl.LazyFrame({'col1': 'string ' * 1_000_000})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .with_columns(
     # Re-chunk the long list into sub-lists of at most 10_000 items each
     pl.col('explode_me').apply(lambda x: [x[i:i+10_000] for i in range(0, len(x), 10_000)],
                                return_dtype=pl.List(pl.List(pl.Utf8)))
  )
 .select(pl.col('explode_me'))
 .explode(pl.col('explode_me'))
 .sink_parquet('file.parquet')
)
```
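
The lambda in the solution above is plain Python list slicing. As a minimal standalone sketch of that chunking step (no polars involved; `chunk_list`, `flatten`, and the chunk size of 10 are hypothetical names and values chosen only for illustration), the following shows that the split is lossless and that the chunks can then be consumed one at a time:

```python
from typing import Iterator, List


def chunk_list(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive sub-lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def flatten(chunks: List[List[str]]) -> Iterator[str]:
    """Yield items chunk by chunk, similar to exploding each chunk row in turn."""
    for chunk in chunks:
        yield from chunk


# 26 items: 25 words plus the trailing empty string left by the final space
words = ("string " * 25).split(" ")
chunks = chunk_list(words, 10)  # hypothetical chunk size

print([len(c) for c in chunks])  # [10, 10, 6]
```

The last chunk is simply shorter when the list length is not a multiple of the chunk size, so no items are dropped or padded.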

Answer 1

Score: 1



You can use the `list.slice` method to chunk the list into smaller lists. The code below takes your example (with only 10 strings) and chunks it into 5 columns, each holding a list of two strings. I store the chunk names and expressions in `chunk_cols`, so you can set the chunking logic at run time however you want.

```python
chunk_cols = {f'chunk{i}': pl.col('explode_me').list.slice(2*i, 2) for i in range(5)}

test = pl.LazyFrame({'col1': ' '.join([f'string_{i}' for i in range(10)])})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .with_columns(
     **chunk_cols
 )
 .explode(pl.col('chunk1'))
 .collect(streaming=True)
 .select('col1', 'chunk1')
)
```

| col1 | chunk1 |
| --- | --- |
| str | str |
| "string_0 strin… | "string_2" |
| "string_0 strin… | "string_3" |
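
The answer hard-codes five chunks of two items each. As a rough sketch of deriving the `(offset, length)` arguments for `list.slice` from a total length and a chunk size instead: `slice_bounds` is a hypothetical helper, not part of polars, and it assumes the longest list length is known (or at least bounded) before the query is built.

```python
from typing import List, Tuple


def slice_bounds(total: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover `total` items in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        # The final chunk may be shorter than chunk_size
        (offset, min(chunk_size, total - offset))
        for offset in range(0, total, chunk_size)
    ]


bounds = slice_bounds(10, 2)
print(bounds)  # [(0, 2), (2, 2), (4, 2), (6, 2), (8, 2)]
```

These pairs could then feed the dict comprehension from the answer, for example `{f'chunk{i}': pl.col('explode_me').list.slice(off, n) for i, (off, n) in enumerate(bounds)}`. For the 300k-item list and 50k chunks from the question, `slice_bounds(300_000, 50_000)` yields six pairs.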

huangapple
  • Published on June 27, 2023, 20:01:27
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76564658.html