Given a row with a list too big to explode(), how do I do a preparatory explode() to bring list size down to fit into memory?
Question

I'm trying to do an `.explode` on a column and stream or sink the result to a file, but one of the lists has 300k items (6.7 million characters if combined into a string).
```python
import polars as pl
test = pl.LazyFrame({'col1': 'string ' * 1_000_000})
(test
.with_columns(explode_me=pl.col('col1').str.split(' '))
.explode(pl.col('explode_me'))
.collect(streaming=True)
.write_parquet('file.parquet')
)
```
This issue was created, but: "a single row explodes to more than fits into memory. There is not much we can do with the current architecture. At the absolute minimum, the explosion of a single row should fit."

How do I best split the oversized lists into lists with fewer items so that my later `.explode` will fit into memory? (possibly using `pl.when()`)

Basically, split the string every 50k words so I can explode to 6 rows, and then later explode 6 rows of 50k each instead of 1 row of 300k (which overloads memory).
EDIT:

My current solution:

```python
import polars as pl

test = pl.LazyFrame({'col1': 'string ' * 1_000_000})
(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(
        pl.col('explode_me').apply(
            lambda x: [x[i:i + 10_000] for i in range(0, len(x), 10_000)],
            return_dtype=pl.List(pl.List(pl.Utf8)),
        )
    )
    .select(pl.col('explode_me'))
    .explode(pl.col('explode_me'))
    .sink_parquet('file.parquet')
)
```
Answer 1

Score: 1
You can use the `list.slice` method to chunk the list into smaller lists. The code below takes your example (with only 10 strings) and chunks them into 5 columns of lists of two strings each. I save the chunk names and expressions in `chunk_cols` so you can set the chunking logic at run time with whatever logic you want.
```python
chunk_cols = {f'chunk{i}': pl.col('explode_me').list.slice(2 * i, 2) for i in range(5)}

test = pl.LazyFrame({'col1': ' '.join([f'string_{i}' for i in range(10)])})
(test
    .with_columns(explode_me=pl.col('col1').str.split(' '))
    .with_columns(**chunk_cols)
    .explode(pl.col('chunk1'))
    .collect(streaming=True)
    .select('col1', 'chunk1')
)
```

```
col1              chunk1
str               str
"string_0 strin…  "string_2"
"string_0 strin…  "string_3"
```