June 27, 2023, 20:01:27


Given a row with a list too big to explode(), how do I do a preparatory explode() to bring list size down to fit into memory?

Question

I'm trying to run `.explode` on a column and stream or sink the result to a file, but one of the lists has 300k items (6.7 million characters if joined into a string).

```python
import polars as pl

# One row holding a single very long string (wrapped in a list so it is one row)
test = pl.LazyFrame({'col1': ['string ' * 1_000_000]})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .explode(pl.col('explode_me'))
 .collect(streaming=True)
 .write_parquet('file.parquet')
)
```

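An issue was opened for this, but the response was: "a single row explodes to more than fits into memory. There is not much what we can do with the current architecture. At absolute minimum, the explosion of a single row should fit."

How do I best split the oversized lists into lists with fewer items so that my later `.explode` fits into memory? (Possibly using `pl.when()`.)

Basically, I want to split the string every 50k words so that I can first explode to 6 rows and then later explode 6 rows of 50k each, instead of 1 row of 300k (which overloads memory).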
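To make the arithmetic concrete, here is a minimal plain-Python sketch of the two-stage idea, using a 12-word stand-in for the 300k-word string. The helper name `chunk_words` is hypothetical and not part of Polars; it only illustrates how fixed-size chunking followed by flattening preserves every item in order.

```python
from itertools import chain

def chunk_words(text, chunk_size):
    """Split text on single spaces and group the words into lists of at most chunk_size."""
    words = text.split(' ')
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Small stand-in for the oversized string: 12 words, 5 words per chunk
text = ' '.join(f'w{i}' for i in range(12))
chunks = chunk_words(text, 5)

# First "explode": one row per chunk
print(len(chunks))                # 3
print([len(c) for c in chunks])   # [5, 5, 2]

# Second "explode": flattening the chunks recovers every word in order
flat = list(chain.from_iterable(chunks))
print(flat == text.split(' '))    # True
```

With 300k words and a chunk size of 50k, the same helper would yield the 6 intermediate rows described above.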

EDIT:
My current solution

```python
import polars as pl

test = pl.LazyFrame({'col1': ['string ' * 1_000_000]})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .with_columns(
     # Regroup the flat word list into sublists of at most 10,000 items.
     # (In newer Polars releases, Expr.apply is named map_elements.)
     pl.col('explode_me').apply(
         lambda x: [x[i:i + 10_000] for i in range(0, len(x), 10_000)],
         return_dtype=pl.List(pl.List(pl.Utf8)))
 )
 .select(pl.col('explode_me'))
 .explode(pl.col('explode_me'))
 .sink_parquet('file.parquet')
)
```



Answer 1

Score: 1

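You can use the `list.slice` method to chunk the list into smaller lists. The code below takes your example (with only 10 strings) and chunks it into 5 columns, each holding a list of two strings. I keep the chunk names and expressions in `chunk_cols` so that you can set the chunking logic at run time however you like.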

```python
import polars as pl

# chunk0..chunk4: each expression takes 2 items starting at offset 2*i
chunk_cols = {f'chunk{i}': pl.col('explode_me').list.slice(2 * i, 2) for i in range(5)}

test = pl.LazyFrame({'col1': [' '.join([f'string_{i}' for i in range(10)])]})
(test
 .with_columns(explode_me=pl.col('col1').str.split(' '))
 .with_columns(
     **chunk_cols
 )
 .explode(pl.col('chunk1'))
 .collect(streaming=True)
 .select('col1', 'chunk1')
)
```

```
col1               chunk1
str                str
"string_0 strin…   "string_2"
"string_0 strin…   "string_3"
```
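The answer hard-codes 5 chunks of 2 items. As a rough sketch of how those numbers might be derived for an arbitrary list length, the hypothetical helper `chunk_bounds` below (not a Polars function) computes the `(offset, length)` pairs in plain Python, including a shorter final chunk when the length does not divide evenly.

```python
def chunk_bounds(n_items, chunk_size):
    """Return (offset, length) pairs that cover n_items in order."""
    return [(offset, min(chunk_size, n_items - offset))
            for offset in range(0, n_items, chunk_size)]

# Same shape as the answer's example: 10 items, 2 per chunk
bounds = chunk_bounds(10, 2)
print(bounds)  # [(0, 2), (2, 2), (4, 2), (6, 2), (8, 2)]

# Uneven case: the last chunk is shorter
print(chunk_bounds(7, 3))  # [(0, 3), (3, 3), (6, 1)]

# Sanity check against plain Python slicing: the second chunk matches 'chunk1'
items = [f'string_{i}' for i in range(10)]
print([items[o:o + n] for o, n in bounds][1])  # ['string_2', 'string_3']
```

Each pair could then be passed to `pl.col('explode_me').list.slice(offset, length)` in a dict comprehension like the answer's `chunk_cols`, assuming the maximum list length is known before the lazy query is built.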


