Refactoring pandas using an iterator via chunksize

Question

I am looking for advice on using a pandas iterator.

I performed a parsing operation using Python pandas, but the size of the input files (output from a bioinformatics program called eggNOG) is causing a 'RAM bottleneck': it simply cannot process the file.

The obvious solution is to shift to an iterator, which for pandas is the chunksize option:

import pandas as pd
import numpy as np

df = pd.read_csv('myinfile.csv', sep="\t", chunksize=100)

What's changed from the original code is the chunksize=100 part, which forces an iterator.

The next step is just to perform a simple operation: drop a few columns, replace all '-' characters with np.nan, and then write out the whole file.

df.drop(['score', 'evalue', 'Description', 'EC', 'PFAMs'], axis=1).replace('-', np.nan)
df.to_csv('my.csv', sep='\t', index=False)

How is this done under a pandas iterator?


Update

The solution is described in the answers below and comprises two components:

  1. Don't load junk at the source: I was loading lots of junk columns and then deleting them (not good).
  2. Leverage open outside the chunking loop. This keeps the output file open while each chunk is written, and closes it after the last chunk.

The outfile contained duplicates. This is inevitable because entries are split across different chunks; these were removed, i.e. reduced, via:

df = df.groupby(['compoundIndex'])['Frequency'].sum().to_frame()

This resulted in an outfile identical to the one from the non-iterator method, and by adjusting chunksize any level of "RAM bottleneck" can be overcome. The actual code is an OO module with reasonably complex parsing, and the code below fitted straight in.

Cool.
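
For reference, here is a minimal sketch of how the chunked read can be combined with the groupby de-duplication without holding the full table in memory. It assumes, purely for illustration, that each chunk already exposes the compoundIndex and Frequency columns used in the groupby above (in the real module they come out of the parsing step); per-chunk sums are accumulated and summed again at the end, which gives the same totals as a single groupby because sums are associative.

import pandas as pd

partials = []  # one small Series of per-chunk group sums
for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
    # collapse duplicates within the chunk first
    partials.append(chunk.groupby('compoundIndex')['Frequency'].sum())

# duplicates that straddled chunk boundaries collapse here
df = pd.concat(partials).groupby(level=0).sum().to_frame()
df.to_csv('my.csv', sep='\t')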

Answer 1

Score: 2

IIUC, you can do:

cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
data = []
for chunk in pd.read_csv('myinfile.csv', sep='\t', na_values='-', chunksize=100):
    chunk = chunk.drop(columns=cols_to_drop)
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)

If you know the columns you want to keep instead of which ones you want to drop, use:

cols_to_keep = ['col1', 'col2', 'col3']
data = []
for chunk in pd.read_csv('myinfile.csv', usecols=cols_to_keep, sep='\t', na_values='-', chunksize=100):
    data.append(chunk)
pd.concat(data).to_csv('my.csv', sep='\t', index=False)

Alternative inspired by @el_oso:

cols_to_drop = ['score', 'evalue', 'Description', 'EC', 'PFAMs']
with (open('myinfile.csv') as inp,
      open('my.csv', 'w') as out):
    # strip the trailing newline so the last column name matches cols_to_drop
    headers = inp.readline().rstrip('\n').split('\t')
    out.write('\t'.join([col for col in headers if col not in cols_to_drop]) + '\n')
    for chunk in pd.read_csv(inp, header=None, names=headers, sep='\t', na_values='-', chunksize=100):
        chunk = chunk.drop(columns=cols_to_drop)
        chunk.to_csv(out, sep='\t', index=False, header=False)

Answer 2

Score: 1

If you're already having memory issues reading the file, the code below will actually read and write to the new file in chunks.

cols_to_keep = ['col1', 'col2', 'col3']  # a list keeps the header order predictable
df = pd.read_csv('myinfile.csv', sep='\t', usecols=cols_to_keep, chunksize=10)
with open('my.csv', 'w') as f:
    f.write('\t'.join(cols_to_keep) + '\n')
    for chunk in df:
        # do your processing here, appending that chunk to the file
        chunk[cols_to_keep].to_csv(f, header=False, index=False, sep='\t')

f is an _io.TextIOWrapper, i.e. a stream you can keep writing to in small, memory-friendly chunks.
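
A related pattern, sketched here with the same placeholder columns, is to let to_csv append with mode='a' instead of holding the file handle open yourself; the header is written only with the first chunk:

import pandas as pd

cols_to_keep = ['col1', 'col2', 'col3']
first = True
for chunk in pd.read_csv('myinfile.csv', sep='\t', usecols=cols_to_keep, chunksize=10):
    # write the header with the first chunk, then append without it
    chunk.to_csv('my.csv', mode='w' if first else 'a', header=first, index=False, sep='\t')
    first = False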
