Reading and performing data filtering with pandas chunks: Efficient filtering on chunks with multiple criteria avoiding MemoryError?
Question
I have a large CSV file and want to read it into a DataFrame in pandas and perform operations on it. I need to do it in pandas; Dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:
import pandas as pd
import csv
import os
dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
It takes a while, but works.
Now, I want to select a subset and export this subset to a new csv file. Suppose my original file looks like this:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10
Now I want to select those records where the first two characters in Column2 are "5C" and Column3 is equal to "B".
The output should be:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10
Starting with one criterion, I first tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[chunk['Column3']=='B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
and it worked (with the same chunk size of 5000000).
However, I now have several criteria, not just one. To implement the second criterion I tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but later on. It seems that this needs far more RAM than filtering with just one criterion. I was a bit confused why the first chunks are processed without any problems while a MemoryError is thrown for later chunks (closer to the end of the file). It seems those chunks contain more rows that match the filter criteria, so more data is stored and operated on.
I therefore reduced the chunk size to 2000000 and got the same memory error again. I tried 1000000 and still got a memory error. With 500000 it worked. Now I am not sure whether it is "natural" that I have to lower the chunk size because more criteria need more computational resources, or whether my way of combining the criteria - the programming/implementation of this line: filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")] - is inefficient. So is there a more efficient way to select the data and write it out? In this example I used 2 criteria, but I will have more, up to 6, which I would continue to combine with and/or operators in the filtered line.
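For illustration, here is a sketch of an alternative way to build the same filter, using str.startswith (which produces the boolean mask directly instead of first creating a two-character substring Series) and collecting the individual criteria in a list so that further conditions can be appended. This is only a sketch using the column names from the example above; I do not know whether it actually avoids the MemoryError:
import pandas as pd
import numpy as np
import csv
import os

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=500000) as reader:
    for chunk in reader:
        # each criterion is its own boolean Series; more can be appended to the list
        masks = [
            chunk['Column3'] == 'B',
            chunk['Column2'].str.startswith('5C'),  # no intermediate .str[:2] Series
        ]
        filtered = chunk[np.logical_and.reduce(masks)]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)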
Answer 1
Score: 1
I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.
import csv
output_path = r"C:\myfolder\test.csv"
input_path = r"C:\myfolder\large_file.csv"
with open(output_path, "w", encoding="utf-8", newline="") as out_f:  # newline="" as the csv module expects
    cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(input_path, encoding="utf-8", newline="") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw.writerow(next(cr))  # copy the header row
        for line in cr:  # `line` is a list of strings
            if line[2] == "B" and line[1][:2] == "5C":
                cw.writerow(line)
You can then read the now-filtered output_path CSV file with Pandas if you need to.
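For example, a minimal sketch of reading the filtered file back into pandas afterwards (reusing the separator, decimal and dtype settings from the question):
import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
# the filtered file is much smaller, so it should now fit into memory in one go
df = pd.read_csv(r"C:\myfolder\test.csv", sep=";", encoding="utf-8", dtype=dtypes, decimal=",")
print(df.head())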


Comments