Reading and filtering data with pandas chunks: Efficient filtering on chunks with multiple criteria, avoiding MemoryError?

I have a large CSV file and want to read it into a dataframe in pandas and perform operations on it. I need to do it in pandas; Dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:

import pandas as pd
import csv
import os

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}

output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        # append each chunk; the header is written only if the output file does not exist yet
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                     encoding="utf-8",
                     index=False,
                     sep=";",
                     decimal=",",
                     date_format="%d.%m.%Y",
                     quoting=csv.QUOTE_MINIMAL)

It takes a while, but works.

Now I want to select a subset and export it to a new CSV file. Suppose my original file looks like this:

Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10

Now I want to subselect those records where the first two characters in Column2 are "5C" and Column3 is equal to "B". The output should be:

Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10

Starting with one criterion, I first tried:

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[chunk['Column3'] == 'B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)

and it worked (with the same chunk size of 5000000).

However, I now have several criteria, so more than one. Starting to implement the second criterion, I tried:

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[(chunk['Column3'] == 'B') & (chunk['Column2'].str[:2] == "5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)

And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but only later on. It seems this needs far more RAM than having just one criterion. I was a bit confused that the first chunks are processed without any problems while a MemoryError is thrown for the later chunks (closer to the end of the file). It seems those chunks contain more data that matches the filter criteria, so more data is stored and operated on.
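
To check this, I can count how many rows each chunk keeps with a small diagnostic loop (a sketch of the same loop as above):

# inside the same `with pd.read_csv(...) as reader:` block as above
for i, chunk in enumerate(reader):
    filtered = chunk[(chunk['Column3'] == 'B') & (chunk['Column2'].str[:2] == "5C")]
    print(i, len(chunk), len(filtered))  # rows read vs. rows kept in this chunk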

I therefore reduced the chunk size to 2000000 and got the same memory error. I tried 1000000 and still got a memory error; only with 500000 did it work. Now I am not sure whether it is "natural" that I have to go down with the chunk size because more criteria need more computational resources, or whether my way of connecting the criteria - the line filtered = chunk[(chunk['Column3'] == 'B') & (chunk['Column2'].str[:2] == "5C")] - is inefficient. So is there a more efficient way to subselect the data and write it out? In this example I used 2 criteria, but I will have more, up to 6, and I would continue combining them with & and | operators in the filtered line.
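
For concreteness, with up to 6 criteria I would build one boolean mask per condition and combine them all at once; a sketch of what I mean (the commented-out condition on Column4 is hypothetical, and str.startswith('5C') is equivalent to str[:2] == '5C'):

import numpy as np

masks = [
    chunk['Column3'] == 'B',
    chunk['Column2'].str.startswith('5C'),
    # chunk['Column4'].str.startswith('6A'),  # hypothetical further criterion
]
filtered = chunk[np.logical_and.reduce(masks)]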

Answer 1

Score: 1

I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.

import csv

output_path = r"C:\myfolder\test.csv"
input_path = r"C:\myfolder\large_file.csv"

with open(output_path, "w", encoding="utf-8", newline="") as out_f:  # newline="" so csv.writer adds no extra blank lines on Windows
    cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(input_path, encoding="utf-8", newline="") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw.writerow(next(cr))  # copy header
        for line in cr:  # `line` is a list of strings
            if line[2] == "B" and line[1][:2] == "5C":
                cw.writerow(line)
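
With more criteria (up to 6, as the question mentions), the condition can stay readable by moving it into a small helper function; a sketch, where the extra check on the fourth column is purely illustrative:

def keep(row):
    # row holds the string fields [Column1, Column2, Column3, Column4]
    return (
        row[2] == "B"
        and row[1][:2] == "5C"
        # and row[3].startswith("6A")  # hypothetical further criterion
    )

# then, inside the loop above:
for line in cr:
    if keep(line):
        cw.writerow(line)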

You can then read the now-subset output_path CSV file with Pandas if you need to.
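
For example (a sketch reusing the separator, decimal and dtype options from the question; the filtered file should be small enough to load in one go):

import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
df = pd.read_csv(output_path, sep=";", encoding="utf-8", dtype=dtypes, decimal=",")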
