Reading and performing data filtering with pandas chunks: Efficient filtering on chunks with multiple criteria avoiding MemoryError?
Question
I have a large CSV file and want to read it into a DataFrame in pandas and perform operations on it. I need to do it in pandas; Dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:
import pandas as pd
import csv
import os
dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
It takes a while, but works.
Now, I want to select a subset and export this subset to a new csv file. Suppose my original file looks like this:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10
Now I want to select those records where the first two characters in Column2 are "5C" and Column3 is equal to "B".
The output should be:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10
Starting with one criterion, I first tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[chunk['Column3']=='B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
and it worked (with the same chunk size of 5000000).
However, I now have several criteria, not just one. To implement the second criterion I tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader: 
        filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                    encoding="utf-8",
                    index=False,
                    sep=";",
                    decimal=",",
                    date_format="%d.%m.%Y",
                    quoting=csv.QUOTE_MINIMAL)
And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but later on. It seems that this needs far more RAM than filtering with just one criterion. I was a bit confused why the first chunks are processed without any problems while a MemoryError is thrown for later chunks (closer to the end of the file). It seems those chunks contain more rows that match the filter criteria, so more data is stored and operated on.
I therefore reduced the chunk size to 2000000 and got the same memory error again. I tried 1000000 and still got a memory error. With 500000 it worked. Now I am not sure whether it is "natural" that I have to lower the chunk size because more criteria need more computational resources, or whether my way of combining the criteria - the programming/implementation of this line: filtered = chunk[(chunk['Column3']=='B') & (chunk['Column2'].str[:2]=="5C")] - is inefficient. So is there a more efficient way to select the data and write it out? In this example I used 2 criteria, but I will have more, up to 6, which I would continue to combine with and/or operators in the filtered line.
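For illustration, here is a sketch of an alternative way to build the same filter, using str.startswith (which produces the boolean mask directly instead of first creating a two-character substring Series) and collecting the individual criteria in a list so that further conditions can be appended. This is only a sketch using the column names from the example above; I do not know whether it actually avoids the MemoryError:
import pandas as pd
import numpy as np
import csv
import os

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=500000) as reader:
    for chunk in reader:
        # each criterion is its own boolean Series; more can be appended to the list
        masks = [
            chunk['Column3'] == 'B',
            chunk['Column2'].str.startswith('5C'),  # no intermediate .str[:2] Series
        ]
        filtered = chunk[np.logical_and.reduce(masks)]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)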
Answer 1
Score: 1
I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.
import csv
output_path = r"C:\myfolder\test.csv"
input_path = r"C:\myfolder\large_file.csv"
with open(output_path, "w", encoding="utf-8", newline="") as out_f:  # newline="" as the csv module expects
    cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(input_path, encoding="utf-8", newline="") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw.writerow(next(cr))  # copy the header row
        for line in cr:  # `line` is a list of strings
            if line[2] == "B" and line[1][:2] == "5C":
                cw.writerow(line)
You can then read the now-filtered output_path CSV file with Pandas if you need to.
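For example, a minimal sketch of reading the filtered file back into pandas afterwards (reusing the separator, decimal and dtype settings from the question):
import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
# the filtered file is much smaller, so it should now fit into memory in one go
df = pd.read_csv(r"C:\myfolder\test.csv", sep=";", encoding="utf-8", dtype=dtypes, decimal=",")
print(df.head())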


Comments