Reading and performing data filtering with pandas chunks: Efficient filtering on chunks with multiple criteria avoiding MemoryError?

Question
I have a large CSV file that I want to read into a pandas DataFrame and perform operations on. I need to do it in pandas; Dask is unfortunately not an option. Therefore I process the CSV in chunks. This works, for example, with a chunk size of 5000000:
import pandas as pd
import csv
import os

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        chunk.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                     encoding="utf-8",
                     index=False,
                     sep=";",
                     decimal=",",
                     date_format="%d.%m.%Y",
                     quoting=csv.QUOTE_MINIMAL)
It takes a while, but works.
Now, I want to select a subset and export this subset to a new CSV file. Suppose my original file looks like this:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BD;4L432;B;6A60
CD;5CDES;C;6A40
BF;5CLES;B;6A10
Now I want to subselect those records where the first two characters in Column2 are "5C" and Column3 is equal to "B".
The output should be:
Column1;Column2;Column3;Column4;
AB;5CDES;B;6A70
BF;5CLES;B;6A10
Starting with one criterion, I first tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        filtered = chunk[chunk['Column3'] == 'B']
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)
and it worked (with the same chunk size of 5000000).
However, I now have several criteria, so more than one. Starting to implement the second criterion, I tried:
with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8", dtype=dtypes, decimal=",", chunksize=5000000) as reader:
    for chunk in reader:
        filtered = chunk[(chunk['Column3'] == 'B') & (chunk['Column2'].str[:2] == "5C")]
        filtered.to_csv(output_path, mode='a', header=not os.path.exists(output_path),
                        encoding="utf-8",
                        index=False,
                        sep=";",
                        decimal=",",
                        date_format="%d.%m.%Y",
                        quoting=csv.QUOTE_MINIMAL)
And now comes the problem: it throws a memory error (MemoryError: Unable to allocate...). The error does not occur with the first chunk, but only later on. It seems that this needs far more RAM than having just one criterion. I was a bit confused why the first chunks are processed without any problem while a MemoryError is thrown for the later chunks (closer to the end of the file). It seems like those chunks contain more data that matches the filter criteria, so more data is stored and operated on.
I therefore reduced the chunk size to 2000000 and got the same memory error. I tried 1000000 and still got a memory error. Only with 500000 did it work. Now I am not sure whether it is "natural" that I have to reduce the chunk size, because with more criteria the computational resources increase, or whether my way of connecting the criteria - the programming/implementation - in this line:
filtered = chunk[(chunk['Column3'] == 'B') & (chunk['Column2'].str[:2] == "5C")]
is inefficient. So is there a more efficient way to subselect the data and output it? In this example I used 2 criteria, but I will have more, up to 6. I would continue combining them with and/or operators in the filtered line, for example as in the sketch below.
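For illustration, a sketch of what that might look like with a third criterion added; the Column4 condition is a made-up placeholder, and the mask is built once per chunk before indexing:

import csv
import os
import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}
output_path = r'C:\myfolder\test.csv'

with pd.read_csv(r'C:\myfolder\large_file.csv', sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",", chunksize=500000) as reader:
    for chunk in reader:
        # Each comparison yields a boolean Series; & combines them element-wise.
        mask = (
            (chunk['Column3'] == 'B')
            & (chunk['Column2'].str[:2] == "5C")
            & (chunk['Column4'].str[:2] == "6A")  # hypothetical extra criterion
        )
        chunk[mask].to_csv(output_path, mode='a',
                           header=not os.path.exists(output_path),
                           encoding="utf-8", index=False, sep=";",
                           decimal=",", quoting=csv.QUOTE_MINIMAL)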
Answer 1

Score: 1
I think you'll be happier doing this streaming preprocessing without Pandas at all, with just the built-in csv module.
import csv

output_path = r"C:\myfolder\test.csv"
input_path = r"C:\myfolder\large_file.csv"

# newline="" is the recommended way to open files for the csv module;
# it avoids spurious blank lines on Windows.
with open(output_path, "w", encoding="utf-8", newline="") as out_f:
    cw = csv.writer(out_f, delimiter=";", quoting=csv.QUOTE_MINIMAL)
    with open(input_path, encoding="utf-8", newline="") as in_f:
        cr = csv.reader(in_f, delimiter=";")
        cw.writerow(next(cr))  # copy the header row
        for line in cr:  # `line` is a list of strings
            if line[2] == "B" and line[1][:2] == "5C":
                cw.writerow(line)
You can then read the now-filtered output_path CSV file with Pandas if you need to.
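For example, a minimal sketch of that follow-up read, reusing the dialect and dtypes from the question:

import pandas as pd

dtypes = {"Column1": str, "Column2": str, "Column3": str, "Column4": str}

# The filtered output is much smaller, so it may now fit into memory in one pass.
df = pd.read_csv(r"C:\myfolder\test.csv", sep=";", encoding="utf-8",
                 dtype=dtypes, decimal=",")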