How to efficiently process and filter large CSV files in Python?

Question

I am working with large CSV files (several gigabytes in size) and need to process and filter the data efficiently using Python. The goal is to extract specific rows based on certain conditions and perform calculations on the selected data.

Here's the problem I am facing:

Reading the entire CSV file into memory with pandas or the csv module results in high memory usage and slow processing due to the file size.
Iterating through the CSV file line by line is time-consuming, especially when applying complex filtering conditions.
The traditional approach of loading the entire file and then filtering with pandas' DataFrame operations is not feasible due to memory constraints.

I attempted to use pandas' chunksize parameter to read the CSV file in smaller chunks, but it still consumes a significant amount of memory and doesn't provide the desired performance improvement.
I explored using Dask to parallelize the processing of large CSV files, but I'm uncertain about the optimal approach and how to implement the filtering operations efficiently.
I considered using a database management system like SQLite, but I'm unsure whether it would be a suitable solution for this particular task.
I would greatly appreciate any insights, suggestions, or alternative approaches for handling and efficiently processing large CSV files in Python, with a particular focus on memory-efficient methods for filtering and extracting the relevant data.

Thank you for your help and expertise!
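For reference, the chunk-based filtering pattern mentioned above (pandas' chunksize) looks roughly like the sketch below; the file name, column name, and threshold are placeholders rather than details from the original question.

import pandas as pd

filtered_parts = []
# Stream the file 100,000 rows at a time instead of loading it all at once
for chunk in pd.read_csv("huge_data.csv", chunksize=100_000):
    # keep only the rows matching the (placeholder) condition
    filtered_parts.append(chunk[chunk["value"] > 100])

result = pd.concat(filtered_parts, ignore_index=True)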


Answer 1

Score: 1

Polars is a Pandas alternative that comes with its own, much faster, CSV parser. Because Polars internally uses Arrow, you might see a reduction in memory usage too.
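A minimal sketch of that route, using Polars' lazy CSV scanner so that filtering happens during the scan; the file name, column names, and filter condition below are placeholders, not part of the original answer.

import polars as pl

# scan_csv is lazy: the predicate and column selection are pushed down into the
# scan, so non-matching rows and unused columns are never fully materialized
result = (
    pl.scan_csv("huge_data.csv")
    .filter(pl.col("value") > 100)   # placeholder filter condition
    .select(["id", "value"])         # placeholder columns to keep
    .collect()
)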

If a one-off conversion to another format is acceptable, try converting the CSV to Parquet. Both Pandas and Polars can read it out of the box, and it's one to two orders of magnitude faster than CSV.
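As a rough illustration of that one-off conversion (file and column names are assumptions, and pandas needs a Parquet engine such as pyarrow installed), Polars can stream the CSV into a Parquet file without loading it all at once:

import polars as pl
import pandas as pd

# Stream the CSV into Parquet instead of reading it fully into memory
pl.scan_csv("huge_data.csv").sink_parquet("huge_data.parquet")

# The Parquet file can then be read selectively and much faster, e.g. with pandas
df = pd.read_parquet("huge_data.parquet", columns=["id", "value"])  # placeholder columns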


Answer 2

Score: 1

There are two solutions that come to mind right now:

  1. Use chunksize with Pandas, which makes read_csv return an iterator of smaller DataFrames instead of one huge one:

pd.read_csv(data, chunksize=1000)

  2. Use Dask with Pandas:

import time
from dask import dataframe as dd

start = time.time()
# dd.read_csv is lazy: it sets up the parallel read but does not load the data yet
dask_df = dd.read_csv('huge_data.csv')
end = time.time()
print("Read csv with dask:", (end - start), "sec")

Output:

Read csv with dask:  0.07900428771972656 sec

huangapple
  • Posted on 2023-06-01 02:04:27
  • Please keep this link when reposting: https://go.coder-hub.com/76376219.html