How to efficiently process and filter large CSV files in Python?

Question

I am working with large CSV files (several gigabytes in size) and need to process and filter the data efficiently using Python. The goal is to extract specific rows based on certain conditions and perform calculations on the selected data.

Here's the problem I am facing:

Reading the entire CSV file into memory using pandas or csv module results in high memory usage and slow processing due to the file size.
Iterating through the CSV file line by line is time-consuming, especially when applying complex filtering conditions.
The traditional approach of loading the entire file and then filtering using pandas' DataFrame operations is not feasible due to memory constraints.

I attempted to use pandas' chunksize parameter to read the CSV file in smaller chunks, but it still consumes a significant amount of memory and doesn't provide the desired performance improvement.
I explored using Dask to parallelize the processing of large CSV files, but I'm uncertain about the optimal approach and how to efficiently implement filtering operations.
I considered using a database management system like SQLite, but I'm unsure if it would be a suitable solution for this particular task.
I would greatly appreciate any insights, suggestions, or alternative approaches on how to handle and efficiently process large CSV files in Python, specifically focusing on memory-efficient methods for filtering and extracting relevant data.

Thank you for your help and expertise!

Answer 1

Score: 1

Polars is a Pandas alternative that comes with its own, much faster, CSV parser. Because Polars internally uses Arrow, you might see a reduction in memory usage too.
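
For reference, a minimal sketch of how a memory-conscious filter might look with Polars' lazy API; the file name huge_data.csv and the columns category and value are hypothetical placeholders, not something from the original question:

import polars as pl

# scan_csv is lazy: nothing is loaded yet, and Polars can push the filter
# down into the scan so that only matching rows are materialized.
result = (
    pl.scan_csv("huge_data.csv")
    .filter((pl.col("category") == "A") & (pl.col("value") > 100))
    .collect()  # executes the query plan
)
print(result.shape)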

If a one-off conversion to another format is acceptable, try converting the CSV to Parquet. Both Pandas and Polars can read this out of the box, and it's one to two orders of magnitude faster than CSV.
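
A sketch of that one-off conversion, assuming a reasonably recent Polars version with the streaming sink_parquet call (the file and column names are placeholders; pandas' to_parquet or pyarrow would also work if a full in-memory load is acceptable):

import polars as pl

# Convert once: stream the CSV into a Parquet file without holding
# the whole dataset in memory at the same time.
pl.scan_csv("huge_data.csv").sink_parquet("huge_data.parquet")

# Later queries then run against the Parquet file, which scans much faster.
filtered = (
    pl.scan_parquet("huge_data.parquet")
    .filter(pl.col("value") > 100)  # hypothetical condition
    .collect()
)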

Answer 2

Score: 1

There are two solutions that come to mind right now:

  1. Use chunksize with Pandas (see the chunked-filtering sketch below)

pd.read_csv(data, chunksize=1000)  # data: path to the CSV file; returns an iterator of DataFrame chunks rather than one large DataFrame

  2. Use Dask with Pandas (see the Dask filtering sketch below)
from dask import dataframe as dd
import time  # needed for the timing below

start = time.time()
dask_df = dd.read_csv('huge_data.csv')  # lazy: this builds a task graph, it does not load the data yet
end = time.time()
print("Read csv with dask: ",(end-start),"sec")
Read csv with dask:  0.07900428771972656 sec
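
To tie option 1 back to the filtering use case from the question, here is a minimal chunked-filtering sketch that keeps only the matching rows from each chunk; the file name, column names, and condition are made-up placeholders:

import pandas as pd

filtered_parts = []
# Only ~1000 rows are held in memory at a time.
for chunk in pd.read_csv("huge_data.csv", chunksize=1000):
    # Keep the rows that satisfy the (hypothetical) condition.
    filtered_parts.append(chunk[(chunk["category"] == "A") & (chunk["value"] > 100)])

filtered = pd.concat(filtered_parts, ignore_index=True)
print(len(filtered), "matching rows")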
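
And a Dask filtering sketch showing how option 2 could express the same filter; dd.read_csv is lazy, so the real work (parallel reading and filtering) only happens at compute(). Column names are again placeholders:

from dask import dataframe as dd

# Build a lazy task graph over the CSV; partitions are processed in parallel.
dask_df = dd.read_csv("huge_data.csv")

# compute() triggers the actual read + filter and returns a regular
# pandas DataFrame containing only the matching rows.
filtered = dask_df[(dask_df["category"] == "A") & (dask_df["value"] > 100)].compute()
print(len(filtered), "matching rows")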
