How to convert a large CSV file (~1 TB) into a Polars DataFrame in a shorter time?

Question


I have a very large CSV file, about 1 TB in size. I want to convert it into a Polars DataFrame, but it takes more than 15 minutes to generate the DataFrame.

How can I do this more efficiently? There is a good chance the file will grow even larger, so loading would take longer still. Is there an efficient way to reduce the computation time?

I have tried passing n_threads to Polars' read_csv function, but it did not help much; it reduced the time by only about a minute.

import polars as pl

# Read the CSV with 4 threads, skipping any rows that fail to parse
csv = pl.read_csv("path to csv file", n_threads=4, ignore_errors=True)

Answer 1

Score: 4


Polars already parses CSV files as fast as it can. It already uses all available threads, so trying to spin up separate processes will only hurt performance.
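
As a quick sanity check, you can inspect the size of the thread pool Polars allocates. A minimal sketch; threadpool_size is the function name in Polars versions current at the time of writing (it was later renamed thread_pool_size):

import polars as pl

# Polars sizes its Rust-side thread pool to the number of available cores
# by default, so read_csv is already fully parallel out of the box
print(pl.threadpool_size())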

You can set rechunk=False. That will save an expensive reallocation after reading.
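
For example, a minimal sketch of the same read with the final rechunk skipped (the file path is the placeholder from the question):

import polars as pl

# rechunk=False keeps the data in the chunks produced by the parallel
# reader instead of copying everything into one contiguous buffer
df = pl.read_csv("path to csv file", n_threads=4, ignore_errors=True, rechunk=False)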

Another thing you can try is setting the schema to smaller data types, so that you have less memory and cache pressure.
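
As a rough sketch, assuming the file had, say, columns named id, price, and qty (hypothetical names, not from the question), you could override their types via the dtypes argument of read_csv:

import polars as pl

# Hypothetical column names: replace with the file's actual schema.
# Narrower integer/float types mean less memory and cache pressure.
df = pl.read_csv(
    "path to csv file",
    dtypes={"id": pl.Int32, "price": pl.Float32, "qty": pl.Int16},
    rechunk=False,
)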
