July 10, 2023, 17:53:59

How can I convert a large CSV file (~1 TB) into a Polars DataFrame in less time?

Question

I have a very large CSV file, about 1 TB in size. I want to load it into a Polars DataFrame, but building the DataFrame takes more than 15 minutes.

How can I do this more efficiently? The file is likely to grow, so the load could take even longer. Is there an efficient way to reduce the computation time?

I tried the n_threads parameter of the Polars read_csv function, but it did not help much: it cut the time by only about 1 minute.

import polars as pl
csv = pl.read_csv("path to csv file", n_threads=4, ignore_errors=True)

Answer 1

Score: 4

Polars already parses CSV files as fast as it can. It already uses all available threads, so spinning up separate processes will only hurt performance.

You can set rechunk=False. That will save an expensive reallocation after reading.

Another thing you can try is setting the schema to smaller data types, so that you have less memory and cache pressure.
