Handling Large Datasets Efficiently in Python: Pandas vs. Dask


Question


I'm working with a large dataset (over 10 GB), and my current approach with Pandas is causing memory issues. I've heard that Dask can handle larger datasets more efficiently, but I'm not sure how to get started with it or what to watch out for.

Here's an example of what I'm doing with Pandas:

import pandas as pd

df = pd.read_csv('large_dataset.csv')
df['new_column'] = df['column1'] + df['column2']
df.to_csv('updated_dataset.csv')

This works fine with smaller datasets, but with my 10GB dataset, I'm getting a MemoryError.
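One pandas-only way to avoid the MemoryError is to stream the file with the chunksize parameter of read_csv, so only one chunk of rows is in memory at a time. A sketch under the question's assumptions (the helper name add_column_chunked is illustrative; the file and column names match the snippet above):

```python
import pandas as pd

def add_column_chunked(src, dst, chunksize=100_000):
    # read_csv with chunksize yields DataFrames of at most `chunksize`
    # rows, so only one chunk is held in memory at a time.
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
        chunk['new_column'] = chunk['column1'] + chunk['column2']
        # Write the header only for the first chunk, then append.
        chunk.to_csv(dst, mode='w' if i == 0 else 'a',
                     header=(i == 0), index=False)
```

This keeps peak memory roughly proportional to the chunk size rather than the file size, at the cost of writing the output incrementally.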

I've looked into Dask and tried the following:

import dask.dataframe as dd

df = dd.read_csv('large_dataset.csv')
df['new_column'] = df['column1'] + df['column2']
df.to_csv('updated_dataset.csv')

This doesn't give a MemoryError, but it's also not giving me the expected results. I'm not sure what I'm missing.

What are the key differences in handling large datasets between Pandas and Dask?
What modifications should I make to my Dask code to get the same results as my Pandas code?

Answer 1

Score: 2


By default, dask.dataframe writes one file per partition. If you expect the same output as pandas, the relevant keyword argument is single_file, which should be set to True:

df.to_csv('updated_dataset.csv', single_file=True)

huangapple
  • Published on 2023-05-24 at 21:14:54
  • Please retain this link when reposting: https://go.coder-hub.com/76323957.html