Handling Large Datasets Efficiently in Python: Pandas vs. Dask


Question


I'm working with a large dataset (over 10 GB), and my current approach with Pandas is causing memory issues. I've heard that Dask can handle larger datasets more efficiently, but I'm not sure how to get started with it or what to watch out for.

Here's an example of what I'm doing with Pandas:

import pandas as pd

df = pd.read_csv('large_dataset.csv')
df['new_column'] = df['column1'] + df['column2']
df.to_csv('updated_dataset.csv')

This works fine with smaller datasets, but with my 10GB dataset, I'm getting a MemoryError.
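One pandas-only way to avoid the MemoryError is to stream the file with the chunksize parameter of read_csv, so only one chunk of rows is in memory at a time. A sketch under the question's assumptions (the helper name add_column_chunked is illustrative; the file and column names match the snippet above):

```python
import pandas as pd

def add_column_chunked(src, dst, chunksize=100_000):
    # read_csv with chunksize yields DataFrames of at most `chunksize`
    # rows, so only one chunk is held in memory at a time.
    for i, chunk in enumerate(pd.read_csv(src, chunksize=chunksize)):
        chunk['new_column'] = chunk['column1'] + chunk['column2']
        # Write the header only for the first chunk, then append.
        chunk.to_csv(dst, mode='w' if i == 0 else 'a',
                     header=(i == 0), index=False)
```

This keeps peak memory roughly proportional to the chunk size rather than the file size, at the cost of writing the output incrementally.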

I've looked into Dask and tried the following:

import dask.dataframe as dd

df = dd.read_csv('large_dataset.csv')
df['new_column'] = df['column1'] + df['column2']
df.to_csv('updated_dataset.csv')

This doesn't give a MemoryError, but it's also not giving me the expected results. I'm not sure what I'm missing.

What are the key differences in handling large datasets between Pandas and Dask?
What modifications should I make to my Dask code to get the same results as my Pandas code?

Answer 1

Score: 2


By default, dask.dataframe writes one file per partition. If you expect the same output as pandas, the relevant keyword argument is single_file, which should be set to True:

df.to_csv('updated_dataset.csv', single_file=True)

huangapple
  • Published on 2023-05-24 at 21:14:54
  • Please retain this link when reposting: https://go.coder-hub.com/76323957.html