Reducing tasks to complete when creating child dataframes in Dask

Question

I am trying to understand the "best practice" for creating child dataframes in Dask, which I haven't found in the Dask documentation or best-practice articles.

Let's say I have a really big Dask dataframe called df, which has many tasks behind it. I then need to perform some operation on it, store the result in a child dataframe called child_df, and join child_df back to df.

When the join has completed, I need to call .compute() to get the data back into pandas and carry on with my work.

I believe that child_df will duplicate the tasks it takes to create df, so I am wondering: is there a way to create child_df without rerunning the tasks that create df? Is my thinking correct that child_df doubles the work?

This is a very simplified view of what I am trying to achieve. I understand I could call df.compute() and build child_df from the result, but that will not work in my case because df does not fit in memory and is only filtered down later in the process.

Hope this makes sense.

Answer 1

Score: 1

No, you will not be duplicating the tasks. The "child" does indeed need all the upstream tasks if you were to compute it alone, but when you join it back to the "parent" dataframe, Dask uses the unique keys associated with every operation and combination of arguments to calculate each intermediate result only once and reuse it as many times as necessary.
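
A minimal sketch of this key-sharing behaviour, using dask.delayed with plain Python lists rather than a real Dask dataframe so that the number of executions is easy to count. The functions load, make_child and join, and the calls counter, are hypothetical stand-ins for the expensive upstream tasks, the child operation and the join; the synchronous scheduler is chosen only so the counter lives in the same process.

```python
from dask import delayed

# Counter used only to observe how often the "expensive" step actually runs.
calls = {"load": 0}


@delayed
def load():
    # Hypothetical stand-in for the many upstream tasks that build `df`.
    calls["load"] += 1
    return list(range(10))


@delayed
def make_child(parent):
    # Hypothetical stand-in for the operation that produces `child_df`.
    return [x * 2 for x in parent]


@delayed
def join(parent, child):
    # Hypothetical stand-in for joining `child_df` back to `df`.
    return list(zip(parent, child))


df = load()                    # one lazy node with one key
child_df = make_child(df)      # depends on df's key
joined = join(df, child_df)    # depends on the same df key again

# Run in the current process so the counter is shared with the tasks.
result = joined.compute(scheduler="synchronous")

print(calls["load"])  # 1: the shared upstream task ran once
print(result[:3])     # [(0, 0), (1, 2), (2, 4)]
```

Both child_df and joined refer to the same df object, so the merged graph contains the load task under a single key and it is executed once. Calling child_df.compute() and df.compute() separately would, by contrast, run load once per call.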

(In some cases you may genuinely get some duplication, and in fact want it, for example if one of your workers becomes slower than the others. This technique for improving performance and parallelism is used relatively rarely, and not at all if the system is under memory pressure.)

huangapple
  • Published on April 19, 2023, 18:18:53
  • Link to this article: https://go.coder-hub.com/76053344.html