Reducing tasks to complete when creating child dataframes in Dask
Question
I am trying to understand the best practice for creating child dataframes in Dask, which I haven't found in the Dask documentation or best-practice articles.
Let's say I have a really big Dask dataframe called df, which has many tasks behind it. I then need to perform some operation on it, store the result in a child dataframe called child_df, and join child_df back to df.
When the join has completed, I need to use .compute() to get the data back into pandas and carry on with my work.
I believe that child_df will duplicate the tasks it takes to create df, so I am wondering: is there a way to create child_df without rerunning the tasks that create df? Is my thinking correct that child_df doubles the work?
This is a very simplified view of what I am trying to achieve. I understand I could call df.compute() and then build child_df from the result, but in my case that will not work, because df does not fit in memory and is only filtered down further on in the process.
Hope this makes sense.
Answer 1
Score: 1
No, you will not be duplicating the tasks. The "child" does need all the upstream tasks if you compute it alone, but when you join it back with the "parent" dataframe, Dask uses the unique keys associated with every combination of operation and arguments to calculate each intermediate result only once and reuse it as many times as necessary.
(In some cases you may genuinely get some duplication, and may in fact want it, for example if one of your workers becomes slower than the others. This technique for improving performance and parallelism is used relatively rarely, and not at all if the system is under memory pressure.)
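To illustrate the key-based sharing described above, here is a minimal pure-Python sketch. It is a toy model, not Dask's actual scheduler: the graph layout (a dict mapping each unique key to a function and the keys of its inputs) is a simplification, and the names load, total_per_id, join and compute are hypothetical stand-ins for the parent dataframe, the child operation, the join and the final .compute() call.

```python
# Toy model of a keyed task graph. This is NOT Dask's implementation;
# all function and key names here are hypothetical.
from collections import Counter

run_counts = Counter()  # how many times each task function actually ran


def load():
    """Stand-in for the expensive tasks that build the parent df."""
    run_counts["load"] += 1
    return [{"id": 1, "x": 10}, {"id": 2, "x": 20}, {"id": 1, "x": 5}]


def total_per_id(rows):
    """Stand-in for the operation that produces child_df."""
    run_counts["total_per_id"] += 1
    totals = {}
    for row in rows:
        totals[row["id"]] = totals.get(row["id"], 0) + row["x"]
    return totals


def join(rows, totals):
    """Stand-in for joining child_df back onto df."""
    run_counts["join"] += 1
    return [{**row, "total": totals[row["id"]]} for row in rows]


# Each graph maps a unique key to (function, *keys of its inputs).
df_graph = {"df": (load,)}
child_graph = {**df_graph, "child_df": (total_per_id, "df")}

# Merging the two graphs keeps a single "df" entry because the key is shared.
joined_graph = {**df_graph, **child_graph, "joined": (join, "df", "child_df")}


def compute(graph, key, cache=None):
    """Evaluate `key`, running each task at most once per call."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, *deps = graph[key]
        cache[key] = func(*(compute(graph, dep, cache) for dep in deps))
    return cache[key]


result = compute(joined_graph, "joined")
print(len(joined_graph), run_counts["load"])
```

In this sketch the merged graph contains three tasks rather than four, and load runs once even though both the child and the join depend on it, which mirrors the behaviour the answer describes for a single compute of the joined result.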