Question

I am trying to understand the "best practice" for creating child dataframes in Dask, which I haven't found in the Dask documentation or best-practice articles.

Let's say I have a really big Dask dataframe called df, which has many tasks behind it. I then need to perform some operation on it, store the result in a child dataframe called child_df, and join child_df back to df.

When the join has completed, I need to call .compute() to get the data back into pandas and carry on with my work.

I believe child_df will duplicate the tasks needed to create df, so I am wondering: is there a way to create child_df without rerunning the tasks that create df? Is my thinking correct that child_df doubles the work?

This is a very simplified view of what I am trying to achieve. I understand I could call df.compute() and build child_df from the result, but that will not work in my case: df does not fit in memory, and it only gets filtered down later in the process.

Hope this makes sense.

Answer 1

Score: 1

No, you will not be duplicating the tasks. The "child" does need all the upstream tasks if you compute it alone, but when you join it back to the "parent" dataframe, Dask uses the unique key associated with every combination of operation and arguments to calculate each intermediate result only once and reuse it as many times as necessary.

(In some cases you may genuinely get some duplication, and may in fact want it, for example if one of your workers becomes slower than the others. This technique for improving performance and parallelism is applied relatively rarely, and not at all if the system is under memory pressure.)
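To make the key-based sharing concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Dask's scheduler: the graph is a dict mapping each key to a function and the keys it depends on, and the names load, total, attach and compute are hypothetical helpers made up for this illustration. Both "child_df" and "joined" refer to the same "df" key, so the task behind it runs once.

```python
from collections import Counter


def load():
    # Stand-in for the expensive tasks that build df (hypothetical).
    return [1, 2, 3, 4]


def total(rows):
    # Stand-in for the operation that produces child_df (hypothetical).
    return sum(rows)


def attach(rows, value):
    # Stand-in for joining child_df back onto df (hypothetical).
    return [(row, value) for row in rows]


# Toy task graph: key -> (function, *keys of its inputs).
graph = {
    "df": (load,),
    "child_df": (total, "df"),
    "joined": (attach, "df", "child_df"),
}


def compute(graph, key, cache, runs):
    """Evaluate `key`, running each task at most once per cache."""
    if key not in cache:
        func, *deps = graph[key]
        args = [compute(graph, dep, cache, runs) for dep in deps]
        runs[key] += 1  # count how often each task actually executes
        cache[key] = func(*args)
    return cache[key]


# Computing the join: "df" is needed twice but executed once.
runs = Counter()
result = compute(graph, "joined", {}, runs)

# Computing the child alone still needs the upstream "df" task.
runs_alone = Counter()
child_only = compute(graph, "child_df", {}, runs_alone)
```

In this sketch, runs records one execution per key for the joined computation, while computing child_df on its own in a separate call reruns the df task, which mirrors the distinction the answer draws.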
