Dask map_partitions does not use all workers on the client

Question

I have a very CPU-heavy process and would like to use as many workers as possible in Dask.

When I read the CSV file with Dask's read_csv and then process the dataframe with map_partitions, only one worker is used. If I read the file with pandas' read_csv and then convert it to a Dask dataframe, all available workers are used. See the code below.

Could someone explain the difference in behavior?

Ideally, I would like to use Dask's read_csv so that I don't need the conversion step. Could anyone help me with that?

import dask.dataframe as dd
import pandas as pd

def fWrapper(x):
    p = doSomething(x.ADDRESS, param)
    return pd.DataFrame(p, columns=["ADDRESS", "DATA", "TOKEN", "CLASS"])

# only uses 1 worker instead of the available 8
dask_df = dd.read_csv(r'path\to\file')
dask_df.set_index('UID', npartitions=8, drop=False)
ddf2 = dask_df.map_partitions(fWrapper, meta={"ADDRESS": object, "DATA": object, "TOKEN": object, "CLASS": object}).compute()

# uses all 8 workers
df = pd.read_csv(r'path\to\file')
df.set_index('UID', drop=False)
dask_df2 = dd.from_pandas(df, npartitions=dask_params['df_npartitions'], sort=True)
ddf3 = dask_df2.map_partitions(fWrapper, meta={"ADDRESS": object, "DATA": object, "TOKEN": object, "CLASS": object}).compute()

Answer 1

Score: 0

The DataFrame.set_index method in both dask.dataframe and pandas returns the updated dataframe, so it must be assigned to a name. pandas has a convenient inplace keyword argument, but it is not available in dask. This means that in your snippet the first approach should look like this:

dask_df = dask_df.set_index('UID', npartitions=8, drop=False)

This ensures that the newly indexed Dask dataframe has 8 partitions, so downstream work should be distributed across multiple workers.
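
For completeness, here is a minimal sketch of the corrected first approach (doSomething, param, the path, and the UID column are placeholders taken from the question); checking npartitions confirms that map_partitions will produce 8 tasks to spread across the workers:

import dask.dataframe as dd
import pandas as pd

def fWrapper(x):
    # doSomething and param are placeholders from the question
    p = doSomething(x.ADDRESS, param)
    return pd.DataFrame(p, columns=["ADDRESS", "DATA", "TOKEN", "CLASS"])

dask_df = dd.read_csv(r'path\to\file')
# assign the result: set_index returns a new dataframe and repartitions it
dask_df = dask_df.set_index('UID', npartitions=8, drop=False)
print(dask_df.npartitions)  # should now report 8

ddf2 = dask_df.map_partitions(
    fWrapper,
    meta={"ADDRESS": object, "DATA": object, "TOKEN": object, "CLASS": object},
).compute()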

