Multiprocessing won't run
Question
I'm trying to run this multiprocessing pool and I can't figure out why it won't run. It just seems to process endlessly. I am confident the function I am calling works (I have tested it without the pool), so the error seems to be here. Any thoughts? Based on what prints, the code runs as far as the for loop.
The function it calls runs rasterstats.zonal_stats, if that matters.
if __name__ == "__main__":
#create and configure the process pool
print('inside', flush=True)
with multiprocessing.Pool(processes = multiprocessing.cpu_count()) as pool:
print('inside2', flush=True)
#prepare arguments for function as list of variables and bands
items = [(var, band) for var in clim_rasts.keys() for band in bands]
print('inside3', flush=True)
#concat the results
stime2 = time.time()
for result in pool.starmap(main_climate_task, items):
print('result', result, flush=True)
climate_data = pd.concat([climate_data, result]) )
etime2 = time.time()
dur2 = etime2-stime2
print(dur2, flush=True)
Answer 1
Score: 1
Have you considered imap?
You only need starmap if you have multiple arguments to the function. But you have only a single argument, which happens to be a tuple, so you can use the non-star methods.
Have you considered imap_unordered?
That allows each result to feed out immediately, rather than waiting to come out in the correct order.
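Note one caveat with that switch: starmap unpacks each tuple into separate parameters, while the non-star methods pass the whole tuple to the function as a single argument. So if main_climate_task currently takes var and band as two parameters, it would need to accept one (var, band) tuple instead. A minimal, self-contained sketch with a toy worker (not the real main_climate_task) showing the difference:

import multiprocessing

def task_star(var, band):
    # starmap unpacks each (var, band) tuple into two parameters
    return f'{var}-{band}'

def task_single(item):
    # the non-star methods (map, imap, imap_unordered) pass the tuple as one argument
    var, band = item
    return f'{var}-{band}'

if __name__ == "__main__":
    items = [('tmax', 1), ('tmax', 2), ('prcp', 1)]
    with multiprocessing.Pool(2) as pool:
        print(pool.starmap(task_star, items))                  # calls task_star('tmax', 1), ...
        print(list(pool.imap_unordered(task_single, items)))   # calls task_single(('tmax', 1)), ...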
Try changing
for result in pool.starmap(main_climate_task, items):
    print('result', result, flush=True)
    climate_data = pd.concat([climate_data, result])
to
for result in pool.imap_unordered(main_climate_task, items):
    print('result', result, flush=True)
    climate_data = pd.concat([climate_data, result])
If this at least starts to output results, that is good news.
In general I try to use imap_unordered whenever possible, because it makes maximal use of all your cores. Any core that finishes its work is immediately given the next item to work on, because it is always possible to write the output, rather than a core having to wait for other cores to complete earlier-listed items.
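To see that behaviour in isolation, here is a small, self-contained demo (a toy worker with sleeps standing in for the real raster work) where imap_unordered yields each result as soon as its worker finishes, regardless of submission order:

import multiprocessing
import time

def slow_task(item):
    name, delay = item
    time.sleep(delay)   # simulate uneven amounts of work per item
    return name

if __name__ == "__main__":
    items = [('slow', 2.0), ('medium', 1.0), ('fast', 0.1)]
    with multiprocessing.Pool(3) as pool:
        start = time.time()
        for result in pool.imap_unordered(slow_task, items):
            # 'fast' prints first, then 'medium', then 'slow'
            print(f'{result} finished after {time.time() - start:.1f}s', flush=True)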
If ordering of results matters
As pointed out by @9769953, you can use pool.map:
results = pool.map(main_climate_task, items)
climate_data = pd.concat(results)
I have a bias towards imap_unordered, because I like seeing the results come out "as early as possible". Moreover, you currently have the puzzle of receiving no results, i.e. you are not sure whether (a) one of your items is not finishing, (b) some of your items are not finishing, or (c) none of your items are finishing. imap_unordered makes it easy to distinguish these possibilities.
For my approach there is a little work to do if you need climate_data to be in the order of items. For example, you could build the indices into the items list:
items = [(var, band, i_var, i_band) for i_var, var in enumerate(clim_rasts.keys()) for i_band, band in enumerate(bands)]
Then arrange for main_climate_task to pass i_var and i_band back in its result, so you can re-order the result entries into the desired order.
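For example, a sketch of that re-ordering, assuming main_climate_task is changed to return an (i_var, i_band, frame) tuple (that return shape is an assumption, not something shown in the question), continuing inside the pool block:

# collect results as they arrive, then restore the original item order
results = list(pool.imap_unordered(main_climate_task, items))
results.sort(key=lambda r: (r[0], r[1]))   # sort by (i_var, i_band)
climate_data = pd.concat([frame for _, _, frame in results])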
Answer 2
Score: 0
All due thanks to @Eureka for pointing out that the environment may be the issue. I found out here (https://stackoverflow.com/questions/48846085/python-multiprocessing-within-jupyter-notebook) that multiprocessing will not run in JupyterLab on Windows. I saved my code in a .py file, ran that instead, and it works fine.
%%writefile multiprocess.py
(ALL THE CODE GOES HERE)
%run multiprocess.py
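In other words, everything, including the imports, the worker function, and the pool code under the if __name__ == "__main__": guard, has to live inside the written file. A rough sketch of how the first cell could look (the omitted definitions are placeholders, not the question's actual code):

%%writefile multiprocess.py
import multiprocessing
import time
import pandas as pd

# ... define clim_rasts, bands, climate_data and main_climate_task here ...

if __name__ == "__main__":
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        items = [(var, band) for var in clim_rasts.keys() for band in bands]
        for result in pool.starmap(main_climate_task, items):
            climate_data = pd.concat([climate_data, result])

The second cell then just runs %run multiprocess.py, as above.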