Multiprocessing won't run

Question


I'm trying to run this multiprocessing pool and I can't figure out why it won't run; it just seems to process endlessly. I'm confident the function I'm calling works (I have tested it without the pool), so the error seems to be here. Based on what prints, the code gets as far as the for loop. Any thoughts?

The function it calls runs rasterstats.zonal_stats, if that matters.

import multiprocessing
import time
import pandas as pd

if __name__ == "__main__":
    # create and configure the process pool
    print('inside', flush=True)
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        print('inside2', flush=True)
        # prepare the function's arguments as a list of (variable, band) tuples
        items = [(var, band) for var in clim_rasts.keys() for band in bands]
        print('inside3', flush=True)
        # run the tasks and concatenate the results
        stime2 = time.time()
        for result in pool.starmap(main_climate_task, items):
            print('result', result, flush=True)
            climate_data = pd.concat([climate_data, result])
        etime2 = time.time()
        dur2 = etime2 - stime2
        print(dur2, flush=True)

Answer 1

Score: 1


Have you considered imap?

You only need starmap if the function takes multiple arguments. But you have only a single argument, which happens to be a tuple, so you can use the non-star methods.

Have you considered imap_unordered?

That allows each result to feed out immediately, rather than waiting to come out in the correct order.

Try changing

    for result in pool.starmap(main_climate_task, items):
        print('result', result, flush=True)
        climate_data = pd.concat([climate_data, result])

To

    for result in pool.imap_unordered(main_climate_task, items):
        print('result', result, flush=True)
        climate_data = pd.concat([climate_data, result])
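
One caveat: starmap unpacks each (var, band) tuple into two arguments, while imap and imap_unordered pass the whole tuple as a single argument. If main_climate_task is written as main_climate_task(var, band), it must either accept the tuple itself or be wrapped; a minimal sketch (the wrapper name main_climate_task_single is hypothetical):

    def main_climate_task_single(item):
        # imap_unordered hands over the whole (var, band) tuple,
        # so unpack it and delegate to the original two-argument worker.
        var, band = item
        return main_climate_task(var, band)

    for result in pool.imap_unordered(main_climate_task_single, items):
        print('result', result, flush=True)
        climate_data = pd.concat([climate_data, result])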

If this at least starts to output results, that is good news.

In general I try to use imap_unordered whenever possible, because it makes maximal use of all your cores: any core that finishes its work is immediately given the next item, since its result can always be emitted right away instead of waiting for other cores to finish earlier-listed items.
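
To see the ordering behavior concretely, here is a self-contained toy demo (slow_square is a made-up stand-in for a real worker, not part of the question): with imap_unordered, results print as soon as each task finishes rather than in submission order.

    import multiprocessing
    import time

    def slow_square(x):
        time.sleep(0.5 / x)  # larger inputs finish sooner here, on purpose
        return x * x

    if __name__ == "__main__":
        with multiprocessing.Pool(4) as pool:
            # Results arrive in completion order, e.g. 16, 9, 4, 1.
            for r in pool.imap_unordered(slow_square, [1, 2, 3, 4]):
                print(r)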

If ordering of results matters

As pointed out by @9769953, you can use pool.map:

results = pool.map(main_climate_task, items) 
climate_data = pd.concat(results)

I have a bias toward imap_unordered because I like seeing the results come out as early as possible. Moreover, your puzzle is that you receive no results at all, i.e. you are not sure whether (a) one of your items is not finishing, (b) some of your items are not finishing, or (c) none of your items is finishing. imap_unordered makes it easy to distinguish these possibilities.

For my approach there is a little work to do if you need climate_data to be in the order of items. For example, in items = [(var, band)..., you could add the indices with enumerate:

items = [(var, band, i_var, i_band) for i_var, var in enumerate(clim_rasts.keys()) for i_band, band in enumerate(bands)]

Then arrange for main_climate_task to pass i_var and i_band back in result, so you can re-order the result entries into the desired order; a sketch follows.
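
A minimal runnable sketch of that idea (the worker body and the inputs are placeholders; the real main_climate_task would do the zonal_stats work and return a DataFrame):

    import multiprocessing
    import pandas as pd

    def main_climate_task(args):
        # Placeholder worker: unpack the tuple and carry the indices through.
        var, band, i_var, i_band = args
        df = pd.DataFrame({"var": [var], "band": [band]})  # dummy result
        return i_var, i_band, df

    if __name__ == "__main__":
        clim_rasts = {"tmin": None, "tmax": None}  # placeholder inputs
        bands = [1, 2]
        items = [(var, band, i_var, i_band)
                 for i_var, var in enumerate(clim_rasts.keys())
                 for i_band, band in enumerate(bands)]
        with multiprocessing.Pool() as pool:
            results = list(pool.imap_unordered(main_climate_task, items))
        # Restore the original items order using the carried indices.
        results.sort(key=lambda r: (r[0], r[1]))
        climate_data = pd.concat([df for _, _, df in results], ignore_index=True)
        print(climate_data)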

Answer 2

Score: 0


All due thanks to @Eureka for pointing out that the environment may be the issue. I found out here (https://stackoverflow.com/questions/48846085/python-multiprocessing-within-jupyter-notebook) that multiprocessing will not run in JupyterLab on Windows: Windows only supports the spawn start method, where worker processes must re-import the module containing the target function, and functions defined in a notebook's __main__ cannot be imported that way. I saved my code in a .py file and ran that, and it works fine:

%%writefile multiprocess.py

(ALL THE CODE GOES HERE)

%run multiprocess.py
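
An alternative pattern that also works from a notebook on Windows is to keep just the worker function in its own importable module and drive the pool from the notebook. A minimal sketch (climate_workers.py is a hypothetical file name and the worker body is a placeholder):

    # climate_workers.py
    def main_climate_task(args):
        # Placeholder; the real worker would call rasterstats.zonal_stats.
        var, band = args
        return f"{var}-band{band}"

Then, in a notebook cell:

    import multiprocessing
    from climate_workers import main_climate_task

    with multiprocessing.Pool() as pool:
        for result in pool.imap_unordered(main_climate_task, [("tmin", 1), ("tmax", 2)]):
            print(result)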
