Python multiprocessing: sharing global read-only large data without reloading from disk for child processes

Question

Say I need to read a large dataset from disk and do some read-only work on it.

I need to use multiprocessing, but sharing the data across processes with multiprocessing.Manager() or Array() is way too slow. Since my operations on this large data are read-only, according to this answer I can declare it in the global scope, and then each child process has its own copy of the data in memory:

# main.py
import argparse
import numpy as np
import multiprocessing as mp
import time

parser = argparse.ArgumentParser()
parser.add_argument('-p', '--path', type=str)
args = parser.parse_args()
print('loading data from disk... may take a long time...')
global_large_data = np.load(args.path)

def worker(row_id):
    # some stuff read-only to the global_large_data
    time.sleep(0.01)
    print(row_id, np.sum(global_large_data[row_id]))

def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(worker, range(global_large_data.shape[0]))
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()

And in a terminal:

$ python3 main.py -p /path/to/large_data.npy

This is fast and almost good enough for me. However, one shortcoming is that each child process has to reload the large file from disk, and the loading can waste a lot of time.

Is there any way (e.g., a wrapper) so that only the parent process loads the file from disk once, and then directly sends a copy to each child process's memory?

Note that my memory space is abundant -- having many copies of this large data in memory is fine. I just don't want to reload it from disk many times.

Answer 1

Score: 1

I suspect you want to read the section "Contexts and start methods" in the multiprocessing documentation.

A new process is created either via spawning or forking. If spawned, then the child process is a completely new Python process, and it has to re-read everything it needs to run. If forked, the parent process creates a clone of itself.

The documentation states which start method is the default on your OS (you didn't specify), how to change the default, and which methods are available. If you can manage to use 'fork' on your machine, then after you've read the file in the parent, it will be present in all child processes.
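
To make that concrete, here is a minimal sketch of the fork-based approach, assuming a POSIX system where the 'fork' start method is available; the .npy path is a placeholder:

# fork_sketch.py -- minimal sketch, assuming 'fork' is available (POSIX only)
import multiprocessing as mp
import numpy as np

global_large_data = None  # rebound by the parent before the pool is created

def worker(row_id):
    # Read-only access; the array is inherited from the parent via fork,
    # so nothing is reloaded from disk here.
    return row_id, float(np.sum(global_large_data[row_id]))

if __name__ == '__main__':
    global_large_data = np.load('/path/to/large_data.npy')  # loaded exactly once
    ctx = mp.get_context('fork')  # raises ValueError where 'fork' is unavailable
    with ctx.Pool(mp.cpu_count()) as pool:
        for row_id, total in pool.map(worker, range(global_large_data.shape[0])):
            print(row_id, total)

Note that under 'spawn' this layout would not work at all (each child re-imports the module and sees global_large_data as None), whereas the original script loads the file at module level and therefore pays the reload cost in every spawned child.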

If you cannot use 'fork', then what you're looking for is very difficult. As the documentation says, every child process starts anew.

You are correct that you do not want to use a managed array. That means that all requests for data are routed through the main process, which then replies with the requested bytes. Yes, very slow.
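
For illustration only (not a recommendation), here is a toy sketch of that managed pattern with made-up data, showing where the round trips happen:

# Toy sketch of the managed pattern being ruled out above (illustration only).
import multiprocessing as mp

def worker(args):
    shared, row_id = args
    # Each indexing operation on the proxy is a round trip to the manager
    # process, which pickles the requested row and sends it back.
    return row_id, sum(shared[row_id])

if __name__ == '__main__':
    with mp.Manager() as manager:
        shared = manager.list([[1.0] * 1000 for _ in range(100)])  # made-up toy data
        with mp.Pool(4) as pool:
            results = pool.map(worker, [(shared, i) for i in range(100)])
        print(results[:3])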

You might consider looking at mmap. In this case, each process reads only the parts of the file that it needs rather than the whole thing. But the file itself is still on disk and has to be read.
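
If the data is stored as a .npy file, numpy can open it as a memory map directly via mmap_mode; a minimal sketch (the path and helper names are placeholders):

# Minimal mmap sketch, assuming the data is a .npy file at a placeholder path.
import multiprocessing as mp
import numpy as np

PATH = '/path/to/large_data.npy'

_data = None

def init_worker():
    # Runs once per child process: map the file instead of reading it eagerly.
    global _data
    _data = np.load(PATH, mmap_mode='r')

def worker(row_id):
    # Only the pages backing this row are read from disk (or served from the
    # OS page cache, which is shared across processes).
    return row_id, float(np.sum(_data[row_id]))

if __name__ == '__main__':
    n_rows = np.load(PATH, mmap_mode='r').shape[0]  # reads only the .npy header
    with mp.Pool(mp.cpu_count(), initializer=init_worker) as pool:
        for row_id, total in pool.map(worker, range(n_rows)):
            print(row_id, total)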

Posted by huangapple on 2023-02-14 04:36:42.
Original link: https://go.coder-hub.com/75440925.html