How to use read-only, shared memory (as NumPy arrays) in multiprocessing

Question

Following the documentation for shared memory here, I have implemented a minimal example of accessing NumPy arrays backed by shared memory in a function called by a worker process in a pool. My assumption is that this code should produce minimal memory overhead for each additional worker (there is some overhead to copy the interpreter and non-shared variables, but the 16 GB of shared memory should not be copied).

import numpy as np
from multiprocessing import Pool, shared_memory
from itertools import product
from tqdm import tqdm

if __name__ == "__main__":
    a_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
    a = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=a_shared_memory.buf)
    b_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
    b = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=b_shared_memory.buf)

    def test_func(args):
        a[args] + b[args[:-1]]

    with tqdm(total=20 * 100 * 100 * 100) as pbar:
        with Pool(16) as pool:
            for _ in pool.imap_unordered(test_func,
                                         product(range(20), range(100), range(100), range(100)),
                                         chunksize=16):
                pass

However, in practice when running this code memory usage grows in each process over time, both in the RES memory metric as well as the SHR memory metric as reported by top. (The rate of accumulation of memory can be modified with the size of the arrays being selected inside the test_func function.)

This behavior is confusing to me – these arrays are in shared memory, and I would therefore assume that a view of them shouldn't incur any memory allocation (I am testing on Linux, so with reads alone no copying should occur). Further, I don't even store the results of this computation anywhere, so it is unclear why memory is being allocated.

Two further notes:

  1. According to this answer, even reading / accessing an array from shared memory will force a copy-on-write, since the refcount must be updated. However, this should only affect the header memory page, which should be about 4 KB. Why does memory continue to grow?

  2. If I simply change the code in the following way:

def test_func(args):
    a[args], b[args[:-1]]

the issues resolve – there is no memory overhead (i.e. the memory is shared) and no growing memory allocation over time.

I've tried to present the simplest, most intuitive application of the documentation to multiprocessing with shared memory, yet it remains very unclear to me how and why it isn't working as expected. I would like to perform some simple calculations in test_func, including taking views of the shared memory, addition, matrix-vector multiplication, etc. Any help in getting a better grasp of how to use shared memory correctly would be much appreciated.

Update:
When I change the test_func code to a[0, 0, 0, 0] + b[0, 0, 0] the issue disappears. Does this mean that there are reference counters scattered through the middle of the NumPy arrays, such that when args changes, different parts of the array are accessed and memory increases, but if the indices are always the same, the memory doesn't increase?

Answer 1

Score: 3

> However, in practice when running this code memory usage grows in each process over time, both in the RES memory metric as well as the SHR memory metric as reported by top.

This is normal, but it is not due to a copy, nor to any allocation done by the interpreter. It is due to page faults and virtual memory. The shared memory buffer is created and gets a virtual address space, but an operating system (OS) like Linux does not immediately map it to physical RAM. Reserving the physical space up front would be wasteful, since many applications allocate space they never fully use, or at least not right away. Instead, Linux maps a virtual page to a physical page on first touch, that is, on the first read or write of that page. For security reasons, Linux fills newly mapped pages with zeros, even when you only read them (the reclaimed RAM might otherwise still contain passwords from other sensitive applications such as your browser). The growing memory usage is simply pages being progressively zero-filled and mapped to physical memory as they are touched.
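
To make this first-touch behaviour visible, here is a small Linux-only sketch of my own (the sizes and the rss_mb helper are arbitrary choices; resource.getrusage reports the peak RSS in KiB on Linux): the resident set size barely moves when the block is created, and grows only as pages are actually written.

import resource
import numpy as np
from multiprocessing import shared_memory

def rss_mb():
    # Peak resident set size of the current process, in MiB (Linux reports KiB).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

shm = shared_memory.SharedMemory(create=True, size=1_000_000_000)    # ~1 GB of virtual space
arr = np.ndarray((250_000_000,), dtype=np.float32, buffer=shm.buf)
print("after create:", round(rss_mb()), "MiB")         # pages not mapped yet

arr[:50_000_000] = 1.0                                 # first touch of ~200 MB of pages
print("after touching 200 MB:", round(rss_mb()), "MiB")

arr[:] = 1.0                                           # touch every page of the block
print("after touching all:", round(rss_mb()), "MiB")

del arr
shm.close()
shm.unlink()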

If you do not want this to happen, you can simply fill the arrays with zeros manually, using a.fill(0) and b.fill(0), before the multiprocessing-based computation. On my Linux machine, this reserves the space in physical memory up front, and no additional space is reserved after that.
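
Applied to the question's script, this only means touching every page once in the parent before the pool is created; the relevant lines would look like this (a sketch assuming the rest of the script stays unchanged):

a_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
a = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=a_shared_memory.buf)
b_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
b = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=b_shared_memory.buf)

# Touch every page once so the whole mapping is backed by physical memory
# up front; the cost is paid here instead of gradually inside the workers.
a.fill(0)
b.fill(0)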

Note that Linux is just one example of an operating system doing this; Windows behaves quite similarly (and, AFAIK, macOS too). Also note that some (rare) systems are configured to map memory to physical pages immediately for the sake of performance (e.g. some game platforms and HPC systems).

> When I change the test_func code to a[0, 0, 0, 0] + b[0, 0, 0] the issue disappears.

This is because only the first page of the shared memory buffer is read, causing a first touch on that page alone (so only that page is mapped to physical memory). The other pages are left untouched, so they exist only in virtual memory, not physical memory. At least, that is the case on mainstream systems like yours and mine.

> According to this answer, even reading / accessing an array from shared memory will force a copy-on-write, since the refcount must be updated. However, this should only affect the header memory page, which should be about 4 KB. Why does memory continue to grow?

This is broadly true. However, the NumPy arrays do not hold the buffer themselves, so the reference counting does not touch the shared buffer but only the NumPy array objects, which are in fact views of the shared buffer. In practice, NumPy arrays are always views (although the internal buffer associated with a given array may not be shared with any other instance). NumPy is responsible for allocating and collecting the internal buffer when needed (except for buffers like this one, which are not owned by NumPy).
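
A quick check (my own illustration, with a tiny buffer) showing that an indexing result is a view that owns no data of its own, so only the small Python object is reference-counted, not the shared buffer:

import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=4_000)
a = np.ndarray((10, 100), np.float32, buffer=shm.buf)

v = a[3]                         # basic indexing returns a view, not a copy
print(v.flags['OWNDATA'])        # False: v owns no buffer of its own
print(np.shares_memory(v, a))    # True: it points into the shared block

del v, a                         # only small ndarray objects are collected;
shm.close()                      # the shared block itself is untouched until
shm.unlink()                     # close/unlink are called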

> If I simply change the code in the following way: [...] a[args], b[args[:-1]] [...] the issues resolve.

This is expected, but a bit tricky to understand, since it combines the magic of the OS with that of NumPy. Indeed, a[args] and b[args[:-1]] are NumPy views, so they do not read the memory of the shared buffer unless you actually read the content of the view (which is not done here). If you write a[args][0] instead, the memory is read and memory consumption appears to grow. The same is true for any NumPy function that reads or writes the data behind the a and b views, such as np.sum(a[args]).
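
In other words, the modified test_func builds view objects but never dereferences them, while anything that actually consumes the elements will fault pages in. A sketch of the distinction, assuming the a and b arrays from the question (the function names are mine):

def test_func_lazy(args):
    # Two view objects are created, but no element of the shared buffers is
    # read, so no page of the 16 GB is faulted into physical memory.
    a[args], b[args[:-1]]

def test_func_touch(args):
    # Reading even one element, or reducing the view, dereferences it, so the
    # corresponding pages are mapped on first touch and RES/SHR grow.
    _ = a[args][0]
    _ = np.sum(b[args[:-1]])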


Important general notes

Note that the mapped shared memory must be released: call close in every process that attached to a block, and unlink once from the main process. This is critical to avoid leaking the system resource.
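
For the arrays from the question, the cleanup would look roughly like this (a sketch; dropping the NumPy views before close avoids a BufferError about exported buffers):

# In every process that attached to the blocks (workers included):
del a, b                     # drop the NumPy views of the buffers first
a_shared_memory.close()
b_shared_memory.close()

# Once, in the process that owns the blocks (here the main process):
a_shared_memory.unlink()
b_shared_memory.unlink()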

To prove that the shared buffer is truly shared, one can run the following (POSIX-only) program:

import numpy as np
from multiprocessing import shared_memory
import os

a_shared_memory = shared_memory.SharedMemory(create=True, size=8_000_000_000)
a = np.ndarray((20, 100, 100, 100, 100), np.float32, buffer=a_shared_memory.buf)
a[0, 0, 0, 0, 0] = 42
print('from init:', a[0, 0, 0, 0, 0])

pid = os.fork()
if pid: # parent
    os.wait()                                  # wait for the child to finish
    print('from parent (after the wait):', a[0, 0, 0, 0, 0])
    del a                                      # drop the view before closing the buffer
    a_shared_memory.close()
    a_shared_memory.unlink()                   # the owner removes the block
else: # child
    print('from child (before):', a[0, 0, 0, 0, 0])
    a[0, 0, 0, 0, 0] = 815                     # the parent will see this write
    print('from child (after):', a[0, 0, 0, 0, 0])
    exit()

Which prints:

from init: 42.0
from child (before): 42.0
from child (after): 815.0
from parent (after the wait): 815.0

Note on the portability of the code

Your code does not run on my machine, which runs Windows. It turns out you made assumptions that are, at the very least, not portable and certainly non-standard. For example, test_func should not be accessible from the sub-processes, since it is defined inside the main section and the sub-processes do not execute that section; as a result, there is an error. On Linux, and more generally on POSIX platforms, processes are created with the fork system call. Forked processes are almost identical to their parent (like two cells after a division): from the interpreter's point of view they have a very similar memory state, so the child processes have a and b defined in their environment, as well as a_shared_memory and b_shared_memory. Relying on this is non-standard, but it works on POSIX. I think SharedMemoryManager should be used in this context. Alternatively, you can pass the names of the shared memory blocks so the child processes can attach to them without going through global variables (relying on globals is a very bad practice in software engineering).
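
For example, one portable way (a sketch under those assumptions; init_worker, SHAPE and the overall structure are my own choices, not the only option) is to pass the auto-generated block names to each worker and re-attach by name in a Pool initializer, so nothing relies on fork inheriting globals:

import numpy as np
from multiprocessing import Pool, shared_memory
from itertools import product

SHAPE = (20, 100, 100, 100, 100)

def init_worker(a_name, b_name):
    # Runs once in each worker: attach to the existing blocks by name and
    # rebuild the array views, stored in globals for test_func to use.
    global a, b, _a_shm, _b_shm
    _a_shm = shared_memory.SharedMemory(name=a_name)
    _b_shm = shared_memory.SharedMemory(name=b_name)
    a = np.ndarray(SHAPE, np.float32, buffer=_a_shm.buf)
    b = np.ndarray(SHAPE, np.float32, buffer=_b_shm.buf)

def test_func(args):
    # Defined at module level so it can be pickled on spawn-based platforms.
    a[args] + b[args[:-1]]

if __name__ == "__main__":
    a_shm = shared_memory.SharedMemory(create=True, size=8_000_000_000)
    b_shm = shared_memory.SharedMemory(create=True, size=8_000_000_000)
    try:
        a = np.ndarray(SHAPE, np.float32, buffer=a_shm.buf)
        b = np.ndarray(SHAPE, np.float32, buffer=b_shm.buf)
        a.fill(0)   # optional: pre-touch the pages, as discussed above
        b.fill(0)
        with Pool(16, initializer=init_worker,
                  initargs=(a_shm.name, b_shm.name)) as pool:
            for _ in pool.imap_unordered(
                    test_func,
                    product(range(20), range(100), range(100), range(100)),
                    chunksize=16):
                pass
    finally:
        del a, b
        a_shm.close()
        a_shm.unlink()
        b_shm.close()
        b_shm.unlink()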
