How can I share large numpy arrays between python processes, e.g. jupyter notebooks, without duplicating them in memory?


Question

I have large numpy arrays that I want to share with other Python processes on the same machine without holding copies in memory. Specifically, my use case is to share the array between Jupyter notebooks on Linux. How can I do this?

Answer 1

Score: 1

Scenario 1: The array is already in memory in process 1 and you want to share it with processes 2, 3, 4, ...

Given a numpy array arr, here is how it can be copied to shared memory:

import numpy as np
from multiprocessing import shared_memory  # requires Python 3.8+

arr = np.array([1, 2, 3])  # array in memory that you want to share
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shm_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
np.copyto(shm_arr, arr)  # copy the data into the shared buffer
print(shm.name)  # pass this name to the other processes

In processes 2, 3, 4, which can be other Jupyter notebooks, you can then access the array as follows:

import numpy as np
from multiprocessing import shared_memory

shape = (3,)  # adapt to the shape of the array in process 1
dtype = 'int64'  # adapt to the dtype of the array in process 1
shm = shared_memory.SharedMemory(name='psm_554627ea')  # adapt to the name printed above
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # now the same array as in process 1, sharing its memory
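Since both ndarrays are views over the same block of memory, a write made in one process is immediately visible in the other. A quick check, using the names defined above:

# in process 2
shm_arr[0] = 42

# back in process 1
print(shm_arr[0])  # 42, since both arrays share the same buffer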

Scenario 2: The array is on disk. In this case it can be loaded directly into shared memory, which avoids the copyto operation above and is therefore more memory-efficient.

Assume the array has been saved to disk in the past using the tofile method:

arr.tofile('/path/to/arr')

Now, in process 1 (e.g. the first Jupyter notebook), we can load it directly into shared memory:

import numpy as np
from multiprocessing import shared_memory

shape = (3,)  # adapt to the shape of the saved array; note: tofile stores neither shape nor dtype
dtype = 'int64'  # adapt to the dtype of the saved array
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * np.dtype(dtype).itemsize)
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # numpy array backed by the shared memory
with open('/path/to/arr', 'rb') as f:
    f.readinto(shm.buf)  # fill the shared buffer directly with the data from disk
print(shm.name)  # pass this name to the other processes
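If you would rather not track shape and dtype by hand, np.save writes them into the .npy header. Here is a minimal sketch of loading such a file into shared memory, assuming it was written with np.save to the hypothetical path '/path/to/arr.npy' in .npy format version 1.0:

import numpy as np
from multiprocessing import shared_memory

with open('/path/to/arr.npy', 'rb') as f:
    assert np.lib.format.read_magic(f) == (1, 0)  # this sketch only handles format version 1.0
    shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * dtype.itemsize)
    f.readinto(shm.buf)  # the file position is now at the raw data, right after the header
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf, order='F' if fortran_order else 'C')
print(shm.name)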

In processes 2, 3, 4, you can then access the array as in scenario 1.
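One caveat in both scenarios: the shared-memory block is not freed when the notebooks exit; it has to be released explicitly. A minimal teardown sketch using the standard SharedMemory API:

# in each attaching process (2, 3, 4, ...), once the array is no longer needed:
shm.close()   # detach this process from the block (shm_arr must not be used afterwards)

# in exactly one process (typically the creator), after all others have closed:
shm.unlink()  # destroy the block and release the memory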
