How can I share large numpy arrays between python processes, e.g. jupyter notebooks, without duplicating them in memory?

Question

I have large NumPy arrays that I want to share with other Python processes on the same machine without holding copies in memory. Specifically, my use case is to share the array between Jupyter notebooks on Linux. How can I do this?

Answer 1

Score: 1

Scenario 1: The array is already in memory in process 1 and you want to share it with processes 2, 3, 4, ...

Given a NumPy array arr, here is how to copy it into shared memory:

import numpy as np
from multiprocessing import shared_memory  # requires Python >= 3.8

arr = np.array([1, 2, 3])  # the in-memory array you want to share
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)  # allocate a shared-memory block of the right size
shm_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)  # numpy array backed by that block
np.copyto(shm_arr, arr)  # copy the data into shared memory
print(shm.name)  # other processes attach via this name

In processes 2, 3, 4, which can be other Jupyter notebooks, you can then attach to the array as follows:

from multiprocessing import shared_memory
import numpy as np

shape = (3,)  # adapt to the shape of the array in process 1
dtype = 'int64'  # adapt to the dtype of the array in process 1
shm = shared_memory.SharedMemory(name='psm_554627ea')  # adapt to the name printed above
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # the same array as in process 1, sharing its memory
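Because both shm_arr objects are views of the same block, a write in one process is immediately visible in the others. Here is a minimal, self-contained sketch of the whole flow in one script, with multiprocessing.Process standing in for the second notebook (the worker function is illustrative, not part of the API):

import numpy as np
from multiprocessing import Process, shared_memory

def worker(shm_name, shape, dtype):
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, as process 2 would
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    arr[0] = 42  # this write is visible to every attached process
    shm.close()  # detach; only the creator should unlink

if __name__ == '__main__':
    src = np.array([1, 2, 3])
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    shm_arr = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    np.copyto(shm_arr, src)

    p = Process(target=worker, args=(shm.name, src.shape, src.dtype))
    p.start()
    p.join()

    print(shm_arr)  # [42  2  3] -- the child's write shows up here
    shm.close()
    shm.unlink()  # release the block once it is no longer needed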

Scenario 2: The array is on disk. This lets you load it directly into shared memory, avoiding the copyto operation above, and is therefore more memory efficient.

Assume the array was previously saved to disk using the tofile method:

arr.tofile('/path/to/arr')
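Note that tofile writes only the raw bytes; shape and dtype are not stored in the file, so you have to record them yourself. One possible convention (the sidecar file /path/to/arr.json is just an illustrative choice, not something tofile does for you):

import json
import numpy as np

arr = np.array([1, 2, 3])
arr.tofile('/path/to/arr')  # raw bytes only

# keep the metadata that tofile drops in a small sidecar file
with open('/path/to/arr.json', 'w') as f:
    json.dump({'shape': arr.shape, 'dtype': str(arr.dtype)}, f)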

Now, in process 1 (e.g. the first Jupyter notebook), we can load it directly into shared memory:

import numpy as np
from multiprocessing import shared_memory

shape = (3,)  # adapt to the shape of the saved array; note: tofile stores neither shape nor dtype
dtype = 'int64'  # adapt to the dtype of the saved array
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * np.dtype(dtype).itemsize)
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # numpy array backed by the shared memory
with open('/path/to/arr', 'rb') as f:
    f.readinto(shm.buf)  # fill the shared buffer directly with the data from disk
print(shm.name)  # other processes attach via this name

In processes 2, 3, 4, you can then attach to the array exactly as in Scenario 1.
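One final point on cleanup: a shared-memory block persists until it is explicitly released, even after the creating process exits. When you are done, every process should close its handle, and exactly one process (conventionally the creator) should unlink the block:

# in processes 2, 3, 4: detach when done
shm.close()

# in process 1, once no other process needs the array anymore:
shm.close()
shm.unlink()  # destroys the block; its name can no longer be attached to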
