How can I share large numpy arrays between python processes, e.g. jupyter notebooks, without duplicating them in memory?
Question
I have large numpy arrays that I want to share with other Python processes on the same machine without holding copies in memory. Specifically, my use case is to share the array between Jupyter notebooks on Linux. How can I do this?
Answer 1
Score: 1
Scenario 1: The array is already in memory in process 1 and you want to share it with processes 2, 3, 4, ...
Given a numpy array arr, here is how to copy it into shared memory:
import numpy as np
from multiprocessing import shared_memory  # requires Python 3.8+

arr = np.array([1, 2, 3])  # array already in memory that you want to share
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
shm_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
np.copyto(shm_arr, arr)
print(shm.name)  # e.g. 'psm_554627ea'; pass this name to the other processes
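To convince yourself that the block is genuinely shared rather than copied, you can attach a second view by name, even inside the same interpreter, and watch a write through one view appear in the other. A minimal sketch (view and view_arr are just local names chosen for this example):

view = shared_memory.SharedMemory(name=shm.name)  # attach to the existing block by name
view_arr = np.ndarray(arr.shape, dtype=arr.dtype, buffer=view.buf)
shm_arr[0] = 42     # write through the first view
print(view_arr[0])  # prints 42: both views map the same memory
view.close()        # detach the extra view; the block itself stays alive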
In processes 2, 3, 4, which can be other Jupyter notebooks, you can then access the array as follows:
import numpy as np
from multiprocessing import shared_memory

shape = (3,)     # adapt to the shape of the array in process 1
dtype = 'int64'  # adapt to the dtype of the array in process 1
shm = shared_memory.SharedMemory(name='psm_554627ea')  # adapt to the name printed above
shm_arr = np.ndarray(shape, dtype, buffer=shm.buf)  # now the same array as in process 1 (sharing its memory)
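One thing the snippets above do not cover: shared memory blocks outlive the Python objects that map them, so they should be released explicitly when no longer needed. A sketch of the usual teardown, assuming the variables from the snippets above (readers detach with close(); only the creating process calls unlink()):

# In processes 2, 3, 4, when done with the array:
del shm_arr  # drop the numpy view first, so no buffer references remain
shm.close()  # detach this process from the block

# In process 1, once the readers no longer need the block:
del shm_arr
shm.close()
shm.unlink()  # free the block itself; its name becomes invalid afterwards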
Scenario 2: The array is on disk. In this case it can be loaded directly into shared memory, avoiding the copyto step above, which makes it more memory efficient.
Assume the array was saved to disk in the past using the tofile method:
arr.tofile('/path/to/arr')
Now, in process 1 (e.g. the first Jupyter notebook), we can load it directly into shared memory:
import numpy as np
from multiprocessing import shared_memory

shape = (3,)     # adapt to the shape of the array saved earlier; note: shape and dtype are not stored in the file
dtype = 'int64'  # adapt to the corresponding dtype
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * np.dtype(dtype).itemsize)
shm_arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)  # numpy array backed by the shared memory
with open('/path/to/arr', 'rb') as f:
    f.readinto(shm.buf)  # fill the shared buffer with the data from disk
print(shm.name)  # pass this name to the other processes
In processes 2, 3, 4, you can then access the array as in scenario 1.
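Since tofile writes only the raw bytes, the shape and dtype have to be communicated out of band. One simple workaround is to store them in a small sidecar file next to the array; a sketch, where the .meta filename is just a convention chosen for this example:

import json

# when saving, next to arr.tofile('/path/to/arr'):
with open('/path/to/arr.meta', 'w') as f:
    json.dump({'shape': arr.shape, 'dtype': str(arr.dtype)}, f)

# when loading into shared memory later:
with open('/path/to/arr.meta') as f:
    meta = json.load(f)
shape, dtype = tuple(meta['shape']), meta['dtype']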
Comments