Populating large matrix with values
Question
I have a 100K by 12 by 100K matrix that I need to populate with computation results. I tried creating it using numpy.empty but got a memory error.
So I turned to dask instead, and I'm able to create the dask array. I run a function that creates a vector as I traverse the 0th and 1st dimensions in a for loop, then assign that vector to position (i, j) of the matrix. If I populate the dask array as is, the assignment step alone takes 50 milliseconds, which is far too long when extrapolated over every (i, j) cell in the matrix.
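To put rough numbers on both problems, here is a back-of-the-envelope sketch. It assumes float32 storage (4 bytes per element, as in the code that follows) and takes the 50 ms figure at face value for every assignment:

```python
# Back-of-the-envelope numbers for the array described above.
shape = (100_000, 12, 100_000)
bytes_per_element = 4  # float32

n_elements = shape[0] * shape[1] * shape[2]
total_bytes = n_elements * bytes_per_element
total_gb = total_bytes / 1e9

n_assignments = shape[0] * shape[1]  # one vector per (i, j) pair
ms_per_assignment = 50
total_hours = n_assignments * ms_per_assignment / 3_600_000

print(f"{n_elements:,} elements, about {total_gb:.0f} GB as float32")
print(f"{n_assignments:,} assignments, about {total_hours:.1f} hours at 50 ms each")
```

The dense array is far larger than typical RAM, which is consistent with the memory error from numpy.empty, and the assignment loop alone would run for most of a day.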
It seems it should be possible to speed up the assignment with dask's delayed function, but I can't figure out how.
Here's how this looks without delayed:
    import dask.array as da
    from dask import compute, delayed
    import numpy as np

    test_arr = da.empty(shape=(10000, 12, 10000), dtype='float32')

    for i in range(test_arr.shape[0]):
        for j in range(test_arr.shape[1]):
            vals = np.random.normal(size=test_arr.shape[2])
            test_arr[i, j, :] = vals
And here is my attempt at using delayed:
    def populate_array(i, j, vec):
        test_arr[i, j, :] = vec
        return test_arr

    for i in range(test_arr.shape[0]):
        for j in range(test_arr.shape[1]):
            vals = np.random.normal(size=test_arr.shape[2])
            delayed(populate_array)(i, j, vals)

    compute(test_arr)
The latter doesn't error but just seems to return an array with all zeroes.
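One way to see why the array comes back unchanged: delayed(f)(...) only builds a task and returns a Delayed object, and nothing runs until that object is computed. In the loop above the Delayed objects are discarded, so populate_array is never called, and compute(test_arr) evaluates the original da.empty array. A minimal sketch of that behavior, using a hypothetical record function in place of populate_array:

```python
from dask import compute, delayed

calls = []  # records which inputs the function actually ran on

def record(i):
    # Hypothetical stand-in for populate_array: its side effect shows when it runs.
    calls.append(i)
    return i

# Same pattern as the loop above: the Delayed objects are created and thrown away.
for i in range(3):
    delayed(record)(i)
calls_after_discarded_loop = list(calls)

# Keeping the Delayed objects and computing them is what triggers the work.
tasks = [delayed(record)(i) for i in range(3)]
results = compute(*tasks)

print(calls_after_discarded_loop)
print(results, sorted(calls))
```

The first print shows an empty list because the discarded tasks never executed; only the tasks passed to compute run.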
I know that I can also speed this up by getting rid of the for loop and vectorizing, but assume that is currently not feasible.
I'm not tied to dask per se, but it seems like a practical approach with a familiar syntax for someone coming from pandas / numpy.
Update:
The accepted answer works, but the task stream has a lot of blank spaces (screenshot: https://i.stack.imgur.com/TeElO.gif). I bring this up because my actual use case, with a complex create_array_chunk formula, just hangs, and I cannot see the dashboard or what's going on.
Answer 1
Score: 1
This is how I'd do it. Rather than filling an existing Dask array, you build it chunk by chunk:
    import dask.array as da
    from dask import delayed
    import numpy as np

    shape = (10000, 12, 10000)

    def create_array_chunk(i, j, k):
        # should probably use the i and j information here
        return np.random.normal(size=k)

    i_arrays = []
    for i in range(shape[0]):
        j_arrays = []
        for j in range(shape[1]):
            # wrap each delayed vector as a lazy 1-D Dask array of known shape and dtype
            darray = da.from_delayed(
                delayed(create_array_chunk)(i, j, shape[2]),
                dtype=np.float64,
                shape=(shape[2],),
            )
            j_arrays.append(darray)
        # stack the 12 vectors for this i into a (12, 10000) array
        j_stack = da.stack(j_arrays, axis=0)
        i_arrays.append(j_stack)

    # stack everything into the final (10000, 12, 10000) array
    j_stack = da.stack(i_arrays, axis=0)
j_stack is a Dask array of shape (10000, 12, 10000). It's a lazy structure: nothing has been computed yet. Be careful: if you call compute on it, it will be converted to a NumPy array and take a lot of memory. You might want to stream it to disk using to_zarr or an equivalent.
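As a rough illustration of streaming to disk, the sketch below builds a tiny array with the same from_delayed and stack pattern and writes it out with da.store. It swaps in a NumPy memory-mapped .npy file as the target rather than to_zarr, only to avoid an extra dependency; the small shape, the temporary file path, and the chunk contents are assumptions made so the result is easy to check:

```python
import os
import tempfile

import dask.array as da
import numpy as np
from dask import delayed

shape = (4, 3, 5)  # tiny stand-in for (10000, 12, 10000)

def create_array_chunk(i, j, k):
    # Hypothetical chunk content: encodes i and j so the output can be verified.
    return np.full(k, 10 * i + j, dtype=np.float64)

rows = []
for i in range(shape[0]):
    chunks = [
        da.from_delayed(
            delayed(create_array_chunk)(i, j, shape[2]),
            shape=(shape[2],),
            dtype=np.float64,
        )
        for j in range(shape[1])
    ]
    rows.append(da.stack(chunks, axis=0))
lazy = da.stack(rows, axis=0)  # still lazy at this point

# Create a memory-mapped .npy file of the full shape to act as the on-disk target.
path = os.path.join(tempfile.mkdtemp(), "result.npy")
target = np.lib.format.open_memmap(
    path, mode="w+", dtype=lazy.dtype, shape=lazy.shape
)

# Compute chunk by chunk, writing each chunk into its region of the file.
da.store(lazy, target)
target.flush()

# Reopen the file lazily to check what was written.
on_disk = np.load(path, mmap_mode="r")
print(on_disk.shape, on_disk[2, 1, 0])
```

Because each chunk is written as it is produced, the full array never has to exist in memory as a single NumPy array. With zarr installed, calling to_zarr on the Dask array, as suggested above, plays the same role.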