July 14, 2023, 05:39:23

Populating large matrix with values

Question

Image linked from the question's update: https://i.stack.imgur.com/TeElO.gif

I have a 100K by 12 by 100K matrix that I need to populate with computation results. I tried creating it using numpy.empty but got a memory error.

So I turned to dask instead. I'm able to create the dask array. I'm running a function that creates a vector as I traverse the 0th and 1st dimensions in a for loop. I then write this vector into position (i, j) of the matrix. If I just populate the dask array as is, the assignment step alone takes 50 milliseconds, which is way too long when extrapolated to every (i, j) cell in the matrix.

It seems it should be possible to speed up the assignment with dask's delayed function, but I can't figure out how.

Here's how this looks without delayed:

import dask.array as da
import numpy as np
from dask import compute, delayed
test_arr = da.empty(shape=(10000, 12, 10000), dtype='float32')
for i in range(test_arr.shape[0]):
    for j in range(test_arr.shape[1]):
        vals = np.random.normal(size=test_arr.shape[2])
        test_arr[i, j, :] = vals  # this assignment is the slow step (about 50 ms each)

And here is my attempt at using delayed:

def populate_array(i, j, vec):
    test_arr[i, j, :] = vec
    return test_arr
for i in range(test_arr.shape[0]):
    for j in range(test_arr.shape[1]):
        vals = np.random.normal(size=test_arr.shape[2])
        delayed(populate_array)(i, j, vals)
compute(test_arr)

The latter doesn't raise an error, but it seems to return an array of all zeros.
I know that I could also speed this up by removing the for loop and vectorizing, but assume that is not currently feasible.

I'm not tied to dask per se, but it seems like a practical approach with a familiar syntax for someone coming from pandas / numpy.

Update:
The accepted answer works, but the task stream has a lot of blank spaces. I bring this up because my actual use case, with a complex create_array_chunk formula, just hangs. I cannot see the dashboard or what's going on.

Answer 1

Score: 1

This is how I'd do it. You don't fill an existing Dask array; you build it chunk by chunk:

import dask.array as da
import numpy as np
from dask import delayed
shape = (10000, 12, 10000)
def create_array_chunk(i, j, k):
    # Placeholder computation: a real version should probably use i and j here.
    return np.random.normal(size=k)
i_arrays = []
for i in range(shape[0]):
    j_arrays = []
    for j in range(shape[1]):
        # Wrap each lazy 1-D vector as a Dask array with a known shape and dtype.
        darray = da.from_delayed(
            delayed(create_array_chunk)(i, j, shape[2]),
            dtype=np.float64,
            shape=(shape[2],),
        )
        j_arrays.append(darray)
    # Stack the 12 vectors for this i into one (12, 10000) array.
    i_arrays.append(da.stack(j_arrays, axis=0))
# Stack all the per-i arrays into the final (10000, 12, 10000) array.
j_stack = da.stack(i_arrays, axis=0)

j_stack is a Dask array of shape (10000, 12, 10000).
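For a sense of scale, the size of a dense array of that shape can be estimated with plain arithmetic. The sketch below uses only the standard library; array_nbytes is a hypothetical helper name, and the two cases are the float64 shape from the code above and the full 100K by 12 by 100K float32 problem described in the question.

```python
import math


def array_nbytes(shape, itemsize):
    """Bytes needed to hold a dense array with this shape and element size."""
    return math.prod(shape) * itemsize


# Shape used in the answer's code, stored as float64 (8 bytes per element).
demo_bytes = array_nbytes((10_000, 12, 10_000), 8)
# Full-size problem from the question, stored as float32 (4 bytes per element).
full_bytes = array_nbytes((100_000, 12, 100_000), 4)

print(f"demo: {demo_bytes / 1e9:.1f} GB")  # demo: 9.6 GB
print(f"full: {full_bytes / 1e9:.1f} GB")  # full: 480.0 GB
```

Neither figure includes temporary copies made during computation, so actual peak usage could be higher.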

It's a lazy structure: nothing has been computed yet. Be careful: if you call compute on it, it will be converted to a NumPy array and take a lot of memory. You might want to stream it to disk using to_zarr or an equivalent.
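A minimal sketch of one way to write the result to disk without calling compute, under several assumptions. It swaps in da.store with a memory-mapped .npy file in place of to_zarr, only so the example does not need the zarr package. It also groups several rows of axis 0 into each chunk. In the code above every chunk is a single vector, so the full array has 120,000 chunks, and larger chunks are a common way to reduce scheduling overhead. Whether that overhead explains the gaps in the asker's task stream is not something this sketch establishes. The names create_block and rows_per_chunk are hypothetical, and the shape is a small stand-in so the example runs quickly.

```python
import os
import tempfile

import dask.array as da
import numpy as np
from dask import delayed

# Small stand-in shape (assumption); the real problem is far larger.
shape = (40, 12, 500)
rows_per_chunk = 10  # hypothetical tuning knob: rows of axis 0 per chunk


def create_block(i_start, i_stop, n_j, n_k):
    """Fill one (rows, n_j, n_k) block; the body is a placeholder computation."""
    block = np.empty((i_stop - i_start, n_j, n_k), dtype=np.float32)
    for offset, i in enumerate(range(i_start, i_stop)):
        for j in range(n_j):
            # A real version would compute the vector from i and j.
            block[offset, j, :] = np.random.normal(size=n_k)
    return block


blocks = []
for i_start in range(0, shape[0], rows_per_chunk):
    i_stop = min(i_start + rows_per_chunk, shape[0])
    lazy_block = delayed(create_block)(i_start, i_stop, shape[1], shape[2])
    blocks.append(
        da.from_delayed(
            lazy_block,
            shape=(i_stop - i_start, shape[1], shape[2]),
            dtype=np.float32,
        )
    )
arr = da.concatenate(blocks, axis=0)  # still lazy: 4 chunks rather than 480

# Write each chunk into a memory-mapped .npy file as it is computed.
out_path = os.path.join(tempfile.mkdtemp(), "result.npy")
target = np.lib.format.open_memmap(
    out_path, mode="w+", dtype=np.float32, shape=shape
)
da.store(arr, target)  # computes the chunks and assigns them into `target`
target.flush()
```

The file can later be opened with np.load(out_path, mmap_mode="r"), which reads slices on demand instead of loading the whole array.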
