How to vectorise triple for looped cumulative sum
Question
I want to vectorise the triple sum $\sum_{i=1}^I\sum_{j=1}^J\sum_{m=1}^J a_{ijm}$ such that I end up with a matrix $A \in \mathbb{R}^{I \times J}$ where $A_{kl} = \sum_{i=1}^k\sum_{j=1}^l\sum_{m=1}^l a_{ijm}$ for $k = 1,\dots,I$ and $l = 1,\dots,J$, carrying forward the sums to avoid pointless recomputation.
I currently use this code:
np.cumsum(np.cumsum(np.cumsum(a, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
but it is inefficient as it does lots of extra work and extracts the correct matrix at the end with the diagonal method. I can't think of how to make this faster.
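For reference (not part of the original question), a direct transcription of the definition, assuming a is an I x J x J NumPy array, is the brute-force function below; it recomputes each block sum from scratch, so it is only useful for checking results on small inputs.

import numpy as np

def triple_sum_reference(a):
    # Brute-force A[k, l] = sum over i <= k, j <= l, m <= l of a[i, j, m]
    I, J, _ = a.shape
    A = np.zeros((I, J))
    for k in range(I):
        for l in range(J):
            A[k, l] = a[:k + 1, :l + 1, :l + 1].sum()
    return A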
Answer 1
Score: 2
The main challenge here is to compute the inner two sums, i.e. the sum of the square slices of a matrix originating from the top left. The final sum is just a cumsum on top of that along the 0th axis.
Setup:
import numpy as np
I, J = 100, 100
arr = np.random.rand(I, J, J)
Your implementation:
%%timeit
out = np.cumsum(np.cumsum(np.cumsum(arr, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
# 10.9 ms ± 162 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Your implementation improved by taking the diagonal before cumsumming over the 0th axis:
%%timeit
out = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
# 6.25 ms ± 34.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Finally, some tril/triu trickery: summing the strictly lower triangle of each (j, m) slice along axis 2 and the inclusive upper triangle along axis 1, then cumsumming along axis 1, counts every element with j, m <= l exactly once, which yields the square-slice sums directly:
%%timeit
out = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)
# 3.15 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
which is already better, but admittedly still not ideal. I don't see a better way to compute the inner two sums noted above with pure numpy.
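A quick sanity check (not part of the original answer) that the three variants above agree on the arr from the setup:

out1 = np.cumsum(np.cumsum(np.cumsum(arr, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
out2 = arr.cumsum(axis=1).cumsum(axis=2).diagonal(axis1=1, axis2=2).cumsum(axis=0)
out3 = np.cumsum(np.cumsum(np.tril(arr, k=-1).sum(axis=2) + np.triu(arr).sum(axis=1), axis=1), axis=0)

print(np.allclose(out1, out2), np.allclose(out1, out3))  # should print: True True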
Answer 2
Score: 1
You can use Numba to produce a very fast implementation. Here is the code:
import numba as nb
import numpy as np

@nb.njit('(float64[:,:,::1],)', parallel=True)
def compute(arr):
    ni, nj, nk = arr.shape
    assert nj == nk
    result = np.empty((ni, nj))

    # Parallel cumsum along axes 1 and 2 + extraction of the diagonal
    for i in nb.prange(ni):
        tmp = np.zeros(nk)
        for j in range(nj):
            for k in range(nk):
                tmp[k] += arr[i, j, k]
            result[i, j] = np.sum(tmp[:j+1])

    # Cumsum along axis 0
    for i in range(1, ni):
        for k in range(nk):
            result[i, k] += result[i-1, k]

    return result

result = compute(a)
Here are performance results on my 6-core i5-9600KF with a 100x100x100 float64 input array:
Initial code: 12.7 ms
Chryophylaxs v1: 7.1 ms
Chryophylaxs v2: 5.5 ms
Numba: 0.2 ms
This implementation is significantly faster than all the others. It is about 64 times faster than the initial implementation. It is also actually optimal on my machine, since it completely saturates my RAM bandwidth just reading the input array (which is mandatory). Note that it is better not to use multiple threads for very small arrays.
Note that this code also uses far less memory, as it only needs 8 * nk * num_threads bytes of temporary storage, as opposed to 16 * ni * nj * nk bytes for the initial solution.
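As a usage sketch (not from the original answer), the Numba version can be called and cross-checked against the NumPy one-liner from the question; note that the signature requires a C-contiguous float64 array, and the first call pays the JIT compilation cost:

import numpy as np

a = np.random.rand(100, 100, 100)  # C-contiguous float64, as required by the njit signature

expected = np.cumsum(np.cumsum(np.cumsum(a, axis=0), axis=1), axis=2).diagonal(axis1=1, axis2=2)
result = compute(a)  # first call triggers JIT compilation

print(np.allclose(result, expected))  # should print: True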