Optimizing np.einsum calls in Python

Question
I have two numpy arrays: X of dimension (N,N,N,N) and Y of dimension (N,N). My goal is to evaluate the following einsum call as fast as possible:
Z = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)
To do this, I benchmarked three implementations of the above, which yield identical answers but are optimized differently:
def einsum_one(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

def einsum_two(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

def fast_einsum(X, Y):
    Z = np.einsum('ij,iiii->ij', Y, X)
    W = np.einsum('il,ik->ikl', Y, Y)
    Z = np.einsum('ij,im->ijm', Z, Y)
    Z = np.einsum('ijm,ikl->jklm', Z, W)
    return Z
The way I got fast_einsum was by looking at the output of np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')[1], which was:
Complete contraction: iiii,ij,ik,il,im->jklm
Naive scaling: 5
Optimized scaling: 5
Naive FLOP count: 1.638e+05
Optimized FLOP count: 6.662e+04
Theoretical speedup: 2.459
Largest intermediate: 4.096e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
2 ij,iiii->ij ik,il,im,ij->jklm
3 il,ik->ikl im,ij,ikl->jklm
3 ij,im->ijm ikl,ijm->jklm
5 ijm,ikl->jklm jklm->jklm
So, my expectation is as follows: for large N, fast_einsum should be the fastest and einsum_one should be the slowest. For einsum_two there should be some overhead which is noticeable for small N but negligible for large N. However, when I benchmarked the above, I obtained the following times (values are in microseconds):
 N | einsum_one | einsum_two | fast_einsum
 4 |       15.1 |        949 |        13.8
 8 |        256 |        966 |        87.9
12 |       1920 |       1100 |         683
16 |       7460 |       1180 |        2490
20 |      21800 |       1390 |        7080
So my implementation of fast_einsum is not optimal. My question is: how do I interpret the output of np.einsum_path so that I can make the np.einsum call as fast as possible, without having to incur the overhead of the optimize='optimal' parameter?
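As a quick sanity check (a sketch, not part of the original benchmark), the three implementations can be verified to agree on small random inputs; the manual steps below are the contraction order from fast_einsum:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
X = rng.standard_normal((N, N, N, N))
Y = rng.standard_normal((N, N))

# one-shot calls, unoptimized and optimized
a = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)
b = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

# manual contraction order from fast_einsum
Z = np.einsum('ij,iiii->ij', Y, X)    # fold the diagonal of X into Y
W = np.einsum('il,ik->ikl', Y, Y)     # outer product over the shared i axis
Z = np.einsum('ij,im->ijm', Z, Y)
c = np.einsum('ijm,ikl->jklm', Z, W)

print(np.allclose(a, b), np.allclose(a, c))  # → True True
```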
Answer 1
Score: 1
With a small N=10 (my machine isn't large enough to use 100, which would mean 8 Mb arrays), here are some quick timings.
Precompute the path:
In [183]: path=np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)[0]
default:
In [184]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=False)
2.12 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
True:
In [185]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)
509 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [186]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize='optimal')
3.31 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
path appears to be the same for True and 'optimal', but the times are quite different. I don't know why.
Using the precomputed path:
In [187]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=path)
291 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [188]: path
Out[188]: ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]
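One way to read this path (a sketch, not from the original answer): each (0, 1) tuple says "contract the first two operands in the current operand list"; the contracted operands are removed and the intermediate is appended to the end of the list. The list itself can be passed straight to optimize:

```python
import numpy as np

N = 10
X = np.random.rand(N, N, N, N)
Y = np.random.rand(N, N)

# The path printed above, written out literally. Step 1 contracts
# X ('iiii') with the first Y ('ij'); each later (0, 1) contracts the
# first two operands of what remains, appending the result at the end.
path = ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]

M = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize=path)
ref = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize=False)
print(np.allclose(M, ref))  # → True
```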
Since all dimensions are the same, I'm not sure the path makes a big difference. That is, all the Y operands are the same, so they can be ordered in any way. Anyway, the path calculation is, to one degree or another, heuristic, and not guaranteed to be optimal for all sizes. As you found, there's an important scaling factor, partly due to the cost of calculating the path, and partly (I suspect) due to the size of the intermediate arrays and the resulting memory-management issues.
einsum_path explains the alternatives:
* if a list is given that starts with ``einsum_path``, uses this as the
  contraction path
* if False no optimization is taken
* if True defaults to the 'greedy' algorithm
* 'optimal' An algorithm that combinatorially explores all possible
  ways of contracting the listed tensors and chooses the least costly
  path. Scales exponentially with the number of terms in the
  contraction.
* 'greedy' An algorithm that chooses the best pair contraction
  at each step. Effectively, this algorithm searches the largest inner,
  Hadamard, and then outer products at each step. Scales cubically with
  the number of terms in the contraction. Equivalent to the 'optimal'
  path for most contractions.
Ok, this explains why 'optimal' is slower - it's looking for more alternatives. Given these dimensions, True, 'greedy' and 'optimal' all end up with the same path.