Optimizing np.einsum calls in Python

Question
I have two numpy arrays: X of dimension (N,N,N,N) and Y of dimension (N,N). My goal is to evaluate the following einsum call as fast as possible:
Z = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)
To do this, I benchmarked three implementations of the above, which yield identical answers but are optimized differently:
def einsum_one(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)

def einsum_two(X, Y):
    return np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

def fast_einsum(X, Y):
    Z = np.einsum('ij,iiii->ij', Y, X)
    W = np.einsum('il,ik->ikl', Y, Y)
    Z = np.einsum('ij,im->ijm', Z, Y)
    Z = np.einsum('ijm,ikl->jklm', Z, W)
    return Z
The way I got fast_einsum was by looking at the output of np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')[1], which was:
Complete contraction: iiii,ij,ik,il,im->jklm
Naive scaling: 5
Optimized scaling: 5
Naive FLOP count: 1.638e+05
Optimized FLOP count: 6.662e+04
Theoretical speedup: 2.459
Largest intermediate: 4.096e+03 elements
--------------------------------------------------------------------------
scaling current remaining
--------------------------------------------------------------------------
2 ij,iiii->ij ik,il,im,ij->jklm
3 il,ik->ikl im,ij,ikl->jklm
3 ij,im->ijm ikl,ijm->jklm
5 ijm,ikl->jklm jklm->jklm
So, my expectation is as follows: for large N, fast_einsum should be the fastest and einsum_one should be the slowest. For einsum_two there should be some overhead which is noticeable for small N but negligible for large N. However, when I benchmarked the above, I obtained the following times (values are in microseconds):
 N | einsum_one | einsum_two | fast_einsum
 4 |       15.1 |        949 |        13.8
 8 |        256 |        966 |        87.9
12 |       1920 |       1100 |         683
16 |       7460 |       1180 |        2490
20 |      21800 |       1390 |        7080
So my implementation of fast_einsum is not optimal. My question is: how do I interpret the output of np.einsum_path so that I can make the np.einsum call as fast as possible, without having to incur the overhead of the optimize='optimal' parameter?
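As a quick sanity check (a sketch, not part of the original benchmark), the three implementations can be verified to agree on small random inputs; the manual steps below are the contraction order from fast_einsum:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
X = rng.standard_normal((N, N, N, N))
Y = rng.standard_normal((N, N))

# one-shot calls, unoptimized and optimized
a = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y)
b = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize='optimal')

# manual contraction order from fast_einsum
Z = np.einsum('ij,iiii->ij', Y, X)    # fold the diagonal of X into Y
W = np.einsum('il,ik->ikl', Y, Y)     # outer product over the shared i axis
Z = np.einsum('ij,im->ijm', Z, Y)
c = np.einsum('ijm,ikl->jklm', Z, W)

print(np.allclose(a, b), np.allclose(a, c))  # → True True
```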
Answer 1
Score: 1
With a small N=10 (my machine isn't large enough to use 100, which would mean 8 Mb arrays), here are some quick timings.
Precompute the path:
In [183]: path=np.einsum_path('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)[0]
default:
In [184]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=False)
2.12 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
True:
In [185]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=True)
509 µs ± 687 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [186]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize='optimal')
3.31 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
path appears to be the same for True and 'optimal', but the times are quite different. I don't know why.
Using the precomputed path:
In [187]: timeit M=np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y,optimize=path)
291 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [188]: path
Out[188]: ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]
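One way to read this path (a sketch, not from the original answer): each (0, 1) tuple says "contract the first two operands in the current operand list"; the contracted operands are removed and the intermediate is appended to the end of the list. The list itself can be passed straight to optimize:

```python
import numpy as np

N = 10
X = np.random.rand(N, N, N, N)
Y = np.random.rand(N, N)

# The path printed above, written out literally. Step 1 contracts
# X ('iiii') with the first Y ('ij'); each later (0, 1) contracts the
# first two operands of what remains, appending the result at the end.
path = ['einsum_path', (0, 1), (0, 1), (0, 1), (0, 1)]

M = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize=path)
ref = np.einsum('iiii,ij,ik,il,im->jklm', X, Y, Y, Y, Y, optimize=False)
print(np.allclose(M, ref))  # → True
```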
Since all dimensions are the same, I'm not sure the path makes a big difference. That is, all the Y operands are the same, so they can be ordered in any way. Anyway, the path calculation is, to one degree or another, heuristic, and not guaranteed to be optimal for all sizes. As you found, there's an important scaling factor, partly due to the cost of calculating the path, and partly (I suspect) due to the size of the intermediate arrays and the resulting memory-management issues.
einsum_path explains the alternatives:
* if a list is given that starts with ``einsum_path``, uses this as the
  contraction path
* if False no optimization is taken
* if True defaults to the 'greedy' algorithm
* 'optimal' An algorithm that combinatorially explores all possible
  ways of contracting the listed tensors and chooses the least costly
  path. Scales exponentially with the number of terms in the
  contraction.
* 'greedy' An algorithm that chooses the best pair contraction
  at each step. Effectively, this algorithm searches the largest inner,
  Hadamard, and then outer products at each step. Scales cubically with
  the number of terms in the contraction. Equivalent to the 'optimal'
  path for most contractions.
Ok, this explains why 'optimal' is slower - it's looking for more alternatives. Given these dimensions, True, 'greedy' and 'optimal' all end up with the same path.