Explicit matrix multiplication much faster than numpy.matmul?

Question

In a Python program, I need at some point to multiply two large lists of 2x2 matrices, individually. In the code, both of these lists are numpy arrays with shape (n,2,2). The expected result is another (n,2,2) array, where matrix 1 is the result of the multiplication of matrix 1 of the first list with matrix 1 of the second list, and so on.

After some profiling, I found that matrix multiplication was a performance bottleneck. Out of curiosity, I tried writing the matrix multiplication "explicitly". Below is a code example with measured runtimes.

import timeit
import numpy as np

def explicit_2x2_matrices_multiplication(
    mats_a: np.ndarray, mats_b: np.ndarray
) -> np.ndarray:
    matrices_multiplied = np.empty_like(mats_b)
    for i in range(2):
        for j in range(2):
            matrices_multiplied[:, i, j] = (
                mats_a[:, i, 0] * mats_b[:, 0, j] + mats_a[:, i, 1] * mats_b[:, 1, j]
            )

    return matrices_multiplied


matrices_a = np.random.random((1000, 2, 2))
matrices_b = np.random.random((1000, 2, 2))

assert np.allclose( # Checking that the explicit version is correct 
    matrices_a @ matrices_b,
    explicit_2x2_matrices_multiplication(matrices_a, matrices_b),
)

print(  # 1.1814142999992328 seconds
    timeit.timeit(lambda: matrices_a @ matrices_b, number=10000)
)
print(  # 1.1954495010013488 seconds
    timeit.timeit(lambda: np.matmul(matrices_a, matrices_b), number=10000)
)
print(  # 2.2304022700009227 seconds
    timeit.timeit(lambda: np.einsum('lij,ljk->lik', matrices_a, matrices_b), number=10000)
)
print(  # 0.19581600800120214 seconds
    timeit.timeit(
        lambda: explicit_2x2_matrices_multiplication(matrices_a, matrices_b),
        number=10000,
    )
)

As verified in the code, this function produces the same result as the regular matrix __matmul__. What is not the same, however, is the speed: on my machine, the explicit expression is up to 10 times faster.

This is quite a surprising result to me. I would have expected the numpy expression to be faster, or at least comparable to the longer Python version, not an order of magnitude slower as we see here. I would be curious about insights as to why the performance difference is so drastic.

I am running numpy version 1.25, and Python version 3.10.6.

Answer 1

Score: 3

TL;DR: All the methods provided in the question are very inefficient. In fact, Numpy is clearly sub-optimal here, and there is no way to compute this efficiently with Numpy alone. Still, there are faster solutions than the ones provided in the question.


Explanation and faster implementation

The Numpy code makes use of powerful generic iterators to apply a given computation (like a matrix multiplication) to multiple array slices. Such iterators are useful for implementing broadcasting and also for generating a relatively simple implementation of einsum. The thing is, they are also quite expensive when the number of iterations is huge and the arrays are small. This is exactly what happens in your use-case. While the overhead of Numpy's iterators can be mitigated by optimizing the Numpy code, there is no way to reduce this overhead to a negligible time in this specific use-case. Indeed, there are only 12 floating-point operations to perform per matrix. A relatively recent x86-64 processor can compute each matrix in less than 10 nanoseconds using scalar FMA units. In fact, one can use SIMD instructions to compute each matrix in only a few nanoseconds.
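
A rough way to see this fixed per-call overhead (a minimal sketch of my own, not from the original answer) is to time a single 2x2 product, where the arithmetic itself is negligible:

import timeit
import numpy as np

a = np.random.random((2, 2))
b = np.random.random((2, 2))

# Only 12 floating-point operations are needed here, yet a single call
# typically takes on the order of a microsecond: virtually all of that
# time is NumPy call/iterator overhead, not arithmetic.
n = 100_000
t = timeit.timeit(lambda: a @ b, number=n)
print(f"{t / n * 1e9:.0f} ns per 2x2 matmul call")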

First things first, we can remove most of the overhead of the internal Numpy iterator by doing the matrix multiplication ourselves, operating on vectors along the first axis. This is exactly what explicit_2x2_matrices_multiplication does!

While explicit_2x2_matrices_multiplication should be significantly faster, it is still sub-optimal: it performs non-contiguous memory reads, creates several useless temporary arrays, and each Numpy call introduces a small overhead. A faster solution is to write Numba/Cython code so the underlying compiler can generate a very good sequence of instructions optimized for 2x2 matrices. Good compilers can even generate SIMD instructions in this case, which is something Numpy cannot do. Here is the resulting code:

import numba as nb
import numpy as np

@nb.njit('(float64[:,:,::1], float64[:,:,::1])')
def compute_fastest(matrices_a, matrices_b):
    assert matrices_a.shape == matrices_b.shape
    assert matrices_a.shape[1] == 2 and matrices_a.shape[2] == 2

    n = matrices_a.shape[0]
    result_matrices = np.empty((n, 2, 2))

    # Loop over the batch of matrices; the fixed 2x2 inner loops are
    # compiled to a tight instruction sequence with no per-matrix overhead.
    for k in range(n):
        for i in range(2):
            for j in range(2):
                result_matrices[k,i,j] = matrices_a[k,i,0] * matrices_b[k,0,j] + matrices_a[k,i,1] * matrices_b[k,1,j]

    return result_matrices
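
A minimal usage check (my addition, assuming matrices_a and matrices_b from the question are in scope; both are C-contiguous float64 arrays, which the signature above requires):

result = compute_fastest(matrices_a, matrices_b)
assert np.allclose(result, matrices_a @ matrices_b)

Because the signature is passed explicitly to nb.njit, compilation happens when the function is defined, so subsequent timings do not include JIT compilation time.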

Performance results

Here are the performance results on my machine with an i5-9600KF CPU for 1000x2x2 matrices:

Naive einsum:                           214   µs
matrices_a @ matrices_b:                102   µs
explicit_2x2_matrices_multiplication:    24   µs
compute_fastest:                          2.7 µs   <-----

Discussion

The Numba implementation reaches 4.5 GFlops. Each matrix is computed in only 12 CPU cycles (2.7 ns)! My machine is able to reach up to 300 GFlops in practice (theoretically 432 GFlops), but only 50 GFlops with 1 core and 12.5 GFlops using scalar code (theoretically 18 GFlops). The granularity of the operation is too small for multiple threads to be useful (the overhead of spawning threads is at least a few microseconds). Besides, SIMD code can hardly saturate the FMA units because the input data layout requires SIMD shuffles, so 50 GFlops is actually an optimistic upper bound. As a result, we can safely say that the Numba implementation is a pretty efficient one. Still, faster code could be written thanks to SIMD instructions (I would expect a speed-up of about x2 in practice). That being said, writing native code using SIMD intrinsics, or helping the compiler generate fast SIMD code, is really far from easy (not to mention the final code would be ugly and hard to maintain). Thus, a SIMD implementation might not be worth the effort.
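
For clarity, here is the arithmetic behind those throughput figures, using the 2.7 µs measurement above (my own back-of-the-envelope sketch, not part of the original answer):

n_matrices = 1000
flops_per_matrix = 12          # 8 multiplications + 4 additions per 2x2 product
elapsed = 2.7e-6               # seconds for the whole batch, from the benchmark above
print(elapsed / n_matrices * 1e9)                     # ~2.7 ns per matrix
print(n_matrices * flops_per_matrix / elapsed / 1e9)  # ~4.4 GFlop/s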

Answer 2

Score: 1

Verification of the rough linearity of matmul on the first, 'batch' dimension:

Your (1000,2,2) arrays:

In [353]: timeit matrices_a@matrices_b
251 µs ± 767 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

and with half and tenth size:

In [354]: timeit matrices_a[:500]@matrices_b[:500]
129 µs ± 783 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [355]: timeit matrices_a[:100]@matrices_b[:100]
28.7 µs ± 532 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Your explicit function:

In [360]: explicit_2x2_matrices_multiplication(matrices_a, matrices_b).shape
Out[360]: (1000, 2, 2)
In [361]: timeit explicit_2x2_matrices_multiplication(matrices_a, matrices_b)
59.9 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

np.einsum doesn't try any reordering or other optimization:

In [362]: print(np.einsum_path('ijk,ikl->ijl', matrices_a, matrices_b, optimize='greedy')[1])
  Complete contraction:  ijk,ikl->ijl
         Naive scaling:  4
     Optimized scaling:  4
      Naive FLOP count:  1.600e+04
  Optimized FLOP count:  1.600e+04
   Theoretical speedup:  1.000
  Largest intermediate:  4.000e+03 elements
--------------------------------------------------------------------------
scaling                  current                                remaining
--------------------------------------------------------------------------
   4                ikl,ijk->ijl                                 ijl->ijl
