Speed-up a variable-dependent matrix inverse computation in Python
Question
I have an iterative problem where at each iteration I need to perform a series of matrix operations, which I want to speed up. The majority of the calculation involves constant variables, except for one varying variable.
My question is: is it possible to build a "skeleton" of this operation, so as to speed up the calculation in the iterations after the first? (somewhat similar to the parametric warm start of optimization problems in CVXPY).
The code I would need to work on currently lives in a dedicated method of a class, and it looks something like this:
import numpy as np

# some constant definitions (in a class)
m = 5501
Y = np.random.rand(10, m)
U = np.random.rand(75, m)
sig_on = 0.15
sig_off = 0.25
a = 15
L = 25

def obtain_A_B(x):
    # for readability I omit internal variable declarations such as a = self.a
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = varying_part * np.identity(m) + Y.T @ Y
    Fi = np.linalg.inv(F)
    UFiUTi = np.linalg.inv(U @ Fi @ U.T)
    A = (Fi - Fi @ U.T @ UFiUTi @ U @ Fi) @ Y.T
    B = Fi @ U.T @ UFiUTi
    return A, B

x = np.random.rand(10)
result = obtain_A_B(x)
I am interested in speeding up the computation of A and B; 'varying_part' evaluates to a scalar that changes at each iteration with x.
Since the really time-consuming part here is computing the inverse Fi, speeding up just that computation would already be a great help.
I have already moved the inverses into dedicated variables (as written above) to reduce the number of times they need to be computed - that brought the execution time down from ~41 s to ~10 s - but it would be best if I could make the execution even faster (that's where the "operation skeleton" idea came from).
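A minimal timing sketch like the following can be used to measure one call (assuming the definitions above are in scope):

import time

x = np.random.rand(10)
t0 = time.perf_counter()
A, B = obtain_A_B(x)
print(f"obtain_A_B: {time.perf_counter() - t0:.2f} s")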
Answer 1
Score: 1
Basic optimizations
First of all, the expression of B appears inside the expression of A, so B can be computed first and reused when computing A.
Moreover, varying_part * np.identity(m) + Y.T @ Y can be computed more efficiently. Indeed, varying_part * np.identity(m) creates two big temporary arrays that are slow to fill, because RAM is slow nowadays compared to the performance of the CPU (see "Memory Wall"). You can just update the diagonal of the array in place instead.
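A quick micro-benchmark illustrates the difference (a sketch; absolute timings depend on the machine):

import numpy as np
import time

m = 5501
Y = np.random.rand(10, m)
c = 3.9  # stand-in for varying_part

t0 = time.perf_counter()
F = c * np.identity(m) + Y.T @ Y          # allocates two extra m x m temporaries
print("identity version:     ", time.perf_counter() - t0)

t0 = time.perf_counter()
F = Y.T @ Y                               # the Gram matrix is needed anyway
np.fill_diagonal(F, F.diagonal() + c)     # in-place update of m entries only
print("fill_diagonal version:", time.perf_counter() - t0)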
Additionally, it is significantly faster to compute B @ (U @ Fi) than (B @ U) @ Fi (which is what B @ U @ Fi evaluates to), even though the two are analytically equivalent since matrix multiplication is associative (the results differ slightly numerically).
Here is the optimized implementation:
def obtain_A_B_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = Y.T @ Y
    np.fill_diagonal(F, F.diagonal() + varying_part)  # in-place, no m x m temporaries
    Fi = np.linalg.inv(F)
    tmp = Fi @ U.T
    UFiUTi = np.linalg.inv(U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (U @ Fi)) @ Y.T  # reuse B and the cheaper product ordering
    return A, B
This optimization makes the code about 55% faster on my machine. Nearly 90% of the time is spent in the np.linalg.inv(F) function call.
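One way to confirm where the time goes is a quick profile (a sketch using the standard library's cProfile; assumes the definitions above are in scope):

import cProfile

cProfile.run("obtain_A_B_opt(np.random.rand(10))", sort="cumtime")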
The bad news is that the matrix inversion itself is very hard to optimize further on the CPU. The default OpenBLAS library that NumPy uses for this computation is sub-optimal, but it is still relatively good: it runs in parallel using fairly optimized native code. Moreover, changing the diagonal strongly impacts the whole computation, so I do not think it is possible to write an incremental computation of the inverse. Indeed, increasing F[0,0] by 1 forces the weights of all rows to be recomputed in the Gauss-Jordan algorithm (and hence all subsequent weight computations).
It is hard to write much faster code without reducing the floating-point precision or the accuracy of the algorithm (e.g. with approximations). GPUs can help significantly if you have a server GPU or if reducing the precision is acceptable for you.
GPU-based implementation
One can use a GPU to make this code faster, depending on the target platform and the input floating-point type. Discrete GPUs tend to have significantly more computational power than mainstream CPUs. However, this is not true for double precision on mainstream PC GPUs: PC GPUs are optimized for single-precision floating-point data (commonly used in games and 3D applications), and fast double-precision computation requires a server-grade GPU. Besides, one should keep in mind that data transfers between the CPU and the GPU are expensive; but the matrix inversion is the bottleneck and runs in O(m**3), while the transfers are only O(m**2), so they are negligible here. CuPy can be used to write a GPU implementation easily (for Nvidia GPUs). Here is the resulting GPU code:
import cupy as cp

def obtain_A_B_gpu_opt(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    # Send data to the GPU
    device_Y = cp.array(Y)
    device_U = cp.array(U)
    # GPU-based computation
    F = device_Y.T @ device_Y
    cp.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = cp.linalg.inv(F)
    tmp = Fi @ device_U.T
    UFiUTi = cp.linalg.inv(device_U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (device_U @ Fi)) @ device_Y.T
    # Get the results back from the GPU
    return A.get(), B.get()
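As a further tweak along the same lines, Y and U do not change between iterations, so one could transfer them to the GPU once and reuse the device copies (a variant sketch under that assumption; obtain_A_B_gpu_opt2 is a hypothetical name):

import cupy as cp

# One-time host-to-device transfers (Y and U are constant across iterations)
device_Y = cp.array(Y)
device_U = cp.array(U)

def obtain_A_B_gpu_opt2(x):
    varying_part = (a * sig_on / np.linalg.norm(x)) ** 2 + L * sig_off
    F = device_Y.T @ device_Y
    cp.fill_diagonal(F, F.diagonal() + varying_part)
    Fi = cp.linalg.inv(F)
    tmp = Fi @ device_U.T
    UFiUTi = cp.linalg.inv(device_U @ tmp)
    B = tmp @ UFiUTi
    A = (Fi - B @ (device_U @ Fi)) @ device_Y.T
    return A.get(), B.get()

The Gram matrix device_Y.T @ device_Y is constant as well and could be precomputed once too, at the cost of one m x m copy per call before the in-place diagonal update.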
Note that CuPy may be quite slow during the first execution, but it should be significantly faster afterwards (which is fine for an iterative loop like this one).
Experimental results
Here are results on my machine with an i5-9600KF CPU and an Nvidia 1660 Super GPU (a mainstream discrete PC GPU):
double precision:
- obtain_A_B: 4.10 s
- obtain_A_B_opt: 2.64 s
- obtain_A_B_gpu_opt: 3.03 s
single precision:
- obtain_A_B: 3.17 s
- obtain_A_B_opt: 2.52 s
- obtain_A_B_gpu_opt: 0.28 s <----
One can see that the GPU-based implementation is significantly faster with single-precision floating-point inputs, because that is what my GPU is optimized for. On a computing server, obtain_A_B_gpu_opt should be significantly faster than the other implementations. Note that cp.linalg.inv still takes more than 90% of the time on my GPU.
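For reference, the single-precision runs above can be set up by casting the inputs before calling the functions, along these lines (a sketch; float32 reduces the accuracy from roughly 16 to 7 significant digits):

# Single-precision setup (reduces accuracy)
Y = Y.astype(np.float32)
U = U.astype(np.float32)
x = np.random.rand(10).astype(np.float32)
A, B = obtain_A_B_gpu_opt(x)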