Cuda implementation unexpectedly too slow with Numba (Python)

huangapple go评论69阅读模式
英文:

Cuda implementation unexpectedly too slow with Numba (Python)

问题

I've translated the content you provided. Here's the translated text:

我正在尝试优化机器人领域中随机正向运动学方程的性能。从本质上讲,这个计算函数接受6个关节值作为输入,并产生一个4x4的数组,表示机器人末端执行器的位置和方向。数组的每个元素可以独立计算,并涉及对6个输入值的大量乘法、正弦和余弦运算的组合。更进一步地,考虑到N个配置,每个配置有6个关节,核心函数将接收一个(N,6)的输入数组,并返回一个(N,4,4)的数组。在本帖的末尾提供了工作示例。

我的目标是比较这些不同实现的性能,并找出使用Numba加速正向运动学计算时的速度限制。我还包括了对2_000_000 x 6数组的执行时间:

  • @numba.guvectorize(使用target=cpu/parallel/cuda
    • cpu:92毫秒
    • parallel:21毫秒
    • cuda:200毫秒(仅处理时间)
  • @cuda.jit:160毫秒(仅处理时间)

我正在寻求改进我的CUDA实现以提高正向运动学计算性能的建议。关于基于CUDA的方法:

  • @guvectorizetarget="cuda":我认为这个实现已经优化,因为它是使用最小的输入维度编写的,最大化了CUDA的工作线程数。@guvectorize装饰器对进一步改进提供的灵活性似乎不大。

  • @cuda.jit:我怀疑这个实现还有改进的空间,因为我对CUDA编程相对不熟悉。以下是我考虑过的一些想法:

    • [已实现] 启动N个具有16个线程的块,并使用条件语句将每个线程分配给计算4x4输出数组中的特定值。但我知道使用条件语句会破坏CUDA范例,可能不是高效的方法。

    • 启动单独的内核来计算4x4数组中的每个值。这将消除所有条件语句。但由于大量内核启动,这种方法可能引入显著的开销。

因此:

  • 这种类型的计算适合GPU吗?我有疑虑。虽然它在计算上很密集,但输出数组(4x4)的每个位置需要执行完全不同的指令。
  • 关于使用@guvectorizetarget="cuda"的实现,有没有可能使它更快?
  • 关于使用@cuda.jit的实现,更有经验的人通常会如何处理这种类型的计算?

使用的工作代码(需要安装Numba和cuda):

from time import perf_counter

import numpy as np
from numba import bool_, cuda, float64, guvectorize

# 以下是你提供的Python代码,请注意这只是代码的翻译部分
# 如果你需要特定的帮助或解释,请告诉我。

如果你需要进一步的翻译或解释,请告诉我。

英文:

I am attempting to optimize the performance of a random Forward Kinematics equation in the field of robotics. In essence, this computational function accepts 6 joint values as input and produces a 4x4 array that represents the position and orientation of the robot's end effector. Each element of the array can be computed independently and involves a large combination of multiplications, sines, and cosines applied to the 6 input values. To generalize further, considering N configurations, each with 6 joints, the core function will receive a (N,6) input array and return a (N,4,4) array. Working examples are provided in the end of this post.

My goal is to compare the performance of these different implementations and find the speed limit when accelerating the forward kinematics calculation with Numba. I also included tiem execution for a 2_000_000 x 6 array:

  • @numba.guvectorize (with target=cpu/parallel/cuda)
    • cpu: 92ms
    • parallel: 21ms
    • cuda: 200ms (only processing time)
  • @cuda.jit: 160ms (only processing time)

I am seeking suggestions to improve the performance of my CUDA implementation for the forward kinematics calculation. About the CUDA-based approaches:

  • @guvectorize with target="cuda": I believe this implementation is already optimized as it is written with the smallest input dimensions, maximizing the number of workers for CUDA. The @guvectorize decorator doesn't provide much flexibility for further improvements I think.

  • @cuda.jit: I suspect there is room for improvement in this implementation, as I am relatively new to CUDA programming. Here are a few ideas I have considered:

    • [IMPLEMENTED ONE] Launching N blocks with 16 threads each and using conditional statements to assign each thread to compute a specific value in the 4x4 output array. However, I am aware that using conditional statements breaks the CUDA paradigm and may not be efficient.

    • Launching separate kernels to compute each value in the 4x4 array. This would remove all if statements. However, this approach may introduce significant overhead due to the large number of kernel launches.

Therefore:

  • Is this type of computation suitable for GPU? I have my doubts. Although it is computationally intensive, each position of the output array (4x4) requires the execution of totally different instructions.
  • About the implementation with @guvectorize with target="cuda", is there a possibility to make it quicker?
  • About the implementation with @cuda.jit, what is the common approach people with more experience would address these type of computations?

Working code used (requires Numba and cuda installed):

from time import perf_counter

import numpy as np
from numba import bool_, cuda, float64, guvectorize


@guvectorize(
    [(float64[:], bool_[:, :], float64[:, :])],
    "(n), (m,m) -> (m,m)",
    nopython=True,
    target="cuda",
)
def guvectorized(joints, dummy, res):
    q1, q2, q3, q4, q5, q6 = joints
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)

    res[0,0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
        s1 * c4 - c1 * c23 * s4
    )
    res[1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
        c1 * c4 - s1 * c23 * s4
    )
    res[2,0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    res[0,1] = s6 * (
        c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
    )
    res[1,1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
        c1 * c4 + s1 * c23 * s4
    )
    res[2,1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    res[0,2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    res[1,2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    res[2,2] = s23 * c4 * c5 - c23 * c5
    res[0,3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[1,3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
        a1 + a2 * c2 + a3 * c23 + d4 * s23
    )
    res[2,3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    res[3, 0] = 0.0
    res[3, 1] = 0.0
    res[3, 2] = 0.0
    res[3, 3] = 1.0

@cuda.jit
def cuda_jit(joints, res):
    i = cuda.threadIdx.x
    block_id = cuda.blockIdx.x
    q1, q2, q3, q4, q5, q6 = joints[block_id]
    d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
    c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
    s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
    s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)


    if i == 0:
        res[block_id, 0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
            s1 * c4 - c1 * c23 * s4
        )
    elif i == 1:
        res[block_id, 1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
            c1 * c4 - s1 * c23 * s4
        )
    elif i == 2:
        res[block_id, 2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
    elif i == 3:
        res[block_id, 0, 1] = s6 * (
            c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
        )
    elif i == 4:
        res[block_id, 1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
            c1 * c4 + s1 * c23 * s4
        )
    elif i == 5:
        res[block_id, 2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6
    elif i == 6:
        res[block_id, 0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
    elif i == 7:
        res[block_id, 1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
    elif i == 8:
        res[block_id, 2, 2] = s23 * c4 * c5 - c23 * c5
    elif i == 9:
        res[block_id, 0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 10:
        res[block_id, 1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
            a1 + a2 * c2 + a3 * c23 + d4 * s23
        )
    elif i == 11:
        res[block_id, 2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
    elif i == 12:
        res[block_id, 3, 0] = float64(0.0)
        res[block_id, 3, 1] = float64(0.0)
        res[block_id, 3, 2] = float64(0.0)
        res[block_id, 3, 3] = float64(1.0)


if __name__ == "__main__":
    # Setup
    N = 1_000_000
    # Guvectorized arrays
    input_arr = np.random.rand(N, 6)
    dummy_arr = np.empty((4, 4), dtype=np.bool_)
    out = guvectorized(input_arr, dummy_arr)

    # Cuda.jit arrays
    input_arr_cu = cuda.to_device(input_arr)
    output_arr_cu = cuda.device_array((N, 4, 4))
    threads_per_block = 13
    blocks_per_grid = N
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)

    # Timing
    #   Guvectorize
    init = perf_counter()
    guvectorized(input_arr, dummy_arr)
    print(f"Time `guvectorize`: {perf_counter() - init}")

    #   Cuda jit
    init = perf_counter()
    cuda_jit[blocks_per_grid, threads_per_block](input_arr_cu, output_arr_cu)
    cuda.synchronize()
    print(f"Time `cuda.jit`: {perf_counter() - init}")

答案1

得分: 1

通常情况下,CUDA编程范式希望每个线程都执行相同的操作,包含大量的浮点运算和高计算与内存带宽比例。如果你满足这些条件,那么你的用例有很大可能性从使用GPU中受益。根据这些参考点来看,你的用例似乎不是天然不适合使用GPU,但你的代码情况另当别论。

我建议从类似这样的代码开始:

@cuda.jit
def cuda_jit(joints, res):
    # 线程ID
    Tid = numba.cuda.grid(1)

    # ...(以下为一系列计算,创建结果矩阵的条目)...

# 注意:此代码仅为示例,实际使用时需要根据你的需求进行调整。

这样的代码将消除所有分支分歧,让你更充分地利用GPU硬件,前提是选择一个明智的块大小(基本上是32的倍数,详细信息请参阅这里)。

这是否是最优解?不是,因为存在内存合并问题。要解决这个问题,需要进行更复杂的优化,但是我怀疑你目前可能不需要这么复杂的优化。但是这应该能让你的性能比现在好得多。

英文:

In general, the CUDA programming paradigm likes every thread to do the same thing, with lots of floating point operations and a high computation to memory bandwidth ratio. If you have that, then your use case stands a good chance of seeing some benefit from using a GPU. Your use case doesn’t look like an intrinsically poor fit, using those points of reference. Your code, however, is another story.

I would start with something like this:

@cuda.jit
def cuda_jit(joints, res):
Tid = numba.cuda.grid(1)
q1, q2, q3, q4, q5, q6 = joints[Tid]
d1, a1, a2, a3, d4, d6 = float64(10.), float64(2.), float64(3.), float64(3.), float64(2.), float64(5.)
c1, c2, c4, c5, c6 = np.cos(q1), np.cos(q2), np.cos(q4), np.cos(q5), np.cos(q6)
s1, s2, s4, s5, s6 = np.sin(q1), np.sin(q2), np.sin(q4), np.sin(q5), np.sin(q6)
s23, c23 = np.sin(q3 + q2), np.cos(q2 + q3)
res[Tid, 0, 0] = c6 * (c5 * (c1 * c23 * c4 + s1 * s4) - c1 * s23 * s5) + s6 * (
s1 * c4 - c1 * c23 * s4
)
res[Tid, 1, 0] = c6 * (c5 * (s1 * c23 * c4 + c1 * s4) - s1 * s23 * s5) - s6 * (
c1 * c4 - s1 * c23 * s4
)
res[Tid, 2, 0] = c6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * s6
res[Tid, 0, 1] = s6 * (
c1 * s23 * s5 - c5 * (s1 * c23 * c4 + s1 * s4) + c6 * (s1 * c4 - c1 * c23 * s4)
)
res[Tid, 1, 1] = s6 * (s1 * s23 * s5 - c5 * (s1 * c23 * c4 + c1 * s4)) - c6 * (
c1 * c4 + s1 * c23 * s4
)
res[Tid, 2, 1] = -s6 * (c23 * s5 + s23 * c4 * c5) - s23 * s4 * c6  
res[Tid, 0, 2] = s5 * (c1 * c23 * c4 + s1 * s4) + c1 * s23 * c5
res[Tid, 1, 2] = s5 * (s1 * c23 * c4 - c1 * s4) + s1 * s23 * c5
res[Tid, 2, 2] = s23 * c4 * c5 - c23 * c5          
res[Tid, 0, 3] = d6 * (s5 * (c1 * c23 * c4 + s1 + s4) + c1 * s23 * c5) + c1 * (
a1 + a2 * c2 + a3 * c23 + d4 * s23
)
res[Tid, 1, 3] = d6 * (s5 * (s1 * c23 * c4 + c1 * s4) + s1 * s23 * c5) + s1 * (
a1 + a2 * c2 + a3 * c23 + d4 * s23
)
res[Tid, 2, 3] = a2 * s2 + d1 + a3 * s23 - d4 * c23 + d6 * (s23 * c4 * s5 - c23 * c5)
res[Tid, 3, 0] = float64(0.0)
res[Tid, 3, 1] = float64(0.0)
res[Tid, 3, 2] = float64(0.0)
res[Tid, 3, 3] = float64(1.0)

[code written on a phone, no guarantees implied, use at own risk]

I.e. just have each thread create all the entries of one result matrix. This will get rid of all the branch divergence and let you achieve much higher utilisation of the GPU hardware if you select a sensible block size (basically round multiples of 32, see here for more details).

Is it optimal? No, because of memory coalescing issues. The next levels of optimisation to fix that are hard (and the Numba compiler is very feature poor compared to the NVidia C++ compiler), and I suspect you don’t want hard at this point. But that should get you much better performance than you are getting now.

huangapple
  • 本文由 发表于 2023年6月15日 15:56:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76480284.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定