什么是一个好的CUDA单线程排序算法?

huangapple go评论65阅读模式
英文:

What is a good CUDA single threaded sorting algorithm?

问题

我有一个 CUDA 程序,每个线程都必须对一个小数组(N=49)进行排序。目前我正在使用来自[GeeksforGeeks](https://www.geeksforgeeks.org/heap-sort/)的简单堆排序算法,代码如下:

```cpp
__global__ void kernel(int N){
    ...
    double *x = new double[N];
    fillArray(x, N);
    sort(x, N);

    double y = doSomething(x, N);
    delete[] x;
    ...
}

问题是,线程束分歧很大,排序函数中非活动线程的总数超过80%,这是由于算法中的条件语句所致。

每个线程都有自己的数组,与下一个线程的数组完全不同,这意味着在具有32个线程的线程束中,我几乎肯定会触发每个条件语句。

我的问题是,哪种排序算法能最小化线程束分歧,同时具有 O(1) 的内存复杂度,以便适用于单个 CUDA 线程?

大多数现有的 CUDA 排序示例侧重于使用多个线程和块对大型数组进行排序,但这并不是我在这里寻找的。


<details>
<summary>英文:</summary>

I have a cuda program where each thread has to sort a small array (N=49). Currently I&#39;m using a simple heap sort algorithm from [GeeksforGeeks](https://www.geeksforgeeks.org/heap-sort/) like this:


global void kernel(int N){
...
double *x = new double[N];
fillArray(x, N);
sort(x, N);

double y = doSomething(x, N);
delete[] x;
...

}


The problem is that the warp divergence is large and the total amount of inactive threads in the sorting function is more than 80%, due to the conditional statements in the algorithm.

Each thread has its own array which is completely different from the next thread&#39;s array, which means that I&#39;m almost guaranteed to hit every conditional statement in a warp with 32 threads.

My question is which sorting algorithm minimizes the warp divergence and has memory O(1), so that it is suitable for a single CUDA thread?

Most existing cuda sorting examples focus on sorting huge arrays with multiple threads and blocks, but that is not what I am looking for here. 

</details>


# 答案1
**得分**: 1

我尚未找到一个好的解决方案,但我尝试了一些用于 N=49 的排序算法:bitonicSort、thrust::sort、insertionSort 和 bubbleSort。BitonicSort 需要数组大小为 2^k,因此大小为 49 的数组需要一些填充。以下是我使用这些排序算法进行的程序定时基准测试,但请注意它们可能不代表每种用例的性能:

| 算法 | 时间(秒) |
| -------- | -------------- |
| heapSort    | 46            |
| bitonicSort   | 55           |
| thrust::sort   | 80           |
| insertion sort   | 48           |
| bubble sort   | 55           |

结果相当令人失望。对于整数来说,RadixSort 可能是最快的,但我必须使用双精度,所以无法使用它。Cub 中的排序算法需要辅助内存,我想要避免使用它们(因为我受到内存限制)。当我找到更快的解决方案时,我会更新这个答案。

<details>
<summary>英文:</summary>

I haven&#39;t still found a good solution, but I tried some sorting algorithms for N=49: bitonicSort, thrust::sort, insertionSort and bubbleSort. BitonicSort requires arrays with sizes of 2^k and thus the array of 49 needs some padding.
Here are some timing benchmarks of my program using these sorting algorithms, but note that they may not be representative of every usecase:

| Algorithm | Time(s) |
| -------- | -------------- |
| heapSort    | 46            |
| bitonicSort   | 55           |
| thrust::sort   | 80           |
| insertion sort   | 48           |
| bubble sort   | 55           |

The results are quite disappointing.
RadixSort would probably be fastest for integers, but I have to use doubles so I cannot use it. The sorting algorithms in Cub need auxiliary memory, which I want to avoid (because I am memory limited).

When I find something faster, I will update this answer.

</details>



huangapple
  • 本文由 发表于 2023年6月19日 19:59:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76506418.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定