CUDA AtomicCAS Deadlock
Question
I have an array matrix with values of 0, and I want to increment some of its elements by 1. The indices of matrix which I want to increment are stored in the array indices. I need to increment some elements several times, so I'm trying to use an array of mutexes, one for each element of matrix. But when I launch my code, the program hangs and I get a deadlock.
I'm stuck with this issue. What I ultimately want to do is draw a continuous brush stroke that overlaps itself using CUDA, so I need to access the same pixels of the canvas image in parallel.
Here is my code:
#include <cstring>
#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id
    int ind = indices[index]; // index of target array A to increment
    if (index < nof_indices) {
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0);
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of array B
    int indices[6] = { 0, 1, 2, 3, 4, 1 }; // array B; stores indices of array A which to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with individual mutexes for each of array A's elements
    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all mutex array elements to 0

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, &canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, &indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, &semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel<<<1, 6>>>(d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(&canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);
    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);
    return 0;
}
In this example the resulting array matrix should look like this: {1, 2, 1, 1, 1, 0}, but I only get that result when I launch the kernel with dimensions <<<6, 1>>>.
I'm using CUDA 12.1 on a GeForce RTX 3060.
Thank you
(It only works when I set the threads-per-block size to 1, but that's not what I want.)
Answer 1
Score: 3
In a pre-Volta execution model, this line of code is/would have been problematic:
while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
The topic is addressed generally in this blog "Independent Thread Scheduling" and also in various SO questions such as this one and this one.
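(As an aside, not from the original answer: the pattern those discussions usually recommend for pre-Volta semantics keeps the acquire, the critical section, and the release inside a single conditional body, so the lock holder always makes progress before the warp reconverges. A minimal sketch applied to this kernel, under the hypothetical name add_kernel_safe:)
// Sketch, not the original code: acquire, critical section, and release
// all sit inside one conditional, which is the commonly recommended
// per-element lock pattern even under pre-Volta scheduling semantics.
__global__ void add_kernel_safe(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < nof_indices) {
        int ind = indices[index];
        bool done = false;
        while (!done) {
            if (atomicCAS(&d_semaphores[ind], 0, 1) == 0) { // try to take the lock
                matrix[ind] += 1;                           // critical section
                __threadfence();                            // publish the write before releasing
                atomicExch(&d_semaphores[ind], 0);          // release the lock
                done = true;
            }
        }
    }
}
For a plain increment like the one in the question, atomicAdd(&matrix[ind], 1) would of course avoid the per-element lock entirely.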
However, as indicated in the blog (and elsewhere), the Volta execution model should allow more flexible paradigms. I believe the problem here is arising due to a feature of nvcc:
> To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.
If you compile for a pre-Volta architecture, you are indicating to the compiler that you want pre-Volta semantics. This may have an effect on the execution behavior of your code, for example in the case where you are executing on a Volta or newer architecture but compiling for a pre-Volta target.
According to my testing, the code deadlocks on sm_75 if I compile using default switches on CUDA 12.1, which by default selects an sm_52 target (including PTX). However, if I compile for an sm_75 target, the code runs "normally".
I think your code will not deadlock on your RTX 3060 if you compile for a Volta or newer target. Unless you have a reason not to, a general recommendation is to compile specifying the target(s) you wish to run on.
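For example (a sketch, not from the original answer; the source file name add.cu is assumed, and the RTX 3060 is compute capability 8.6):
# Assumed source file name add.cu; the RTX 3060 is compute capability 8.6
nvcc -arch=sm_86 -o add add.cu
# Or keep a Volta-baseline PTX fallback alongside the sm_86 binary:
nvcc -gencode arch=compute_70,code=compute_70 -gencode arch=compute_86,code=sm_86 -o add add.cu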