CUDA AtomicCAS Deadlock

Question

I have an array matrix whose values are all 0, and I want to increment some of its elements by 1. The indices of matrix that I want to increment are stored in the array indices. Some elements need to be incremented several times, so I'm trying to use an array of mutexes, one for each element of matrix. But when I launch my code, the program hangs and I get a deadlock.

I'm stuck on this issue. What I ultimately want to do is draw a continuous brush stroke that overlaps itself using CUDA, so I need to access the same pixels of the canvas image in parallel.

Here is my code:

#include <iostream>
using namespace std;

__global__ void add_kernel(int* matrix, int* indices, int* d_semaphores, int nof_indices)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // thread id
    int ind = indices[index]; // index of target array A to increment

    if (index < nof_indices) {
        while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);
        matrix[ind] += 1;
        atomicExch(&d_semaphores[ind], 0);
        __syncthreads();
    }
}

int main()
{
    int nof_indices = 6; // length of array B
    int indices[6] = { 0,1,2,3,4,1 }; // array B; stores the indices of array A to increment
    int canvas[10]; // array A
    int semaphores[10]; // mutex array with an individual mutex for each element of array A

    int* d_canvas;
    int* d_indices;
    int* d_semaphores;

    memset(canvas, 0, sizeof(canvas)); // set all array A elements to 0
    memset(semaphores, 0, sizeof(semaphores)); // set all mutex array elements to 0

    cudaMalloc(&d_canvas, sizeof(canvas));
    cudaMalloc(&d_semaphores, sizeof(semaphores));
    cudaMalloc(&d_indices, sizeof(indices));

    cudaMemcpy(d_canvas, &canvas, sizeof(canvas), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indices, &indices, sizeof(indices), cudaMemcpyHostToDevice);
    cudaMemcpy(d_semaphores, &semaphores, sizeof(semaphores), cudaMemcpyHostToDevice);

    add_kernel<<<1, 6>>>(d_canvas, d_indices, d_semaphores, nof_indices);

    cudaMemcpy(&canvas, d_canvas, sizeof(canvas), cudaMemcpyDeviceToHost);

    for (int it = 0; it < nof_indices; it++) {
        cout << canvas[it] << endl;
    }

    cudaFree(d_canvas);
    cudaFree(d_indices);
    cudaFree(d_semaphores);

    return 0;
}

In this example the resulting array matrix should look like this: {1, 2, 1, 1, 1, 0}, but I only get that result when I run the kernel with dimensions <<<6, 1>>>.

I'm using CUDA 12.1 and a GeForce RTX 3060.

Thank you

(It only works when I set the threads-per-block size to 1, but that's not what I want.)

Answer 1

Score: 3

In a pre-Volta execution model, this line of code is (or would have been) problematic:

    while (atomicCAS(&d_semaphores[ind], 0, 1) != 0);

The topic is addressed generally in this blog "Independent Thread Scheduling" and also in various SO questions such as this one and this one.
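As an illustration of the restructuring those references commonly suggest (a minimal sketch, not taken from the original answer; the kernel name add_kernel_locked is hypothetical): keeping the critical section inside the branch that wins the CAS lets the lock holder finish and release before the loop spins again, although pre-Volta scheduling still offers no formal forward-progress guarantee.

    __global__ void add_kernel_locked(int* matrix, int* indices, int* d_semaphores, int nof_indices)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < nof_indices) {
            int ind = indices[index];
            bool done = false;
            while (!done) {
                if (atomicCAS(&d_semaphores[ind], 0, 1) == 0) { // try to acquire the per-element lock
                    matrix[ind] += 1;                           // critical section
                    __threadfence();                            // make the write visible before releasing
                    atomicExch(&d_semaphores[ind], 0);          // release the lock
                    done = true;
                }
            }
        }
    }

For an increment-only update like this one, a plain atomicAdd(&matrix[ind], 1) would avoid the lock entirely.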

However, as indicated in the blog (and elsewhere), the Volta execution model should allow more flexible paradigms. I believe the problem here arises due to a feature of nvcc:

> To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

If you compile for a pre-Volta architecture, you are indicating to the compiler that you want pre-Volta semantics. This may affect the execution behavior of your code, for example when you are executing on a Volta or newer architecture but compiling for a pre-Volta target.

According to my testing, the code deadlocks on sm_75 if I compile using default switches on CUDA 12.1, which by default selects an sm_52 target (including PTX). However, if I compile for an sm_75 target, the code runs "normally".

I think your code will not deadlock on your RTX 3060 if you compile for a Volta or newer target. Unless you have a reason not to, a general recommendation is to compile specifying the target(s) you wish to run on.
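For reference, a hedged example of what that could look like here (the source file name is hypothetical; the RTX 3060 is compute capability 8.6, so sm_86 is its native target, while sm_52 matches the CUDA 12.1 default mentioned above):

    # native target for an RTX 3060 (compute capability 8.6): Volta-or-newer semantics
    nvcc -arch=sm_86 add_kernel.cu -o add_kernel

    # pre-Volta target, matching the CUDA 12.1 default noted above
    nvcc -arch=sm_52 add_kernel.cu -o add_kernel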
