Why can't I launch concurrent kernels with the same pointer?


Question

When launching concurrent kernels with the same data pointer, I get an "error reading location" exception.

I can easily fix the problem with cudaDeviceSynchronize(), but I want to understand what goes wrong.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

struct Small {
    int a = 0;
};

struct Big {
    Small* small;
};

__global__ void kernelA(Small* small) {
    // Do nothing
}
__global__ void kernelB(Small* small) {
    // Do nothing
}

void myFunc(Big* big) {
    kernelA<<<10, 10>>>(big->small);
    //cudaDeviceSynchronize();
    kernelB<<<10, 10>>>(big->small);
}

int main()
{
    Big* big;
    cudaMallocManaged(&big, sizeof(Big));
    cudaMallocManaged(&big->small, sizeof(Small));

    myFunc(big);

    return 0;
}

If I uncomment the cudaDeviceSynchronize(); line it works, but I don't understand why it doesn't work without it.

I have read this question, which is similar but not quite the same: https://stackoverflow.com/questions/30706945/cuda-kernel-concurrency-with-common-shared-argument

So what is going wrong? Why can't two kernels take the same pointer?

Additional information

I am compiling with Visual Studio on Windows, targeting compute_52,sm_52.
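
(For reference, an approximately equivalent nvcc command line for this target, assuming a source file named kernel.cu, would be:)

    nvcc -gencode arch=compute_52,code=sm_52 kernel.cu -o kernel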


Answer 1

Score: 2

You appear to be on Windows, which is a pre-Pascal regime for managed memory, regardless of which GPU you are running on.

In this regime, as soon as you launch a kernel, all managed allocations become inaccessible to host code. Now, big itself is a location in host memory, so retrieving the pointer value of big is always legal in host code. But what big points to is, in your case, managed memory. To get big->small, it is necessary to dereference big and read a value at an offset from that pointer. That dereference touches a managed allocation, which is illegal in host code (in a pre-Pascal regime, after a kernel launch, before any cudaDeviceSynchronize()). The kernel launch process involves host-code retrieval of the kernel arguments, and that is exactly where evaluating big->small for the second launch performs the illegal access.

Therefore you are going to hit a seg fault (surfaced by Visual Studio as the "error reading location" access violation you observed).

And this doesn't have much to do with the "same pointer". It has to do with a managed allocation that must be accessed from host code, in a pre-Pascal regime, after a kernel has been launched.
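
If you want to confirm which regime a particular system is in, the runtime exposes it as a device attribute. A minimal sketch, assuming device 0 (on Windows this should report 0 regardless of GPU, matching the behavior described above):

#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    int concurrent = 0;
    // 0 in the pre-Pascal regime (e.g. any GPU under Windows),
    // 1 where host code may touch managed memory while kernels are running.
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, 0);
    printf("concurrentManagedAccess = %d\n", concurrent);
    return 0;
}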

Some fix ideas:

  • As you already discovered, put a cudaDeviceSynchronize() in between, which restores host-code accessibility to the managed allocation.

  • Switch to a demand-paged managed-memory environment (Linux, on a Pascal or newer GPU).

  • Capture the pointer in question in a host variable, so that no dereference of a managed pointer is required in host code at the point of the second kernel launch:

    Small *temp = big->small;
    kernelA<<<10, 10>>>(temp);  // or (big->small); here
    kernelB<<<10, 10>>>(temp);  // must use temp here
    

As an aside, not having a cudaDeviceSynchronize() or other synchronizing call (such as cudaMemcpy()) after the last kernel launch means the process may be torn down before your kernels actually execute. I assume you're just showing a minimal example here, but I would still advise including one.
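
For completeness, here is one way the suggestions above might be put together; a sketch only, combining the host-variable capture from the last bullet with a final cudaDeviceSynchronize() and a basic error check:

#include <cstdio>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

struct Small { int a = 0; };
struct Big   { Small* small; };

__global__ void kernelA(Small* small) { /* Do nothing */ }
__global__ void kernelB(Small* small) { /* Do nothing */ }

void myFunc(Big* big) {
    Small* temp = big->small;   // dereference the managed pointer once, before any launch
    kernelA<<<10, 10>>>(temp);
    kernelB<<<10, 10>>>(temp);  // no host dereference of managed memory here
}

int main()
{
    Big* big;
    cudaMallocManaged(&big, sizeof(Big));
    cudaMallocManaged(&big->small, sizeof(Small));

    myFunc(big);

    cudaDeviceSynchronize();    // make sure the kernels actually run before the process exits
    printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(big->small);       // host access to managed memory is legal again after the sync
    cudaFree(big);
    return 0;
}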

