Why can't I launch concurrent kernels with the same pointer?
Question
When launching concurrent kernels with the same data pointer, I get an "error reading location" exception.
I can easily fix this problem with cudaDeviceSynchronize(), but I want to understand what goes wrong.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

struct Small {
    int a = 0;
};

struct Big {
    Small* small;
};

__global__ void kernelA(Small* small) {
    // Do nothing
}

__global__ void kernelB(Small* small) {
    // Do nothing
}

void myFunc(Big* big) {
    kernelA<<<10, 10>>>(big->small);
    //cudaDeviceSynchronize();
    kernelB<<<10, 10>>>(big->small);
}

int main()
{
    Big* big;
    cudaMallocManaged(&big, sizeof(Big));
    cudaMallocManaged(&big->small, sizeof(Small));
    myFunc(big);
    return 0;
}
If I uncomment "cudaDeviceSynchronize();" it works, but I don't understand why it won't work without it.
I have read this question, which is similar but not quite the same: https://stackoverflow.com/questions/30706945/cuda-kernel-concurrency-with-common-shared-argument
So what is going wrong, and why can't two kernels take the same pointer?
Additional information
I am using Visual Studio on Windows, compiling for compute_52,sm_52.
Answer 1
Score: 2
You appear to be on Windows, which is a pre-Pascal regime for managed memory, regardless of which GPU you are running on.
In this modality, as soon as you launch a kernel, all managed allocations become inaccessible to host code. Now, big itself is a location in host memory, so retrieving the pointer value of big is always legal in host code. But what big points to in your case is managed memory. To get that value, it is necessary to dereference big and retrieve a value at an offset from that pointer. That dereference operation accesses a managed allocation, which is illegal in host code (in a pre-Pascal regime, after a kernel launch, before any cudaDeviceSynchronize()). The kernel launch process involves host-code retrieval of the kernel arguments.
Therefore you are going to hit a seg fault.
And this doesn't have much to do with the "same pointer". It has to do with managed allocations and the rules for accessing them from host code in a pre-Pascal regime.
Some fix ideas:
- As you already discovered, put a cudaDeviceSynchronize() in between, which restores host-code accessibility to the managed allocation.

- Switch to a demand-paged managed environment (Linux, on a Pascal or newer GPU).

- Capture the pointer in question in a host variable, so that dereferencing a managed pointer is not required for host-code access at the point of the second kernel launch:

      Small *temp = big->small;
      kernelA<<<10, 10>>>(temp); // or (big->small); here
      kernelB<<<10, 10>>>(temp); // must use temp here
As an aside, not having a cudaDeviceSynchronize() or other synchronizing function (like cudaMemcpy()) after the last kernel launch means that the process may be torn down before your kernels actually execute, but I assume you're just showing a minimal example here. Still, I would advise including one.
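Putting these pieces together, a corrected version of the posted program might look like the sketch below. It combines the host-variable capture (fix 3) with a final cudaDeviceSynchronize(); the error-string printout at the end is my addition for illustration, not part of the original code.

```cuda
#include <cstdio>
#include "cuda_runtime.h"

struct Small {
    int a = 0;
};

struct Big {
    Small* small;
};

__global__ void kernelA(Small* small) { /* Do nothing */ }
__global__ void kernelB(Small* small) { /* Do nothing */ }

void myFunc(Big* big) {
    // Read the managed pointer once, before any kernel launch,
    // while host access to the managed allocation is still legal.
    Small* temp = big->small;
    kernelA<<<10, 10>>>(temp);
    kernelB<<<10, 10>>>(temp); // no host dereference of 'big' needed here
}

int main()
{
    Big* big;
    cudaMallocManaged(&big, sizeof(Big));
    cudaMallocManaged(&big->small, sizeof(Small));
    myFunc(big);
    // Ensure the kernels actually run before the process exits,
    // and surface any launch/execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```

On a pre-Pascal-regime system this version runs without the "error reading location" exception, because the only host dereference of big happens before the first launch.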