Error in profiling shared memory atomic kernel in Nsight Compute

Question

I am trying the global atomics vs. shared atomics code from the NVIDIA blog https://developer.nvidia.com/blog/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

But when I profile with the Nsight Compute CLI, it reports an error for the shared atomics kernel:

```
==PROF== Connected to process 16078
==PROF== Profiling "histogram_gmem_atomics" - 0: 0%....50%....100% - 1 pass
==PROF== Profiling "histogram_smem_atomics" - 1: 0%....50%....100% - 1 pass

==ERROR== LaunchFailed

==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== Found outstanding GPU clock reset, trying to revert...Success.
[16078] histogram@127.0.0.1
  histogram_gmem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__bytes.sum.per_second                                                Gbyte/second                          13,98
    ---------------------------------------------------------------------- --------------- ------------------------------

  histogram_smem_atomics(const IN_TYPE *, int, int, unsigned int *), 2023-Mar-09 12:55:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__bytes.sum.per_second                                                 byte/second                       (!) nan
    ---------------------------------------------------------------------- --------------- ------------------------------
```

Why does ncu show an error here? For reference, my main function looks like this:


```
#define NUM_BINS 480
#define NUM_PARTS 48

struct IN_TYPE
{
    int x;
    int y;
    int z;
};

int main(){
    int height = 480;
    int width = height;

    auto nThread = 16;
    auto nBlock = (height) / nThread;

    IN_TYPE* h_in_image, *d_in_image;
    unsigned int* d_out_image;
    h_in_image = (IN_TYPE *)malloc(height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_in_image, height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_out_image, height*width * sizeof(unsigned int));

    for (int n = 0; n < (height*width); n++)
    {
        h_in_image[n].x = rand()%10;
        h_in_image[n].y = rand()%10;
        h_in_image[n].z = rand()%10;
    }
    cudaMemcpy(d_in_image, h_in_image, height*width * sizeof(IN_TYPE), cudaMemcpyHostToDevice);

    histogram_gmem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();

    // not copying the results back as of now

    histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();
}
```
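As an aside (this checking is not in the original code): wrapping the launches in CUDA runtime error checks would surface the same failure without the profiler. A minimal illustrative sketch:

```
// Illustrative only -- not part of the original code (requires <cstdio>).
// cudaGetLastError() catches launch-time failures; the return value of
// cudaDeviceSynchronize() catches faults raised while the kernel runs.
histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("launch error: %s\n", cudaGetErrorString(err));
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("execution error: %s\n", cudaGetErrorString(err));
```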

# Answer 1

**Score**: 1

> Why is this showing an error in ncu?

The blog in question expects that the pixel (component) values will be expressed as floating-point in the range of [0, 1.0).
This is why this kind of multiplication makes sense, for either the gmem or smem version:

```
unsigned int r = (unsigned int)(256 * in[row * width + col].x);
                               ^^^^^^
```

so this is not correct:
```
struct IN_TYPE
{
    int x;
    int y;
    int z;
};
```
Instead, you want something like:

```
struct IN_TYPE
{
    float x;
    float y;
    float z;
};
```

and make sure that you initialize those values (`x`, `y`, `z`) in a range of 0.0 to ~0.999999 max.
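One straightforward way to do that (it is what the fixed code below uses) is to normalize `rand()`:

```
// normalize rand() to a float in [0.0, 1.0]
h_in_image[n].x = rand()/(float)RAND_MAX;
h_in_image[n].y = rand()/(float)RAND_MAX;
h_in_image[n].z = rand()/(float)RAND_MAX;
```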
Based on the structure of the code, and as stated in the blog, I'm not sure that more than 256 bins makes any sense. The code quantizes the `float` pixel values to an integer range of [0,255].
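To spell out why the `int` version faults (a worked example of the arithmetic, not stated explicitly above, and assuming the kernels match the blog's code):

```
// With the original int struct and rand()%10 initialization:
int x_component = 9;                                 // a possible in[...].x value
unsigned int r = (unsigned int)(256 * x_component);  // r = 2304
// but the smem kernel's accumulation array is declared as
//   __shared__ unsigned int smem[3 * NUM_BINS + 3]; // 1443 entries for NUM_BINS=480
// so atomicAdd(&smem[NUM_BINS * 0 + r], 1) writes to smem[2304], well past the
// end of the array -- consistent with the LaunchFailed report for that kernel.
```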
For the global data, your settings for NUM_PARTS (effectively the number of bins times the number of color components, or "parts" of each histogram) and the size of the output array don't make sense.
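Concretely (pulling the relevant lines out of the fixed code below): each block writes one partial histogram of NUM_PARTS entries, so the output allocation must scale with the number of blocks:

```
#define NUM_BINS (256)
#define NUM_PARTS (3*NUM_BINS)  // 3 color channels x NUM_BINS bins per block
// one partial histogram of NUM_PARTS unsigned ints per block:
cudaMalloc(&d_out_image, nBlock * NUM_PARTS * sizeof(unsigned int));
```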
When I address those items, the code runs without error for me:
当我处理这些问题时,代码在我的机器上没有错误运行:
$ cat t2209.cu
$ cat t2209.cu
#define NUM_BINS (256)
#define NUM_PARTS (3*NUM_BINS)
#define NUM_BINS(256)
#define NUM_PARTS(3*NUM_BINS)
struct IN_TYPE
{
float x;
float y;
float z;
};
结构IN_TYPE
{
float x;
float y;
float z;
}
__global__ void histogram_gmem_atomics(const IN_TYPE *in, int width, int height, unsigned int *out)
{
// pixel coordinates
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// 像素坐标
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// grid dimensions
int nx = blockDim.x * gridDim.x;
int ny = blockDim.y * gridDim.y;
// 网格尺寸
int nx = blockDim.x * gridDim.x;
int ny = blockDim.y * gridDim.y;
// linear thread index within 2D block
int t = threadIdx.x + threadIdx.y * blockDim.x;
// 2D块内的线性线程索引
int t = threadIdx.x + threadIdx.y * blockDim.x;
// total threads in 2D block
int nt = blockDim.x * blockDim.y;
// 2D块中的总线程数
int nt = blockDim.x * blockDim.y;
// linear block index within 2D grid
int g = blockIdx.x + blockIdx.y * gridDim.x;
// 2D网格中的线性块索引
int g = blockIdx.x + blockIdx.y * gridDim.x;
// initialize temporary accumulation array in global memory
unsigned int *gmem = out + g * NUM_PARTS;
for (int i = t; i < 3 * NUM_BINS; i += nt) gmem[i] = 0;
// 在全局内存中初始化临时累积数组
unsigned int *gmem = out + g * NUM_PARTS;
for (int i = t; i < 3 * NUM_BINS; i += nt) gmem[i] = 0;
// process pixels
// updates our block's partial histogram in global memory
for (int col = x; col < width; col += nx)
for (int row = y; row < height; row += ny) {
unsigned int r = (unsigned int)(256 * in[row * width + col].x);
unsigned int g = (unsigned int)(256 * in[row * width + col].y);
unsigned int b = (unsigned int)(256 * in[row * width + col].z);
atomicAdd(&gmem[NUM_BINS * 0 + r], 1);
atomicAdd(&gmem[NUM_BINS * 1 + g], 1);
atomicAdd(&gmem[NUM_BINS * 2 + b], 1);
}
// 处理像素
// 更新我们块的全局内存中的部分直方图
for (int col = x; col < width; col += nx)
for (int row = y; row < height; row += ny) {
unsigned int r = (unsigned int)(256 * in[row * width + col].x);
unsigned int g = (unsigned int)(256 * in[row * width + col].y);
unsigned int b = (unsigned int)(256 * in[row * width + col].z);
atomicAdd(&gmem[NUM_BINS * 0 + r], 1);
atomicAdd(&gmem[NUM_BINS * 1 + g], 1);
atomicAdd(&gmem[NUM_BINS * 2 + b], 1);
}
}
__global__ void histogram_smem_atomics(const IN_TYPE *in, int width, int height, unsigned int *out)
{
// pixel coordinates
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
// 像素坐
```
$ cat t2209.cu
#define NUM_BINS (256)
#define NUM_PARTS (3*NUM_BINS)

struct IN_TYPE
{
    float x;
    float y;
    float z;
};

__global__ void histogram_gmem_atomics(const IN_TYPE *in, int width, int height, unsigned int *out)
{
    // pixel coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // grid dimensions
    int nx = blockDim.x * gridDim.x;
    int ny = blockDim.y * gridDim.y;

    // linear thread index within 2D block
    int t = threadIdx.x + threadIdx.y * blockDim.x;

    // total threads in 2D block
    int nt = blockDim.x * blockDim.y;

    // linear block index within 2D grid
    int g = blockIdx.x + blockIdx.y * gridDim.x;

    // initialize temporary accumulation array in global memory
    unsigned int *gmem = out + g * NUM_PARTS;
    for (int i = t; i < 3 * NUM_BINS; i += nt) gmem[i] = 0;

    // process pixels
    // updates our block's partial histogram in global memory
    for (int col = x; col < width; col += nx)
        for (int row = y; row < height; row += ny) {
            unsigned int r = (unsigned int)(256 * in[row * width + col].x);
            unsigned int g = (unsigned int)(256 * in[row * width + col].y);
            unsigned int b = (unsigned int)(256 * in[row * width + col].z);
            atomicAdd(&gmem[NUM_BINS * 0 + r], 1);
            atomicAdd(&gmem[NUM_BINS * 1 + g], 1);
            atomicAdd(&gmem[NUM_BINS * 2 + b], 1);
        }
}

__global__ void histogram_smem_atomics(const IN_TYPE *in, int width, int height, unsigned int *out)
{
    // pixel coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // grid dimensions
    int nx = blockDim.x * gridDim.x;
    int ny = blockDim.y * gridDim.y;

    // linear thread index within 2D block
    int t = threadIdx.x + threadIdx.y * blockDim.x;

    // total threads in 2D block
    int nt = blockDim.x * blockDim.y;

    // linear block index within 2D grid
    int g = blockIdx.x + blockIdx.y * gridDim.x;

    // initialize temporary accumulation array in shared memory
    __shared__ unsigned int smem[3 * NUM_BINS + 3];
    for (int i = t; i < 3 * NUM_BINS + 3; i += nt) smem[i] = 0;
    __syncthreads();

    // process pixels
    // updates our block's partial histogram in shared memory
    for (int col = x; col < width; col += nx)
        for (int row = y; row < height; row += ny) {
            unsigned int r = (unsigned int)(256 * in[row * width + col].x);
            unsigned int g = (unsigned int)(256 * in[row * width + col].y);
            unsigned int b = (unsigned int)(256 * in[row * width + col].z);
            atomicAdd(&smem[NUM_BINS * 0 + r + 0], 1);
            atomicAdd(&smem[NUM_BINS * 1 + g + 1], 1);
            atomicAdd(&smem[NUM_BINS * 2 + b + 2], 1);
        }
    __syncthreads();

    // write partial histogram into the global memory
    out += g * NUM_PARTS;
    for (int i = t; i < NUM_BINS; i += nt) {
        out[i + NUM_BINS * 0] = smem[i + NUM_BINS * 0];
        out[i + NUM_BINS * 1] = smem[i + NUM_BINS * 1 + 1];
        out[i + NUM_BINS * 2] = smem[i + NUM_BINS * 2 + 2];
    }
}

int main(){
    int height = 480;
    int width = height;

    auto nThread = 16;
    auto nBlock = (height) / nThread;

    IN_TYPE* h_in_image, *d_in_image;
    unsigned int* d_out_image;
    h_in_image = (IN_TYPE *)malloc(height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_in_image, height*width * sizeof(IN_TYPE));
    cudaMalloc(&d_out_image, nBlock*NUM_PARTS * sizeof(unsigned int));

    for (int n = 0; n < (height*width); n++)
    {
        h_in_image[n].x = rand()/(float)RAND_MAX;
        h_in_image[n].y = rand()/(float)RAND_MAX;
        h_in_image[n].z = rand()/(float)RAND_MAX;
    }
    cudaMemcpy(d_in_image, h_in_image, height*width * sizeof(IN_TYPE), cudaMemcpyHostToDevice);

    histogram_gmem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();

    // not copying the results back as of now

    histogram_smem_atomics<<<nBlock, nThread>>>(d_in_image, width, height, d_out_image);
    cudaDeviceSynchronize();
}
$ nvcc -o t2209 t2209.cu
$ compute-sanitizer ./t2209
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$
```
