Why is 'issued warp per scheduler' so low in code full of IMAD.WIDE instructions on NVIDIA GPUs?
Question
The 'issued warp per scheduler' metric being lower than expected with IMAD.WIDE instructions comes down to architecture-specific behavior and resource limits. IMAD.WIDE has different execution requirements than a plain IMAD: a wide multiply-add produces a 64-bit result from 32-bit operands, so each instruction occupies the integer pipeline for longer.
That extra work per instruction limits how often the scheduler can issue, which lowers 'issued warp per scheduler' relative to plain IMAD.
To get closer to the maximum value of 0.5, you would need to reduce the per-instruction work or improve instruction-level parallelism, and the specific architecture or configuration you are using may simply cap what is achievable.
Optimizing CUDA code for a specific GPU architecture usually takes profiling and experimentation to see how instruction mix and resource usage affect performance. NVIDIA's official documentation and forums are good sources for architecture-specific optimization tips.
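For illustration, here is a minimal sketch (not from the original post) of the two source patterns involved: a 32-bit multiply-add, which is typically lowered to a plain IMAD, and a 32x32-to-64-bit accumulate, which is typically lowered to IMAD.WIDE.U32 on sm_75. The exact SASS depends on the compiler version and flags.

#include <stdint.h>

// Expected to compile to plain IMAD on sm_75 (32-bit result).
__global__ void mad32(uint32_t *acc, const uint32_t *a, const uint32_t *b, int n) {
    uint32_t s = acc[threadIdx.x];
    for (int i = 0; i < n; ++i)
        s = a[i] * b[i] + s;             // 32-bit multiply-add
    acc[threadIdx.x] = s;
}

// Expected to compile to IMAD.WIDE.U32 on sm_75 (64-bit result
// from 32-bit operands).
__global__ void mad64(uint64_t *acc, const uint32_t *a, const uint32_t *b, int n) {
    uint64_t s = acc[threadIdx.x];
    for (int i = 0; i < n; ++i)
        s += (uint64_t)a[i] * b[i];      // 32x32 -> 64-bit multiply-add
    acc[threadIdx.x] = s;
}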
I use an NVIDIA 2080 Ti for testing, and my test code is as follows:
#include <stdio.h>
#include <stdint.h>
#include <chrono>
#include <cuda.h>

struct item_t {
    uint32_t data[12];
    __device__ inline uint32_t &operator[](size_t i) { return data[i]; }
    inline const uint32_t &operator[](size_t i) const { return data[i]; }
};

__global__ void test(const item_t *__restrict__ in, item_t *out, int count) {
    constexpr int n = sizeof(in->data) / sizeof(uint32_t);
    int curTh = threadIdx.x + blockIdx.x * blockDim.x;
    if (curTh >= count)
        return;

    item_t a = in[curTh * 2], b = in[curTh * 2 + 1];
    // Treat the two output items as 12 (lo, hi) pairs of 64-bit accumulators.
    item_t accd[2] = {out[curTh * 2], out[curTh * 2 + 1]};
    auto *acc = (uint32_t *)accd;
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < n; ++j) {
            uint32_t bj = b[j];
            for (size_t k = 0; k < n; k++) {
                // 32x32 -> 64-bit multiply-accumulate with carry;
                // compiles to IMAD.WIDE.U32 on sm_75.
                asm("mad.lo.cc.u32 %0, %2, %3, %0; madc.hi.cc.u32 %1, %2, %3, %1;"
                    : "+r"(acc[k * 2]), "+r"(acc[k * 2 + 1])
                    : "r"(a[k]), "r"(bj));
                // asm("mad.lo.u32 %0, %2, %3, %0;"
                //     : "+r"(acc[k * 2]), "+r"(acc[k * 2 + 1])
                //     : "r"(a[k]), "r"(bj));
            }
        }
    }
    out[curTh * 2] = accd[0];
    out[curTh * 2 + 1] = accd[1];
}
typedef std::chrono::high_resolution_clock Clock;

int main(int argc, char *argv[])
{
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    int nth = 1;
    if (argc > 1) {
        nth = atoi(argv[1]);
    }
    int NTHREADS = 32;
    printf("nth: %d\n", nth);

    item_t *d_in;
    item_t *d_out;
    cudaMalloc(&d_out, nth * sizeof(*d_out) * 2);
    cudaMalloc(&d_in, nth * sizeof(*d_in) * 2);

    printf("run 10 times\n");
    for (int i = 0; i < 10; ++i) {
        auto start = Clock::now();
        test<<<nth / NTHREADS + 1, NTHREADS, 0, stream>>>(d_in, d_out, nth);
        cudaStreamSynchronize(stream);
        // nanoseconds -> milliseconds
        printf("use: %02fms\n", (Clock::now() - start).count() / 1000 / 1000.0);
    }
    return 0;
}
Then I compile it and dump the SASS:
> nvcc -std=c++17 -O0 -g -arch=sm_75 test.cu -o test
>
> cuobjdump -sass test
The SASS output looks like this:
> ...... ......
>
> /*0440*/ MOV R31, R53 ;                              /* 0x00000035001f7202 */
>                                                      /* 0x000fe20000000f00 */
> /*0450*/ IMAD.WIDE.U32 R28, R44, R21, R28 ;          /* 0x000000152c1c7225 */
>                                                      /* 0x000fe200078e001c */
> /*0460*/ MOV R53, R51 ;                              /* 0x0000003300357202 */
>                                                      /* 0x000fc40000000f00 */
> /*0470*/ IADD3 R45, R45, 0x1, RZ ;                   /* 0x000000012d2d7810 */
>                                                      /* 0x000fe20007ffe0ff */
> /*0480*/ IMAD.WIDE.U32 R6, R42, R22, R6 ;            /* 0x000000162a067225 */
>                                                      /* 0x000fc600078e0006 */
> /*0490*/ ISETP.NE.AND P0, PT, R45, 0x2710, PT ;      /* 0x000027102d00780c */
>                                                      /* 0x000fe20003f05270 */
> /*04a0*/ IMAD.WIDE.U32 R28, R44, R20, R28 ;          /* 0x000000142c1c7225 */
>                                                      /* 0x000fc800078e001c */
> /*04b0*/ IMAD.WIDE.U32 R58, R41, R22, R4 ;           /* 0x00000016293a7225 */
>                                                      /* 0x000fc800078e0004 */
> /*04c0*/ IMAD.WIDE.U32 R46, R40, R22, R2 ;           /* 0x00000016282e7225 */
>                                                      /* 0x000fc800078e0002 */
> /*04d0*/ IMAD.WIDE.U32 R32, R43, R21, R8 ;           /* 0x000000152b207225 */
>                                                      /* 0x000fc800078e0008 */
> /*04e0*/ IMAD.WIDE.U32 R4, R42, R21, R6 ;            /* 0x000000152a047225 */
>                                                      /* 0x000fc800078e0006 */
> /*04f0*/ IMAD.WIDE.U32 R8, R44, R19, R28 ;           /* 0x000000132c087225 */
>                                                      /* 0x000fc800078e001c */
> /*0500*/ IMAD.WIDE.U32 R2, R41, R21, R58 ;           /* 0x0000001529027225 */
>                                                      /* 0x000fc800078e003a */
> /*0510*/ IMAD.WIDE.U32 R30, R39, R22, R30 ;          /* 0x00000016271e7225 */
>                                                      /* 0x000fc800078e001e */
> /*0520*/ IMAD.WIDE.U32 R28, R40, R21, R46 ;          /* 0x00000015281c7225 */
>                                                      /* 0x000fc800078e002e */
> /*0530*/ IMAD.WIDE.U32 R6, R43, R20, R32 ;           /* 0x000000142b067225 */
>                                                      /* 0x000fc800078e0020 */
> /*0540*/ IMAD.WIDE.U32 R4, R42, R20, R4 ;            /* 0x000000142a047225 */
>                                                      /* 0x000fc800078e0004 */
> /*0550*/ IMAD.WIDE.U32 R2, R41, R20, R2 ;            /* 0x0000001429027225 */
>                                                      /* 0x000fc800078e0002 */
> /*0560*/ IMAD.WIDE.U32 R30, R39, R21, R30 ;          /* 0x00000015271e7225 */
>                                                      /* 0x000fc800078e001e */
> /*0570*/ IMAD.WIDE.U32 R28, R40, R20, R28 ;          /* 0x00000014281c7225 */
>                                                      /* 0x000fc800078e001c */
> /*0580*/ IMAD.WIDE.U32 R6, R43, R19, R6 ;            /* 0x000000132b067225 */
>                                                      /* 0x000fc800078e0006 */
> /*0590*/ IMAD.WIDE.U32 R4, R42, R19, R4 ;            /* 0x000000132a047225 */
>                                                      /* 0x000fc800078e0004 */
> /*05a0*/ IMAD.WIDE.U32 R2, R41, R19, R2 ;            /* 0x0000001329027225 */
>                                                      /* 0x000fc800078e0002 */
> /*05b0*/ IMAD.WIDE.U32 R30, R39, R20, R30 ;          /* 0x00000014271e7225 */
>                                                      /* 0x000fc800078e001e */
> /*05c0*/ IMAD.WIDE.U32 R28, R40, R19, R28 ;          /* 0x00000013281c7225 */
>                                                      /* 0x000fc800078e001c */
> /*05d0*/ IMAD.WIDE.U32 R8, R44, R18, R8 ;            /* 0x000000122c087225 */
>                                                      /* 0x000fc800078e0008 */
> /*05e0*/ IMAD.WIDE.U32 R6, R43, R18, R6 ;            /* 0x000000122b067225 */
>
> ...... ......
The ncu GUI run shows 'issued warp per scheduler' well below its peak (the screenshot is not reproduced here).
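(For reference, the same number can be read without the GUI. A hedged sketch, assuming the GUI's 'Issued Warp Per Scheduler' corresponds to the smsp__issue_active.avg.per_cycle_active metric, which may differ across Nsight Compute versions; the trailing 1024 is just an example thread count for this test program:)

> ncu --metrics smsp__issue_active.avg.per_cycle_active ./test 1024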
So my question is: why is 'issued warp per scheduler' so low?
As far as I know, on the Turing architecture each scheduler can issue one IMAD instruction every two cycles, and in my code the instructions have no register data dependencies between them, so 'issued warp per scheduler' should be able to reach the maximum value of 0.5.
If I use IMAD instead of IMAD.WIDE, in other words if I use the commented-out code:

// asm("mad.lo.u32 %0, %2, %3, %0;"
//     : "+r"(acc[k * 2]), "+r"(acc[k * 2 + 1])
//     : "r"(a[k]), "r"(bj));

then 'issued warp per scheduler' does get close to 0.5.
So why can't I get close to 0.5 with IMAD.WIDE instructions, and is it possible to do so?
Answer 1
Score: 4
A TU10x SM has 4 sub-partitions (SMSP), each of which can issue an IMAD instruction every 2 cycles. IMAD.WIDE issues the instruction 2 times (without incrementing the issue-active counter for the second pass) to perform the lower 32-bit operation and the upper 32-bit operation, producing the full 64-bit result. The throughput per sub-partition is therefore 1 instruction every 4 cycles.

IMAD.{!WIDE, !HI} can reach a maximum issue-slot utilization of 0.5.
IMAD.{WIDE, HI} can reach a maximum issue-slot utilization of 0.25.
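To see the difference in practice, here is a minimal sketch adapted from the commented-out code in the question (with the unused second output operand dropped): the inner loop issues only 32-bit mad.lo operations, so every resulting SASS instruction is a plain IMAD (maximum issue-slot utilization 0.5) rather than an IMAD.WIDE (maximum 0.25). Note that this variant is not numerically equivalent to the original, since it drops the high 32 bits and the carry chain; it only serves to compare issue throughput.

for (int j = 0; j < n; ++j) {
    uint32_t bj = b[j];
    for (size_t k = 0; k < n; k++) {
        // Plain 32-bit MAD: lowers to IMAD (not IMAD.WIDE) on sm_75.
        asm("mad.lo.u32 %0, %1, %2, %0;"
            : "+r"(acc[k * 2])
            : "r"(a[k]), "r"(bj));
    }
}

Bear in mind that higher issue-slot utilization here does not mean more arithmetic throughput: per the answer above, one IMAD.WIDE performs two 32-bit operations' worth of work per issued instruction, so both variants keep the integer pipeline comparably busy.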