2023年4月17日 16:45:44go评论124阅读模式

英文:

Why is ‘issused warp per scheduler’ so low in code full with IMAD.WIDE instruction in NVIDIA ？

问题

The 'issued warp per scheduler' metric being lower than expected with IMAD.WIDE instructions could be due to the architecture-specific characteristics and resource limitations. In your code, you are using IMAD.WIDE instructions which have different resource requirements compared to regular IMAD instructions.

IMAD.WIDE instructions might require more resources in terms of execution units, registers, or other hardware components. This increased resource usage could limit the number of warps that can be issued per scheduler, resulting in a lower value for 'issued warp per scheduler' compared to regular IMAD instructions.

To get closer to the max value of 0.5, you may need to optimize your code further, possibly by reducing resource consumption or improving instruction-level parallelism. It's also possible that the specific architecture or configuration you are using has limitations that prevent achieving the max value.

Optimizing CUDA code for specific GPU architectures often involves experimentation and profiling to understand how different instructions and resource usage affect performance. You can also refer to NVIDIA's official documentation and forums for architecture-specific optimization tips and insights.

英文:

I use NVIDIA 2080Ti for test，and my test code is follow

#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;
#include &lt;chrono&gt;
#include &lt;cuda.h&gt;
struct item_t {
    uint32_t data[12];
    inline uint32_t&amp; __device__ operator[](size_t i)               { return data[i]; }
    inline const uint32_t&amp; operator[](size_t i) const   { return data[i]; }
};
__global__ void test(const item_t *__restrict__ in, item_t *out, int count) {
    constexpr int n = sizeof(in-&gt;data) / sizeof(uint32_t);
    int curTh = threadIdx.x + blockIdx.x * blockDim.x;
    if (curTh &gt;= count)
        return;
    item_t a = in[curTh * 2], b = in[curTh * 2 + 1];
    item_t accd[2] = {out[curTh * 2], out[curTh * 2 + 1]};
    auto *acc = (uint32_t *)accd;
    for (int i = 0; i &lt; 10000; ++i) {
        for (int j = 0; j &lt; n; ++j) {
            uint32_t bj = b[j];
            for (size_t k = 0; k &lt; n; k++) {
                asm(&quot;mad.lo.cc.u32 %0, %2, %3, %0; madc.hi.cc.u32 %1, %2, %3, %1;&quot;
                        : &quot;+r&quot;(acc[k * 2]), &quot;+r&quot;(acc[k * 2 + 1])
                        : &quot;r&quot;(a[k]), &quot;r&quot;(bj));
//                asm(&quot;mad.lo.u32 %0, %2, %3, %0;&quot;
//                        : &quot;+r&quot;(acc[k * 2]), &quot;+r&quot;(acc[k * 2 + 1])
//                        : &quot;r&quot;(a[k]), &quot;r&quot;(bj));
            }
        }
    }
    out[curTh*2]   = accd[0];
    out[curTh*2+1] = accd[1];;
}
typedef std::chrono::high_resolution_clock Clock;
int main(int argc, char *argv[])
{
    cudaSetDevice(0);
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&amp;stream, cudaStreamNonBlocking);
    int nth = 1;
    if (argc &gt; 1) {
        nth = atoi(argv[1]);
    }
    int NTHREADS = 32;
    printf(&quot;nth: %d\n&quot;, nth);
    item_t *d_in;
    item_t *d_out;
    cudaMalloc(&amp;d_out, nth * sizeof(*d_out) * 2);
    cudaMalloc(&amp;d_in, nth * sizeof(*d_in) * 2);
    printf(&quot;run 10 times\n&quot;);
    for (int i = 0; i &lt; 10; ++i) {
        auto start = Clock::now();
        test&lt;&lt;&lt;nth/NTHREADS + 1, NTHREADS, 0, stream&gt;&gt;&gt;(d_in, d_out, nth);
        cudaStreamSynchronize(stream);
        printf(&quot;use: %02fms\n&quot;, (Clock::now() - start) / 1000 / 1000.0);
    }
    return 0;
}

and then

> nvcc -std=c++17 -O0 -g -arch=sm_75 test.cu -o test
>
> cuobjdump -sass test

and then the output of sass code is follow

> ...... ......
>
> /0440/ MOV R31, R53 ; /* 0x00000035001f7202 /
> / 0x000fe20000000f00 /
> /0450/ IMAD.WIDE.U32 R28, R44, R21, R28 ; / 0x000000152c1c7225 /
> / 0x000fe200078e001c /
> /0460/ MOV R53, R51 ; / 0x0000003300357202 /
> / 0x000fc40000000f00 /
> /0470/ IADD3 R45, R45, 0x1, RZ ; / 0x000000012d2d7810 /
> / 0x000fe20007ffe0ff /
> /0480/ IMAD.WIDE.U32 R6, R42, R22, R6 ; / 0x000000162a067225 /
> / 0x000fc600078e0006 /
> /0490/ ISETP.NE.AND P0, PT, R45, 0x2710, PT ; / 0x000027102d00780c /
> / 0x000fe20003f05270 /
> /04a0/ IMAD.WIDE.U32 R28, R44, R20, R28 ; / 0x000000142c1c7225 /
> / 0x000fc800078e001c /
> /04b0/ IMAD.WIDE.U32 R58, R41, R22, R4 ; / 0x00000016293a7225 /
> / 0x000fc800078e0004 /
> /04c0/ IMAD.WIDE.U32 R46, R40, R22, R2 ; / 0x00000016282e7225 /
> / 0x000fc800078e0002 /
> /04d0/ IMAD.WIDE.U32 R32, R43, R21, R8 ; / 0x000000152b207225 /
> / 0x000fc800078e0008 /
> /04e0/ IMAD.WIDE.U32 R4, R42, R21, R6 ; / 0x000000152a047225 /
> / 0x000fc800078e0006 /
> /04f0/ IMAD.WIDE.U32 R8, R44, R19, R28 ; / 0x000000132c087225 /
> / 0x000fc800078e001c /
> /0500/ IMAD.WIDE.U32 R2, R41, R21, R58 ; / 0x0000001529027225 /
> / 0x000fc800078e003a /
> /0510/ IMAD.WIDE.U32 R30, R39, R22, R30 ; / 0x00000016271e7225 /
> / 0x000fc800078e001e /
> /0520/ IMAD.WIDE.U32 R28, R40, R21, R46 ; / 0x00000015281c7225 /
> / 0x000fc800078e002e /
> /0530/ IMAD.WIDE.U32 R6, R43, R20, R32 ; / 0x000000142b067225 /
> / 0x000fc800078e0020 /
> /0540/ IMAD.WIDE.U32 R4, R42, R20, R4 ; / 0x000000142a047225 /
> / 0x000fc800078e0004 /
> /0550/ IMAD.WIDE.U32 R2, R41, R20, R2 ; / 0x0000001429027225 /
> / 0x000fc800078e0002 /
> /0560/ IMAD.WIDE.U32 R30, R39, R21, R30 ; / 0x00000015271e7225 /
> / 0x000fc800078e001e /
> /0570/ IMAD.WIDE.U32 R28, R40, R20, R28 ; / 0x00000014281c7225 /
> / 0x000fc800078e001c /
> /0580/ IMAD.WIDE.U32 R6, R43, R19, R6 ; / 0x000000132b067225 /
> / 0x000fc800078e0006 /
> /0590/ IMAD.WIDE.U32 R4, R42, R19, R4 ; / 0x000000132a047225 /
> / 0x000fc800078e0004 /
> /05a0/ IMAD.WIDE.U32 R2, R41, R19, R2 ; / 0x0000001329027225 /
> / 0x000fc800078e0002 /
> /05b0/ IMAD.WIDE.U32 R30, R39, R20, R30 ; / 0x00000014271e7225 /
> / 0x000fc800078e001e /
> /05c0/ IMAD.WIDE.U32 R28, R40, R19, R28 ; / 0x00000013281c7225 /
> / 0x000fc800078e001c /
> /05d0/ IMAD.WIDE.U32 R8, R44, R18, R8 ; / 0x000000122c087225 /
> / 0x000fc800078e0008 /
> /05e0/ IMAD.WIDE.U32 R6, R43, R18, R6 ; / 0x000000122b067225 */ ...... ......

The ncu-gui run output show follow

So my question is， Why is 'issused warp per scheduler' so low ？

As I know， In turing architecture， each two cycle can issue one IMAD instruction， And in my code, each instruction is no data dependency in registers， So the value 'issused warp per scheduler' could reach the max value of 0.5

If I use IMAD instead of IMAD.WIDE， in other words I use the follow commented code

 //                asm(&quot;mad.lo.u32 %0, %2, %3, %0;&quot;
//                        : &quot;+r&quot;(acc[k * 2]), &quot;+r&quot;(acc[k * 2 + 1])
//                        : &quot;r&quot;(a[k]), &quot;r&quot;(bj));

and then I got the value 'issused warp per scheduler' close to 0.5

So why can't I get the value close to 0.5 with IMAD.WIDE instruction? and Is it possible to do ？

答案1

得分: 4

A TU10x SM有4个子分区（SMSP），每2个周期可以发出一条IMAD指令。IMAD.WIDE会发出（但不会增加发出活动计数器）该指令2次，以执行低32位操作和高32位操作，从而得到完整的64位值。每个子分区的吞吐量为每4个周期1条指令。

IMAD.{!WIDE, !HI}可以达到最大的发出槽位利用率为0.5。
IMAD.{WIDE, HI}可以达到最大的发出槽位利用率为0.25。

英文:

A TU10x SM has 4 sub-partitions (SMSP) that can issue a IMAD instruction every 2 cycles. IMAD.WIDE issues (but doesn't increment the issue active counter) the instruction 2 times to perform the lower 32-bit operation and the upper 32-bit operation resulting in a full 64-bit value. The throughput per sub-partition is 1 instruction every 4 cycles.

IMAD.{!WIDE, !HI} can reach a maximum issue slot utilization of .5.
IMAD.{WIDE, HI} can reach a maximum issue slot utilization of .25.

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

‘issused warp per scheduler’在充满IMAD.WIDE指令的NVIDIA代码中为什么这么低？

问题

答案1

为什么将容器的元素分配给容器本身（不）是一个明确定义的C++行为？

一个填充二维向量列的高效方法

如何在C++和Java之间使用JNI（Java Native Interface）正确管理内存释放？

`boost::shared_mutex`与`boost::upgrade_mutex`之间的区别是什么？

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。