Cublas gemms not respecting NaN inputs

Question

I came across a strange behavior with cuBLAS.

Consider the following code:

#include <cstdio>
#include <limits>

#include <cuda_runtime.h>
#include <cublas_v2.h>

void test(float matrixValue) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *A, *B, *C;
    cudaMallocHost(&A, sizeof(float) * 4);
    cudaMallocHost(&B, sizeof(float) * 4);
    cudaMallocHost(&C, sizeof(float) * 4);

    // Fill A and B with the test value (a NaN) and C with a sentinel.
    for (int i = 0; i < 4; i++) {
        A[i] = matrixValue;
        B[i] = matrixValue;
        C[i] = 123.0f;
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * 4);
    cudaMalloc(&dB, sizeof(float) * 4);
    cudaMalloc(&dC, sizeof(float) * 4);
    
    cudaMemcpy(dA, A, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * 4, cudaMemcpyHostToDevice);

    float alpha = 0.0f;
    float beta = 0.0f;

    // C = alpha * A * B + beta * C on 2x2 column-major matrices.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    cudaMemcpy(C, dC, sizeof(float) * 4, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    cublasDestroy(handle);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    test(std::numeric_limits<float>::signaling_NaN());
    test(std::numeric_limits<float>::quiet_NaN());
}

We first allocate three buffers on the CPU (A, B, C), fill A and B with NaNs, fill C with a sentinel value (123), upload all three to the GPU, perform a gemm on them with cuBLAS, download the result into the C buffer, and print the values in C. So far so good.

If we use alpha=1.0 when performing the gemm, the C buffer ends up full of NaNs, which is expected: any calculation involving a NaN results in a NaN.

But if we use alpha=0, the C buffer ends up full of zeros, even though 0 * NaN should result in NaN. The dC matrix is still overwritten, so the gemm (or at least the accumulation part) is not skipped entirely. I could not find anything in the NVIDIA documentation stating that the multiplication part of the gemm operation may be skipped when alpha equals zero. Does this make cuBLAS non-IEEE754-compliant when either A or B contains a NaN and alpha is set to zero?
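
For comparison, plain host arithmetic does propagate the NaN through a multiplication by zero, as IEEE 754 requires; a minimal standalone check:

#include <cstdio>
#include <limits>

int main() {
    // Under IEEE 754, 0 * NaN is NaN, not 0.
    float nan = std::numeric_limits<float>::quiet_NaN();
    printf("%f\n", 0.0f * nan);  // prints "nan"
}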

Answer 1 (score: 4)

As far as I can tell this is not a bug, but expected behavior.

Note that the docs read:

> For references please refer to:
> sgemm, dgemm, cgemm, zgemm

Those links lead to the BLAS reference Fortran code. If you look at it, you will find lines such as:

*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if  alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF

I'm not good at reading Fortran, but that very much looks like the observed behavior. So cuBLAS simply follows the de facto BLAS standard.
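
For readers who don't read Fortran either, that special case in rough C terms looks like the following (a sketch of the reference BLAS semantics, assuming a column-major m x n matrix C with leading dimension ldc; this is not the actual cuBLAS source):

if (alpha == 0.0f) {
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            // A and B are never read in this branch, so NaNs in them
            // cannot propagate into C.
            C[i + j * ldc] = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
        }
    }
    return;  // the multiplication part of the gemm is skipped entirely
}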

Personally, I also find this behavior appropriate, since it mirrors what you expect for beta. For beta you definitely need this special case: otherwise you could never reuse memory without explicitly zeroing it first, because some random previous data might happen to be interpreted as NaN or Inf.
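
To make that concrete, here is a hedged fragment (reusing handle, dA, and dB from the question's code) that writes a gemm result into freshly allocated, uninitialized device memory:

// dC is deliberately left uninitialized: with beta = 0 its old contents
// are never read, so the result is well defined anyway.
float *dC;
cudaMalloc(&dC, sizeof(float) * 4);

float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2,
            &alpha, dA, 2, dB, 2, &beta, dC, 2);
// If beta * C were evaluated literally, stale bits in dC that happened to
// decode as NaN or Inf would corrupt the result instead.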
