Cublas gemms not respecting NaN inputs
Question
I came across a strange behaviour with cublas.
Consider the following code:
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <limits>

void test(float matrixValue) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Host buffers for the 2x2 matrices A, B and the output C.
    float *A, *B, *C;
    cudaMallocHost(&A, sizeof(float) * 4);
    cudaMallocHost(&B, sizeof(float) * 4);
    cudaMallocHost(&C, sizeof(float) * 4);
    for (int i = 0; i < 4; i++) {
        A[i] = matrixValue;  // NaN in both test calls
        B[i] = matrixValue;
        C[i] = 123.0f;       // sentinel value
    }

    // Device buffers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * 4);
    cudaMalloc(&dB, sizeof(float) * 4);
    cudaMalloc(&dC, sizeof(float) * 4);
    cudaMemcpy(dA, A, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * 4, cudaMemcpyHostToDevice);

    // C = alpha * A * B + beta * C, with alpha = beta = 0.
    float alpha = 0.0f;
    float beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    cudaMemcpy(C, dC, sizeof(float) * 4, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    cublasDestroy(handle);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    test(std::numeric_limits<float>::signaling_NaN());
    test(std::numeric_limits<float>::quiet_NaN());
}
We first allocate three buffers on the CPU (A, B, C), fill A and B with NaNs, fill C with a sentinel value (123), upload them to the GPU, perform a gemm on them using cublas, download the result back into the C buffer, and then print the values in C. So far so good.
If we use alpha=1.0 when performing the gemm, the C buffer will be full of NaNs, which is to be expected, as any calculation involving a NaN results in a NaN.
But if we use alpha=0, the C buffer ends up full of zeros, even though 0 * NaN should result in NaN. The dC matrix is still overwritten, which means the gemm (at least the accumulation part) is not skipped. I could not find anywhere in the Nvidia documentation a statement that the multiplication part of the gemm operation may be skipped when alpha equals zero. In that case, does this make cuBLAS non-compliant with IEEE 754 when either A or B contains a NaN and alpha is set to zero?
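For comparison, here is a minimal host-only check (no cuBLAS involved, assuming default floating-point compiler settings, i.e. no fast-math flags) confirming that plain IEEE 754 single-precision multiplication does propagate the NaN through 0 * NaN:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    float nanValue = std::numeric_limits<float>::quiet_NaN();
    float product = 0.0f * nanValue;  // under IEEE 754, 0 * NaN is NaN
    printf("0 * NaN = %f (isnan: %d)\n", product, std::isnan(product));
}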
Answer 1
Score: 4
This is not a bug as far as I can tell, but expected behavior.
Note that the docs read
> For references please refer to:
> sgemm, dgemm, cgemm, zgemm
Those links lead to the BLAS reference Fortran code. If you look at that, you will find lines such as
*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF
I'm not good at reading Fortran, but that very much looks like the observed behavior. So CUDA just follows the de facto standard.
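Paraphrased into C++ (a rough sketch of the reference routine quoted above, using a hypothetical helper name, not the actual cuBLAS implementation), the special case reads:

// Rough C++ paraphrase of the BLAS reference "quick return" logic quoted above.
// Illustration only; sgemm_alpha_zero_path is a hypothetical name, and this is
// not how cuBLAS is implemented internally.
void sgemm_alpha_zero_path(int m, int n, float alpha, float beta, float* C, int ldc) {
    if (alpha == 0.0f) {
        if (beta == 0.0f) {
            // C is simply overwritten with zeros; A and B are never read,
            // so NaNs in A or B cannot propagate.
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < m; ++i)
                    C[i + j * ldc] = 0.0f;
        } else {
            // C is scaled by beta; again, A and B are never touched.
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < m; ++i)
                    C[i + j * ldc] = beta * C[i + j * ldc];
        }
        return;  // the multiplication A*B is skipped entirely
    }
    // ... the general alpha != 0 path would follow here ...
}

In other words, when alpha is zero the inputs A and B never take part in any arithmetic, so there is nothing for a NaN to propagate through.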
Personally, I also find this behavior appropriate, since it is the same as what you expect for beta. For beta you definitely need this special case: otherwise you could never reuse memory without explicitly zeroing it first, because some random previous data could be interpreted as NaN or Inf.
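To make that concrete, here is a small host-side toy example (hypothetical names, nothing to do with the actual cuBLAS code) showing what would happen if beta * C were always evaluated literally on a reused buffer that still contains stale NaN data:

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // Pretend this is a reused output buffer that still holds stale data.
    float staleC = std::numeric_limits<float>::quiet_NaN();
    float beta = 0.0f;
    float newValue = 42.0f;  // stands in for the freshly computed alpha * A * B term

    float literal = newValue + beta * staleC;   // NaN: the stale data leaks through
    float specialCased = (beta == 0.0f)         // what BLAS-style routines do instead
                             ? newValue
                             : newValue + beta * staleC;

    printf("literal:       %f (isnan: %d)\n", literal, std::isnan(literal));
    printf("special-cased: %f\n", specialCased);
}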