Detecting nearly singular matrix in CUDA

I am trying to compute the inverses of many matrices, ultimately in order to find the solutions of the corresponding linear systems.

I am doing this via the following sequence of cuBLAS function calls:

/*
Aims to solve Ax = RHS;
x = Ainv * RHS;
*/

cublas<t>getrfBatched // to find LU decomposition of A
cublas<t>getriBatched // to get inverse Ainv
cublas<t>gemmBatched // to multiply Ainv to RHS, to get the solution
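
For reference, a condensed sketch of this call sequence in double precision might look like the snippet below. The helper name and argument names are illustrative only; the buffers are assumed to be device arrays of device pointers (one column-major matrix per batch entry) plus pivot/info buffers, allocated exactly as in the full code further down.

```c++
#include <cublas_v2.h>

// Sketch only: batched solve of A x = RHS via explicit inversion.
// d_APtr, d_AinvPtr, d_RhsPtr, d_SolPtr: device arrays of device pointers,
// each pointing to an n-by-n column-major matrix; d_Pivot, d_Info: device int buffers.
cublasStatus_t batchedSolveViaInverse(cublasHandle_t handle, int n, int batchSize,
                                      double **d_APtr, double **d_AinvPtr,
                                      double **d_RhsPtr, double **d_SolPtr,
                                      int *d_Pivot, int *d_Info)
{
    const double alpha = 1.0, beta = 0.0;

    // LU-factorize every A in place; one info flag per matrix reports exact singularity.
    cublasStatus_t st = cublasDgetrfBatched(handle, n, d_APtr, n, d_Pivot, d_Info, batchSize);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    // Compute each inverse from its LU factors into d_AinvPtr.
    st = cublasDgetriBatched(handle, n, d_APtr, n, d_Pivot, d_AinvPtr, n, d_Info, batchSize);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    // x = Ainv * RHS for every batch entry.
    return cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                              &alpha, d_AinvPtr, n, d_RhsPtr, n,
                              &beta, d_SolPtr, n, batchSize);
}
```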


While `cublas<t>getrfBatched` will tell you whether a given matrix is singular, it does not seem to detect nearly singular matrices (matrices that are mathematically singular but, due to truncation and round-off, do not appear singular on a computer).

A direct example is

A = [[1 , 2 , 3],
     [4 , 5 , 6],
     [7 , 8 , 9]];


cuBLAS detects singularity for `transpose(A)`, but it does not detect singularity for `A`. In fact, Python's NumPy does not detect the singularity of `A` either, and I expect this is because it uses similar LU-decomposition-based algorithms.

How would I be able to detect singular matrices in a robust way on CUDA?

***Added***

These are the lines of code that I used for this example.

//base - https://stackoverflow.com/questions/22887167/cublas-incorrect-inversion-for-matrix-with-zero-pivot

#include <cstdio>
#include <cstdlib>
#include <assert.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#define N 3
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main() {

const int numMat = 6;

double h_A1[N*N];
double h_A2[N*N];
double h_A3[N*N];
double h_A4[N*N];
double h_A5[N*N];
double h_A6[N*N];
h_A1[IDX2C(0,0,N)] = 0.5; h_A1[IDX2C(0,1,N)] = 3. ; h_A1[IDX2C(0,2,N)] = 4. ;
h_A1[IDX2C(1,0,N)] = 1. ; h_A1[IDX2C(1,1,N)] = 3. ; h_A1[IDX2C(1,2,N)] = 10.;
h_A1[IDX2C(2,0,N)] = 4. ; h_A1[IDX2C(2,1,N)] = 9. ; h_A1[IDX2C(2,2,N)] = 16.;

h_A2[IDX2C(0,0,N)] = 0. ; h_A2[IDX2C(0,1,N)] = 3. ; h_A2[IDX2C(0,2,N)] = 4. ;
h_A2[IDX2C(1,0,N)] = 1. ; h_A2[IDX2C(1,1,N)] = 3. ; h_A2[IDX2C(1,2,N)] = 10.;
h_A2[IDX2C(2,0,N)] = 4. ; h_A2[IDX2C(2,1,N)] = 9. ; h_A2[IDX2C(2,2,N)] = 16.;

h_A3[IDX2C(0,0,N)] = 0. ; h_A3[IDX2C(0,1,N)] = 1. ; h_A3[IDX2C(0,2,N)] = 4. ;
h_A3[IDX2C(1,0,N)] = 3. ; h_A3[IDX2C(1,1,N)] = 3. ; h_A3[IDX2C(1,2,N)] = 9. ;
h_A3[IDX2C(2,0,N)] = 4. ; h_A3[IDX2C(2,1,N)] = 10.; h_A3[IDX2C(2,2,N)] = 16.;

h_A4[IDX2C(0,0,N)] = 0. ; h_A4[IDX2C(0,1,N)] = 3. ; h_A4[IDX2C(0,2,N)] = 4. ;
h_A4[IDX2C(1,0,N)] = 1. ; h_A4[IDX2C(1,1,N)] = 5. ; h_A4[IDX2C(1,2,N)] = 6. ;
h_A4[IDX2C(2,0,N)] = 9. ; h_A4[IDX2C(2,1,N)] = 8. ; h_A4[IDX2C(2,2,N)] = 2. ;

h_A5[IDX2C(0,0,N)] = 22.; h_A5[IDX2C(0,1,N)] = 3. ; h_A5[IDX2C(0,2,N)] = 4. ;
h_A5[IDX2C(1,0,N)] = 1. ; h_A5[IDX2C(1,1,N)] = 5. ; h_A5[IDX2C(1,2,N)] = 6. ;
h_A5[IDX2C(2,0,N)] = 9. ; h_A5[IDX2C(2,1,N)] = 8. ; h_A5[IDX2C(2,2,N)] = 2. ;

h_A6[IDX2C(0,0,N)] = 1. ; h_A6[IDX2C(0,1,N)] = 2. ; h_A6[IDX2C(0,2,N)] = 3. ;
h_A6[IDX2C(1,0,N)] = 4. ; h_A6[IDX2C(1,1,N)] = 5. ; h_A6[IDX2C(1,2,N)] = 6. ;
h_A6[IDX2C(2,0,N)] = 7. ; h_A6[IDX2C(2,1,N)] = 8. ; h_A6[IDX2C(2,2,N)] = 9. ;

double h_rhs[N*N];
h_rhs[IDX2C(0,0,N)] = 1.; h_rhs[IDX2C(0,1,N)] = 1.; h_rhs[IDX2C(0,2,N)] = 1.;
h_rhs[IDX2C(1,0,N)] = 2.; h_rhs[IDX2C(1,1,N)] = 2.; h_rhs[IDX2C(1,2,N)] = 2.;
h_rhs[IDX2C(2,0,N)] = 3.; h_rhs[IDX2C(2,1,N)] = 3.; h_rhs[IDX2C(2,2,N)] = 3.;

for (int i=0; i<N*N; i++) { printf("%f ", h_rhs[i]); }
printf("\n");

double h_sol1[N*N];
double h_sol2[N*N];
double h_sol3[N*N];
double h_sol4[N*N];
double h_sol5[N*N];
double h_sol6[N*N];

double h_inv1[N*N];
double h_inv2[N*N];
double h_inv3[N*N];
double h_inv4[N*N];
double h_inv5[N*N];
double h_inv6[N*N];

double *h_APtr[numMat], *h_AinvPtr[numMat], *h_SolPtr[numMat];
h_APtr[0] = &(h_A1[0]);
h_APtr[1] = &(h_A2[0]);
h_APtr[2] = &(h_A3[0]);
h_APtr[3] = &(h_A4[0]);
h_APtr[4] = &(h_A5[0]);
h_APtr[5] = &(h_A6[0]);

h_AinvPtr[0] = &(h_inv1[0]);
h_AinvPtr[1] = &(h_inv2[0]);
h_AinvPtr[2] = &(h_inv3[0]);
h_AinvPtr[3] = &(h_inv4[0]);
h_AinvPtr[4] = &(h_inv5[0]);
h_AinvPtr[5] = &(h_inv6[0]);

h_SolPtr[0] = &(h_sol1[0]);
h_SolPtr[1] = &(h_sol2[0]);
h_SolPtr[2] = &(h_sol3[0]);
h_SolPtr[3] = &(h_sol4[0]);
h_SolPtr[4] = &(h_sol5[0]);
h_SolPtr[5] = &(h_sol6[0]);

// allocation
double *d_A[numMat], *d_Ainv[numMat];   // on Host
double **d_APtr, **d_AinvPtr;           // d_A version of Device
for (int i=0; i<numMat; i++) {
  cudaMalloc((void**)&d_A[i], N*N*sizeof(double));   // elements of d_A get allocated on Device
  cudaMalloc((void**)&d_Ainv[i], N*N*sizeof(double));
}
cudaMalloc((void**)&d_APtr, numMat*sizeof(double *));
cudaMalloc((void**)&d_AinvPtr, numMat*sizeof(double *));

double *d_Rhs[numMat], *d_Sol[numMat];
double **d_RhsPtr, **d_SolPtr;
for (int i=0; i<numMat; i++) {
  cudaMalloc((void**)&d_Rhs[i], N*N*sizeof(double));
  cudaMalloc((void**)&d_Sol[i], N*N*sizeof(double));
}
cudaMalloc((void**)&d_RhsPtr, numMat*sizeof(double *));
cudaMalloc((void**)&d_SolPtr, numMat*sizeof(double *));

int *P, *INFO;   // specific to cublas<t>getrfBatched
cudaMalloc(&P, N*numMat*sizeof(int));
cudaMalloc(&INFO, numMat*sizeof(int));

cudaCheckErrors("cudaMalloc fail");

// memcpy
for (int i=0; i<numMat; i++) {
  cudaMemcpy(d_A[i], h_APtr[i], N*N*sizeof(double), cudaMemcpyHostToDevice);
}
cudaMemcpy(d_APtr, d_A, numMat*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_AinvPtr, d_Ainv, numMat*sizeof(double *), cudaMemcpyHostToDevice);

for (int i=0; i<numMat; i++) {
  cudaMemcpy(d_Rhs[i], h_rhs, N*N*sizeof(double), cudaMemcpyHostToDevice);
}
cudaMemcpy(d_RhsPtr, d_Rhs, numMat*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_SolPtr, d_Sol, numMat*sizeof(double *), cudaMemcpyHostToDevice);

cudaCheckErrors("cudaMemcpy H2D fail");

// CUBLAS
cublasHandle_t myhandle;
cublasStatus_t cublas_result;

cublas_result = cublasCreate_v2(&myhandle);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

// getrfBatched
cublas_result = cublasDgetrfBatched(myhandle, N, d_APtr, N, P, INFO, numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

int INFOh[numMat];
cudaMemcpy(INFOh, INFO, numMat*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H fail (2)");

for (int i=0; i<numMat; i++) {
if (INFOh[i] != 0) {
fprintf(stderr, "ERROR: something wrong in INFOh\n");
fprintf(stderr, " - batch %d, INFO %d\n", i, INFOh[i]);
}
}

// getriBatched
cublas_result = cublasDgetriBatched(myhandle, N, d_APtr, N, P, d_AinvPtr, N, INFO, numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

cudaMemcpy(INFOh, INFO, numMat*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H fail (3)");

// print inverse
printf("\n Inverses \n");
for (int i=0; i<numMat; i++) {
cudaMemcpy(h_AinvPtr[i], d_Ainv[i], N*N*sizeof(double), cudaMemcpyDeviceToHost);
}
cudaCheckErrors("cudaMemcpy D2H fail");

double *tmpptr, *tmpptrInv;
for (int k=0; k<numMat; k++) {
tmpptr = h_APtr[k];
tmpptrInv = h_AinvPtr[k];

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptr[IDX2C(i,j,N)]);
  }

  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptrInv[IDX2C(i,j,N)]);
  }

  printf("\n");
}
printf("\n");

}

// gemmBatched
double alpha = 1.0;
double beta = 0.0;
cublas_result = cublasDgemmBatched(myhandle, CUBLAS_OP_N, CUBLAS_OP_N,
N,N,N,
&alpha, d_AinvPtr, N, d_RhsPtr, N,
&beta, d_SolPtr, N,
numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

cudaFree(P);
cudaFree(INFO);
cublasDestroy_v2(myhandle);

// print solution
printf("\n Solutions \n");
for (int i=0; i<numMat; i++) {
cudaMemcpy(h_SolPtr[i], d_Sol[i], N*N*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_A[i]);
cudaFree(d_Ainv[i]);
}
cudaFree(d_APtr);
cudaFree(d_AinvPtr);
cudaCheckErrors("cudaMemcpy D2H fail (2)");

for (int k=0; k<numMat; k++) {
tmpptr = h_APtr[k];
tmpptrInv = h_SolPtr[k];

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptr[IDX2C(i,j,N)]);
  }

  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptrInv[IDX2C(i,j,N)]);
  }

  printf("\n");
}
printf("\n");

}

return 0;
}


This code is messy, and some allocations may not be properly freed by the end (I focused more on making it run and on understanding how to apply cuBLAS). I do not expect you to read through this messy code; I just wanted to demonstrate, with a directly compilable and runnable set of code lines, that cuBLAS indeed does not detect a singular matrix in some cases.

It solves a batch of 6 systems; each has its own 3x3 matrix `A`, and they all share the same 3x3 `RHS`.

It is `h_A6` that should be detected as singular.

When I run this, the result looks like

...
Inverses
...
1.000000 2.000000 3.000000 3152519739159346.500000 -6305039478318690.000000 3152519739159346.000000
4.000000 5.000000 6.000000 -6305039478318695.000000 12610078956637384.000000 -6305039478318692.000000
7.000000 8.000000 9.000000 3152519739159348.000000 -6305039478318693.000000 3152519739159346.000000


We can see that for some singular matrices it gives absurd results, even in double precision.



# Answer 1
**Score**: 1

The conventional diagnostic to use in this situation is the [condition number][1] of the matrix. LAPACK offers [GECON][2] for this purpose, and it shouldn’t be particularly difficult to implement your own version using the LU factorisation you are already calculating.


  [1]: https://en.m.wikipedia.org/wiki/Condition_number
  [2]: https://netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga188b8d30443d14b1a3f7f8331d87ae60.html
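
Since the code in the question already copies the explicit inverses back to the host, one very simple (if crude) check in that setting is to form the 1-norm condition number directly, kappa_1(A) = ||A||_1 * ||A^{-1}||_1, and compare it against a tolerance. The sketch below assumes 3x3 column-major storage as in the question; the helper names and the threshold `kappaMax` are illustrative choices, not part of any library. A GECON-style estimator is more refined: it estimates ||A^{-1}||_1 from the LU factors with a norm estimator, without ever forming the inverse.

```c++
#include <cmath>
#include <cstdio>

// 1-norm of an n-by-n column-major matrix: the maximum absolute column sum.
static double norm1ColMajor(const double *M, int n)
{
    double best = 0.0;
    for (int j = 0; j < n; j++) {
        double colSum = 0.0;
        for (int i = 0; i < n; i++)
            colSum += std::fabs(M[j * n + i]);
        if (colSum > best) best = colSum;
    }
    return best;
}

// Flags a matrix as numerically (nearly) singular when its 1-norm condition
// number, computed from the explicitly returned inverse, exceeds kappaMax.
// kappaMax is an assumed, problem-dependent tolerance.
static bool nearlySingular(const double *A, const double *Ainv, int n,
                           double kappaMax = 1.0e12)
{
    const double kappa = norm1ColMajor(A, n) * norm1ColMajor(Ainv, n);
    std::printf("kappa_1 = %g\n", kappa);
    return !(kappa < kappaMax);   // also triggers if kappa is inf or NaN
}
```

In the question's code this could be called per batch entry after the inverses have been copied back, e.g. `nearlySingular(h_APtr[k], h_AinvPtr[k], N)`; for `h_A6`, using the absurd inverse shown in the question, the product of norms comes out on the order of 10^17, so it is flagged, while well-conditioned matrices stay far below any reasonable threshold.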



