Detecting nearly singular matrix in CUDA

I am trying to compute the inverses of many matrices, ultimately in order to find the solutions of the corresponding linear systems.

I am doing this via the following sequence of cuBLAS function calls:

/*
Aims to solve Ax = RHS;
x = Ainv * RHS;
*/

cublas<t>getrfBatched // to find LU decomposition of A
cublas<t>getriBatched // to get inverse Ainv
cublas<t>gemmBatched // to multiply Ainv to RHS, to get the solution
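
For reference, a condensed sketch of this call sequence in double precision might look like the snippet below. The helper name and argument names are illustrative only; the buffers are assumed to be device arrays of device pointers (one column-major matrix per batch entry) plus pivot/info buffers, allocated exactly as in the full code further down.

```c++
#include <cublas_v2.h>

// Sketch only: batched solve of A x = RHS via explicit inversion.
// d_APtr, d_AinvPtr, d_RhsPtr, d_SolPtr: device arrays of device pointers,
// each pointing to an n-by-n column-major matrix; d_Pivot, d_Info: device int buffers.
cublasStatus_t batchedSolveViaInverse(cublasHandle_t handle, int n, int batchSize,
                                      double **d_APtr, double **d_AinvPtr,
                                      double **d_RhsPtr, double **d_SolPtr,
                                      int *d_Pivot, int *d_Info)
{
    const double alpha = 1.0, beta = 0.0;

    // LU-factorize every A in place; one info flag per matrix reports exact singularity.
    cublasStatus_t st = cublasDgetrfBatched(handle, n, d_APtr, n, d_Pivot, d_Info, batchSize);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    // Compute each inverse from its LU factors into d_AinvPtr.
    st = cublasDgetriBatched(handle, n, d_APtr, n, d_Pivot, d_AinvPtr, n, d_Info, batchSize);
    if (st != CUBLAS_STATUS_SUCCESS) return st;

    // x = Ainv * RHS for every batch entry.
    return cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                              &alpha, d_AinvPtr, n, d_RhsPtr, n,
                              &beta, d_SolPtr, n, batchSize);
}
```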


While `cublas<t>getrfBatched` will tell you whether a given matrix is singular, it does not seem to detect nearly singular matrices (matrices that are mathematically singular but, due to truncation and round-off, do not appear singular on a computer).

A direct example is

A = [[1 , 2 , 3],
     [4 , 5 , 6],
     [7 , 8 , 9]];


cuBLAS detects singularity for `transpose(A)`, but it does not detect singularity for `A`. In fact, Python's NumPy does not detect the singularity of `A` either, and I expect this is because it uses similar LU-decomposition-based algorithms.

How would I be able to detect singular matrices in a robust way on CUDA?

***Added***

These are the lines of code that I used for this example.

//base - https://stackoverflow.com/questions/22887167/cublas-incorrect-inversion-for-matrix-with-zero-pivot

#include <cstdio>
#include <cstdlib>
#include <assert.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#define N 3
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

int main() {

const int numMat = 6;

double h_A1[N*N];
double h_A2[N*N];
double h_A3[N*N];
double h_A4[N*N];
double h_A5[N*N];
double h_A6[N*N];
h_A1[IDX2C(0,0,N)] = 0.5; h_A1[IDX2C(0,1,N)] = 3. ; h_A1[IDX2C(0,2,N)] = 4. ;
h_A1[IDX2C(1,0,N)] = 1. ; h_A1[IDX2C(1,1,N)] = 3. ; h_A1[IDX2C(1,2,N)] = 10.;
h_A1[IDX2C(2,0,N)] = 4. ; h_A1[IDX2C(2,1,N)] = 9. ; h_A1[IDX2C(2,2,N)] = 16.;

h_A2[IDX2C(0,0,N)] = 0. ; h_A2[IDX2C(0,1,N)] = 3. ; h_A2[IDX2C(0,2,N)] = 4. ;
h_A2[IDX2C(1,0,N)] = 1. ; h_A2[IDX2C(1,1,N)] = 3. ; h_A2[IDX2C(1,2,N)] = 10.;
h_A2[IDX2C(2,0,N)] = 4. ; h_A2[IDX2C(2,1,N)] = 9. ; h_A2[IDX2C(2,2,N)] = 16.;

h_A3[IDX2C(0,0,N)] = 0. ; h_A3[IDX2C(0,1,N)] = 1. ; h_A3[IDX2C(0,2,N)] = 4. ;
h_A3[IDX2C(1,0,N)] = 3. ; h_A3[IDX2C(1,1,N)] = 3. ; h_A3[IDX2C(1,2,N)] = 9. ;
h_A3[IDX2C(2,0,N)] = 4. ; h_A3[IDX2C(2,1,N)] = 10.; h_A3[IDX2C(2,2,N)] = 16.;

h_A4[IDX2C(0,0,N)] = 0. ; h_A4[IDX2C(0,1,N)] = 3. ; h_A4[IDX2C(0,2,N)] = 4. ;
h_A4[IDX2C(1,0,N)] = 1. ; h_A4[IDX2C(1,1,N)] = 5. ; h_A4[IDX2C(1,2,N)] = 6. ;
h_A4[IDX2C(2,0,N)] = 9. ; h_A4[IDX2C(2,1,N)] = 8. ; h_A4[IDX2C(2,2,N)] = 2. ;

h_A5[IDX2C(0,0,N)] = 22.; h_A5[IDX2C(0,1,N)] = 3. ; h_A5[IDX2C(0,2,N)] = 4. ;
h_A5[IDX2C(1,0,N)] = 1. ; h_A5[IDX2C(1,1,N)] = 5. ; h_A5[IDX2C(1,2,N)] = 6. ;
h_A5[IDX2C(2,0,N)] = 9. ; h_A5[IDX2C(2,1,N)] = 8. ; h_A5[IDX2C(2,2,N)] = 2. ;

h_A6[IDX2C(0,0,N)] = 1. ; h_A6[IDX2C(0,1,N)] = 2. ; h_A6[IDX2C(0,2,N)] = 3. ;
h_A6[IDX2C(1,0,N)] = 4. ; h_A6[IDX2C(1,1,N)] = 5. ; h_A6[IDX2C(1,2,N)] = 6. ;
h_A6[IDX2C(2,0,N)] = 7. ; h_A6[IDX2C(2,1,N)] = 8. ; h_A6[IDX2C(2,2,N)] = 9. ;

double h_rhs[N*N];
h_rhs[IDX2C(0,0,N)] = 1.; h_rhs[IDX2C(0,1,N)] = 1.; h_rhs[IDX2C(0,2,N)] = 1.;
h_rhs[IDX2C(1,0,N)] = 2.; h_rhs[IDX2C(1,1,N)] = 2.; h_rhs[IDX2C(1,2,N)] = 2.;
h_rhs[IDX2C(2,0,N)] = 3.; h_rhs[IDX2C(2,1,N)] = 3.; h_rhs[IDX2C(2,2,N)] = 3.;

for (int i=0; i<N*N; i++) { printf("%f ", h_rhs[i]); }
printf("\n");

double h_sol1[N*N];
double h_sol2[N*N];
double h_sol3[N*N];
double h_sol4[N*N];
double h_sol5[N*N];
double h_sol6[N*N];

double h_inv1[N*N];
double h_inv2[N*N];
double h_inv3[N*N];
double h_inv4[N*N];
double h_inv5[N*N];
double h_inv6[N*N];

double *h_APtr[numMat], *h_AinvPtr[numMat], *h_SolPtr[numMat];
h_APtr[0] = &(h_A1[0]);
h_APtr[1] = &(h_A2[0]);
h_APtr[2] = &(h_A3[0]);
h_APtr[3] = &(h_A4[0]);
h_APtr[4] = &(h_A5[0]);
h_APtr[5] = &(h_A6[0]);

h_AinvPtr[0] = &(h_inv1[0]);
h_AinvPtr[1] = &(h_inv2[0]);
h_AinvPtr[2] = &(h_inv3[0]);
h_AinvPtr[3] = &(h_inv4[0]);
h_AinvPtr[4] = &(h_inv5[0]);
h_AinvPtr[5] = &(h_inv6[0]);

h_SolPtr[0] = &(h_sol1[0]);
h_SolPtr[1] = &(h_sol2[0]);
h_SolPtr[2] = &(h_sol3[0]);
h_SolPtr[3] = &(h_sol4[0]);
h_SolPtr[4] = &(h_sol5[0]);
h_SolPtr[5] = &(h_sol6[0]);

// allocation
double *d_A[numMat], *d_Ainv[numMat];   // on Host
double **d_APtr, **d_AinvPtr;           // d_A version of Device
for (int i=0; i<numMat; i++) {
  cudaMalloc((void**)&d_A[i], N*N*sizeof(double));   // elements of d_A get allocated on Device
  cudaMalloc((void**)&d_Ainv[i], N*N*sizeof(double));
}
cudaMalloc((void**)&d_APtr, numMat*sizeof(double *));
cudaMalloc((void**)&d_AinvPtr, numMat*sizeof(double *));

double *d_Rhs[numMat], *d_Sol[numMat];
double **d_RhsPtr, **d_SolPtr;
for (int i=0; i<numMat; i++) {
  cudaMalloc((void**)&d_Rhs[i], N*N*sizeof(double));
  cudaMalloc((void**)&d_Sol[i], N*N*sizeof(double));
}
cudaMalloc((void**)&d_RhsPtr, numMat*sizeof(double *));
cudaMalloc((void**)&d_SolPtr, numMat*sizeof(double *));

int *P, *INFO;   // specific to cublas<t>getrfBatched
cudaMalloc(&P, N*numMat*sizeof(int));
cudaMalloc(&INFO, numMat*sizeof(int));

cudaCheckErrors("cudaMalloc fail");

// memcpy
for (int i=0; i<numMat; i++) {
  cudaMemcpy(d_A[i], h_APtr[i], N*N*sizeof(double), cudaMemcpyHostToDevice);
}
cudaMemcpy(d_APtr, d_A, numMat*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_AinvPtr, d_Ainv, numMat*sizeof(double *), cudaMemcpyHostToDevice);

for (int i=0; i<numMat; i++) {
  cudaMemcpy(d_Rhs[i], h_rhs, N*N*sizeof(double), cudaMemcpyHostToDevice);
}
cudaMemcpy(d_RhsPtr, d_Rhs, numMat*sizeof(double *), cudaMemcpyHostToDevice);
cudaMemcpy(d_SolPtr, d_Sol, numMat*sizeof(double *), cudaMemcpyHostToDevice);

cudaCheckErrors("cudaMemcpy H2D fail");

// CUBLAS
cublasHandle_t myhandle;
cublasStatus_t cublas_result;

cublas_result = cublasCreate_v2(&myhandle);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

// getrfBatched
cublas_result = cublasDgetrfBatched(myhandle, N, d_APtr, N, P, INFO, numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

int INFOh[numMat];
cudaMemcpy(INFOh, INFO, numMat*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H fail (2)");

for (int i=0; i<numMat; i++) {
if (INFOh[i] != 0) {
fprintf(stderr, "ERROR: something wrong in INFOh\n");
fprintf(stderr, " - batch %d, INFO %d\n", i, INFOh[i]);
}
}

// getriBatched
cublas_result = cublasDgetriBatched(myhandle, N, d_APtr, N, P, d_AinvPtr, N, INFO, numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

cudaMemcpy(INFOh, INFO, numMat*sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy D2H fail (3)");

// print inverse
printf("\n Inverses \n");
for (int i=0; i<numMat; i++) {
cudaMemcpy(h_AinvPtr[i], d_Ainv[i], N*N*sizeof(double), cudaMemcpyDeviceToHost);
}
cudaCheckErrors("cudaMemcpy D2H fail");

double *tmpptr, *tmpptrInv;
for (int k=0; k<numMat; k++) {
tmpptr = h_APtr[k];
tmpptrInv = h_AinvPtr[k];

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptr[IDX2C(i,j,N)]);
  }

  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptrInv[IDX2C(i,j,N)]);
  }

  printf("\n");
}
printf("\n");

}

// gemmBatched
double alpha = 1.0;
double beta = 0.0;
cublas_result = cublasDgemmBatched(myhandle, CUBLAS_OP_N, CUBLAS_OP_N,
N,N,N,
&alpha, d_AinvPtr, N, d_RhsPtr, N,
&beta, d_SolPtr, N,
numMat);
assert(cublas_result == CUBLAS_STATUS_SUCCESS);

cudaFree(P);
cudaFree(INFO);
cublasDestroy_v2(myhandle);

// print solution
printf("\n Solutions \n");
for (int i=0; i<numMat; i++) {
cudaMemcpy(h_SolPtr[i], d_Sol[i], N*N*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_A[i]);
cudaFree(d_Ainv[i]);
}
cudaFree(d_APtr);
cudaFree(d_AinvPtr);
cudaCheckErrors("cudaMemcpy D2H fail (2)");

for (int k=0; k<numMat; k++) {
tmpptr = h_APtr[k];
tmpptrInv = h_SolPtr[k];

for (int i=0; i<N; i++) {
  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptr[IDX2C(i,j,N)]);
  }

  for (int j=0; j<N; j++) {
    printf("%f\t", tmpptrInv[IDX2C(i,j,N)]);
  }

  printf("\n");
}
printf("\n");

}

return 0;
}


This code is messy, and some allocations may not be properly freed by the end (I focused more on making it run and on understanding how to apply cuBLAS). I do not expect you to read through this messy code; I just wanted to demonstrate, with a directly compilable and runnable set of code lines, that cuBLAS indeed does not detect a singular matrix in some cases.

It solves a batch of 6 systems; each has its own 3x3 matrix `A`, and they all share the same 3x3 `RHS`.

It is `h_A6` that should be detected as singular.

When I run this, the result looks like

...
Inverses
...
1.000000 2.000000 3.000000 3152519739159346.500000 -6305039478318690.000000 3152519739159346.000000
4.000000 5.000000 6.000000 -6305039478318695.000000 12610078956637384.000000 -6305039478318692.000000
7.000000 8.000000 9.000000 3152519739159348.000000 -6305039478318693.000000 3152519739159346.000000


We can see that for some singular matrices it gives absurd results, even in double precision.



# Answer 1
**Score**: 1

The conventional diagnostic to use in this situation is the [condition number][1] of the matrix. LAPACK offers [GECON][2] for this purpose, and it shouldn’t be particularly difficult to implement your own version using the LU factorisation you are already calculating.


  [1]: https://en.m.wikipedia.org/wiki/Condition_number
  [2]: https://netlib.org/lapack/explore-html/dd/d9a/group__double_g_ecomputational_ga188b8d30443d14b1a3f7f8331d87ae60.html
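
Since the code in the question already copies the explicit inverses back to the host, one very simple (if crude) check in that setting is to form the 1-norm condition number directly, kappa_1(A) = ||A||_1 * ||A^{-1}||_1, and compare it against a tolerance. The sketch below assumes 3x3 column-major storage as in the question; the helper names and the threshold `kappaMax` are illustrative choices, not part of any library. A GECON-style estimator is more refined: it estimates ||A^{-1}||_1 from the LU factors with a norm estimator, without ever forming the inverse.

```c++
#include <cmath>
#include <cstdio>

// 1-norm of an n-by-n column-major matrix: the maximum absolute column sum.
static double norm1ColMajor(const double *M, int n)
{
    double best = 0.0;
    for (int j = 0; j < n; j++) {
        double colSum = 0.0;
        for (int i = 0; i < n; i++)
            colSum += std::fabs(M[j * n + i]);
        if (colSum > best) best = colSum;
    }
    return best;
}

// Flags a matrix as numerically (nearly) singular when its 1-norm condition
// number, computed from the explicitly returned inverse, exceeds kappaMax.
// kappaMax is an assumed, problem-dependent tolerance.
static bool nearlySingular(const double *A, const double *Ainv, int n,
                           double kappaMax = 1.0e12)
{
    const double kappa = norm1ColMajor(A, n) * norm1ColMajor(Ainv, n);
    std::printf("kappa_1 = %g\n", kappa);
    return !(kappa < kappaMax);   // also triggers if kappa is inf or NaN
}
```

In the question's code this could be called per batch entry after the inverses have been copied back, e.g. `nearlySingular(h_APtr[k], h_AinvPtr[k], N)`; for `h_A6`, using the absurd inverse shown in the question, the product of norms comes out on the order of 10^17, so it is flagged, while well-conditioned matrices stay far below any reasonable threshold.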



