使用 jBLAS 与 NVBLAS

huangapple go评论81阅读模式
英文:

Using jBLAS with NVBLAS

问题

我用相当巧妙的方法将 jBLAS 与 NVBLAS 编译在一起,因为配置脚本无法正确找到库。我手动编辑了 jBLAS 的 configure.out 文件,如下所示,以包含 NVBLAS 库。

BUILD_TYPE=nvblas
CC=gcc
CCC=c99
CFLAGS=-fPIC -DHAS_CPUID
F77=gfortran
FOUND_JAVA=true
FOUND_NM=true
INCDIRS=-Iinclude -I/usr/lib/jvm/java-11-openjdk-amd64/include -I/usr/lib/jvm/java-11-openjdk-amd64/include/linux
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
LAPACK_HOME=./lapack-lite-3.1.1
LD=gcc
LDFLAGS=-shared
LIB=lib
LINKAGE_TYPE=static
LOADLIBES=-Wl,-z,muldefs /home/linyi/jblas/lapack-lite-3.1.1/lapack_LINUX.a /usr/local/cuda-11.0/lib64/libnvblas.so.11 /home/linyi/jblas/lapack-lite-3.1.1/blas_LINUX.a -lgfortran
MAKE=make
NM=nm
OS_ARCH=amd64
OS_ARCH_WITH_FLAVOR=amd64/sse3
OS_NAME=Linux
RUBY=ruby
SO=so

然后我运行了命令 make clean allmvn clean package,如这里所述。测试成功通过,但程序在退出时导致分段错误。

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.jblas.TestEigen
[NVBLAS] NVBLAS_CONFIG_FILE 环境变量设置为 '/home/linyi/nvblas.conf'
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.569 sec
...
Aborted (core dumped)
Results :
Tests run: 122, Failures: 0, Errors: 0, Skipped: 0

我决定运行 mvn clean package -DskipTests,因为测试似乎通过了,只是程序在终止时会导致分段错误。然而,当我在我的 Java 项目中使用该库时,nvblas.log 显示尽管 NVBLAS 拦截了对 BLAS 例程的调用,但实际上它们在 CPU 上执行,而不是在 GPU 上执行。使用 nvprof --print-gpu-summary 运行我的程序也得出了同样的结论。

==7711== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memcpy HtoD]
======== Error: Application received signal 134

nvblas.log 的内容如下:

[NVBLAS] Using devices :0
[NVBLAS] Config parsed
[NVBLAS] dgemm[cpu]: ta=N, tb=N, m=1, n=1, k=1
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=24, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=32, k=28
...

我真的不知道该怎么办,希望有人能提供任何建议,这似乎真的文档不足。

英文:

I have compiled jBLAS with NVBLAS with a somewhat hacky solution, since the configure script was not correctly finding the libraries. I manually edited the configure.out file of jBLAS like so, to include the NVBLAS libraries.

BUILD_TYPE=nvblas
CC=gcc
CCC=c99
CFLAGS=-fPIC -DHAS_CPUID
F77=gfortran
FOUND_JAVA=true
FOUND_NM=true
INCDIRS=-Iinclude -I/usr/lib/jvm/java-11-openjdk-amd64/include -I/usr/lib/jvm/java-11-openjdk-amd64/include/linux
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
LAPACK_HOME=./lapack-lite-3.1.1
LD=gcc
LDFLAGS=-shared
LIB=lib
LINKAGE_TYPE=static
LOADLIBES=-Wl,-z,muldefs /home/linyi/jblas/lapack-lite-3.1.1/lapack_LINUX.a /usr/local/cuda-11.0/lib64/libnvblas.so.11 /home/linyi/jblas/lapack-lite-3.1.1/blas_LINUX.a -lgfortran
MAKE=make
NM=nm
OS_ARCH=amd64
OS_ARCH_WITH_FLAVOR=amd64/sse3
OS_NAME=Linux
RUBY=ruby
SO=so

I then ran the commands make clean all and mvn clean package as documented here. The tests pass successfully, but the program results in a segmentation fault on exit.

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.jblas.TestEigen
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/home/linyi/nvblas.conf'
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.569 sec
Running org.jblas.TestComplexFloat
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestDecompose
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.jblas.TestBlasDouble
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.jblas.TestBlasDoubleComplex
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestSingular
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.jblas.TestDoubleMatrix
Tests run: 37, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.jblas.TestSolve
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestBlasFloat
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestFloatMatrix
Tests run: 37, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.jblas.SimpleBlasTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.jblas.ranges.RangeTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.jblas.TestGeometry
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.ComplexDoubleMatrixTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334/libjblas.so
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334/libjblas_arch_flavor.so
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fa1f6bb96b1, pid=8063, tid=8072
#
# JRE version: OpenJDK Runtime Environment (11.0.8+10) (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libcublas.so.11+0xa096b1]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/linyi/jblas/jblas/core.8063)
#
# An error report file with more information is saved as:
# /home/linyi/jblas/jblas/hs_err_pid8063.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#
Aborted (core dumped)

Results :

Tests run: 122, Failures: 0, Errors: 0, Skipped: 0

I decided to run mvn clean package -DskipTests as it seemed that the tests were passing fine, just that the program causes a segmentation fault on termination. When I used the libary in my Java project however, nvblas.log reveals that although NVBLAS was intercepting the calls to the BLAS routines, they were in fact being executed on the CPU instead of the GPU. Running nvprof --print-gpu-summary with my program also resulted in the same conclusion.

#
==7711== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  1.8240us         1  1.8240us  1.8240us  1.8240us  [CUDA memcpy HtoD]
======== Error: Application received signal 134

And the contents of nvblas.log are as follows:

[NVBLAS] Using devices :0
[NVBLAS] Config parsed
[NVBLAS] dgemm[cpu]: ta=N, tb=N, m=1, n=1, k=1
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=24, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=32, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=26, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=22, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=20, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=26, k=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=60, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=60, n=31
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=22, k=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=60, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=54, n=22
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=54, n=22, k=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=60, n=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=52, n=20
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=54, n=22
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=60, n=28, k=31
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=52, n=20, k=31
[NVBLAS] dgemm[cpu]: ta=N, tb=T, m=31, n=54, k=22
. . .

I am really at a loss as to what to do, I hope someone can offer any advice, this seems to be really poorly documented.

答案1

得分: 0

我相信我已经找到答案。

nm libjblas.so | grep -is dgemm
         U cblas_dgemm
         U dgemm_@@libnvblas.so.11

这表明库已经成功链接。然后我继续运行了jBlas的内置基准测试,只需运行java -jar jblas.jar,其中jblas.jar是编译好的库,显然只有在矩阵很大时才会进行GPU卸载,因为当n=10n=100时没有进行任何GPU调用,但nvblas.logn=1000时记录了GPU计算。对我来说,这非常令人困惑,希望这能帮助其他遇到这个问题的人。

英文:

I believe I have figured it out.

nm libjblas.so | grep -is dgemm
         U cblas_dgemm
         U dgemm_@@libnvblas.so.11

This shows that the libary is indeed linking correctly. I then proceeded to run the built in benchmark of jBlas, by just running java -jar jblas.jar where jblas.jar is the compiled library, and apparently GPU offload only occurs for large matrices, as no gpu calls were made when n=10 or n=100 but nvblas.log logged GPU computation at n=1000. This was very confusing for me, I hope this helps anyone else struggling with this issue.

huangapple
  • 本文由 发表于 2020年8月18日 09:15:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/63460550.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定