Using jBLAS with NVBLAS
Question
I have compiled jBLAS with NVBLAS using a somewhat hacky solution, since the configure script was not finding the libraries correctly. I manually edited jBLAS's configure.out file as follows to include the NVBLAS libraries:
BUILD_TYPE=nvblas
CC=gcc
CCC=c99
CFLAGS=-fPIC -DHAS_CPUID
F77=gfortran
FOUND_JAVA=true
FOUND_NM=true
INCDIRS=-Iinclude -I/usr/lib/jvm/java-11-openjdk-amd64/include -I/usr/lib/jvm/java-11-openjdk-amd64/include/linux
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
LAPACK_HOME=./lapack-lite-3.1.1
LD=gcc
LDFLAGS=-shared
LIB=lib
LINKAGE_TYPE=static
LOADLIBES=-Wl,-z,muldefs /home/linyi/jblas/lapack-lite-3.1.1/lapack_LINUX.a /usr/local/cuda-11.0/lib64/libnvblas.so.11 /home/linyi/jblas/lapack-lite-3.1.1/blas_LINUX.a -lgfortran
MAKE=make
NM=nm
OS_ARCH=amd64
OS_ARCH_WITH_FLAVOR=amd64/sse3
OS_NAME=Linux
RUBY=ruby
SO=so
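The edited configure.out relies on link order: LOADLIBES lists libnvblas.so.11 ahead of the reference blas_LINUX.a, and -Wl,-z,muldefs makes the linker tolerate the duplicate BLAS definitions. The Java sketch below is a hypothetical helper (LoadlibesCheck is not part of jBLAS) that parses the KEY=VALUE lines and checks that ordering. The assumption that NVBLAS must come first follows from the linker resolving symbols left to right; it is consistent with the nm output in the answer, where dgemm_ binds to libnvblas.so.11.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: read configure.out-style KEY=VALUE lines and check that the
// NVBLAS shared library is listed before the reference BLAS archive in LOADLIBES.
// The "NVBLAS first" rule is an assumption based on left-to-right symbol
// resolution by the linker (with -z muldefs tolerating duplicate definitions).
final class LoadlibesCheck {

    /** Parses KEY=VALUE lines; the value is everything after the first '='. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                config.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return config;
    }

    /** Position of the first whitespace-separated token containing fragment, or -1. */
    static int tokenIndex(String value, String fragment) {
        String[] tokens = value.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].contains(fragment)) {
                return i;
            }
        }
        return -1;
    }

    /** True if libnvblas is in LOADLIBES and precedes blas_LINUX.a (when that is present). */
    static boolean nvblasBeforeReferenceBlas(Map<String, String> config) {
        String libs = config.getOrDefault("LOADLIBES", "");
        int nvblas = tokenIndex(libs, "libnvblas");
        int referenceBlas = tokenIndex(libs, "blas_LINUX.a");
        return nvblas >= 0 && (referenceBlas < 0 || nvblas < referenceBlas);
    }

    public static void main(String[] args) {
        // LOADLIBES value copied from the configure.out above.
        List<String> lines = List.of(
            "BUILD_TYPE=nvblas",
            "LOADLIBES=-Wl,-z,muldefs /home/linyi/jblas/lapack-lite-3.1.1/lapack_LINUX.a"
                + " /usr/local/cuda-11.0/lib64/libnvblas.so.11"
                + " /home/linyi/jblas/lapack-lite-3.1.1/blas_LINUX.a -lgfortran");
        System.out.println("NVBLAS before reference BLAS: " + nvblasBeforeReferenceBlas(parse(lines)));
    }
}
```

Pointing parse at the real configure.out (for example via java.nio.file.Files.readAllLines) would let a build script fail early if a regenerated file drops or reorders the NVBLAS entry.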
I then ran make clean all and mvn clean package, as documented here. All tests pass, but the JVM crashes with a segmentation fault on exit:
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running org.jblas.TestEigen
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/home/linyi/nvblas.conf'
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.569 sec
Running org.jblas.TestComplexFloat
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestDecompose
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.jblas.TestBlasDouble
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running org.jblas.TestBlasDoubleComplex
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestSingular
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running org.jblas.TestDoubleMatrix
Tests run: 37, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.022 sec
Running org.jblas.TestSolve
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestBlasFloat
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.TestFloatMatrix
Tests run: 37, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 sec
Running org.jblas.SimpleBlasTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running org.jblas.ranges.RangeTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running org.jblas.TestGeometry
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running org.jblas.ComplexDoubleMatrixTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334/libjblas.so
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334/libjblas_arch_flavor.so
-- org.jblas INFO Deleting /tmp/jblas4383455253907276334
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fa1f6bb96b1, pid=8063, tid=8072
#
# JRE version: OpenJDK Runtime Environment (11.0.8+10) (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C [libcublas.so.11+0xa096b1]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/linyi/jblas/jblas/core.8063)
#
# An error report file with more information is saved as:
# /home/linyi/jblas/jblas/hs_err_pid8063.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#
Aborted (core dumped)
Results :
Tests run: 122, Failures: 0, Errors: 0, Skipped: 0
Since the tests seemed to pass and the only problem was the segmentation fault on termination, I decided to run mvn clean package -DskipTests instead. When I used the library in my Java project, however, nvblas.log revealed that although NVBLAS was intercepting the calls to the BLAS routines, they were actually being executed on the CPU instead of the GPU. Running my program under nvprof --print-gpu-summary led to the same conclusion:
==7711== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.8240us 1 1.8240us 1.8240us 1.8240us [CUDA memcpy HtoD]
======== Error: Application received signal 134
The contents of nvblas.log are as follows:
[NVBLAS] Using devices :0
[NVBLAS] Config parsed
[NVBLAS] dgemm[cpu]: ta=N, tb=N, m=1, n=1, k=1
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=24, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=32, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=26, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=22, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=20, k=28
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=26, k=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=60, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=54, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=52, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=60, n=31
[NVBLAS] dsyr2k[cpu]: up=U, ta=N, n=22, k=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=T, di=U, m=60, n=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=54, n=22
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=54, n=22, k=31
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=60, n=28
[NVBLAS] dtrmm[cpu]: si=R, up=U, ta=N, di=U, m=52, n=20
[NVBLAS] dtrmm[cpu]: si=R, up=L, ta=T, di=N, m=54, n=22
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=60, n=28, k=31
[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=52, n=20, k=31
[NVBLAS] dgemm[cpu]: ta=N, tb=T, m=31, n=54, k=22
. . .
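The log above has one trace line per intercepted BLAS call, tagged with where the call ran. The following Java sketch is a hypothetical helper (not part of jBLAS or NVBLAS) that tallies those lines per routine and target, so a long log can be summarized at a glance. It assumes the "[NVBLAS] routine[target]:" format shown above and assumes that GPU-executed calls are tagged [gpu]; the excerpt itself only contains [cpu] entries.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: count NVBLAS trace lines per routine and per execution target.
// Assumes the "[NVBLAS] <routine>[<target>]: ..." format seen in nvblas.log, and
// assumes offloaded calls are tagged "[gpu]" (only "[cpu]" appears in the excerpt).
final class NvblasLogSummary {

    private static final Pattern TRACE =
        Pattern.compile("^\\[NVBLAS\\]\\s+(\\w+)\\[(cpu|gpu)\\]:");

    /** Maps keys such as "dgemm[cpu]" to the number of matching trace lines. */
    static Map<String, Integer> tally(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = TRACE.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(1) + "[" + m.group(2) + "]", 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to ./nvblas.log; pass the path configured by NVBLAS_LOGFILE as args[0].
        Path log = Path.of(args.length > 0 ? args[0] : "nvblas.log");
        tally(Files.readAllLines(log)).forEach((key, n) -> System.out.println(key + " " + n));
    }
}
```

Run against the excerpt above, it would list only dgemm[cpu], dsyr2k[cpu] and dtrmm[cpu] entries with no [gpu] keys, which is exactly the symptom described.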
I am really at a loss as to what to do. I hope someone can offer some advice; this seems to be really poorly documented.
Answer 1
Score: 0
I believe I have figured it out.
nm libjblas.so | grep -is dgemm
U cblas_dgemm
U dgemm_@@libnvblas.so.11
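To check this binding mechanically (for example in a build script) rather than by eye, the nm lines can be parsed. This Java sketch is a hypothetical helper; it assumes nm's usual layout for undefined symbols, as in the two lines above, and treats the text after @@ as the symbol version, which in this output names libnvblas.so.11. An unversioned entry such as cblas_dgemm carries no such tag.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: parse an undefined-symbol line printed by nm, such as
// "U dgemm_@@libnvblas.so.11", and return the version tag after '@' or '@@'.
final class NmBinding {

    // Optional leading spaces (nm pads the missing address), type "U", the symbol
    // name, then an optional "@version" or "@@version" suffix.
    private static final Pattern UNDEFINED =
        Pattern.compile("^\\s*U\\s+([^@\\s]+)(?:@@?(\\S+))?\\s*$");

    /** Version tag bound to an undefined symbol; empty if unversioned or not matching. */
    static Optional<String> versionOf(String nmLine, String symbol) {
        Matcher m = UNDEFINED.matcher(nmLine);
        if (m.matches() && m.group(1).equals(symbol)) {
            return Optional.ofNullable(m.group(2));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The two lines from the nm output above.
        System.out.println("dgemm_ -> "
            + versionOf("U dgemm_@@libnvblas.so.11", "dgemm_").orElse("(unversioned)"));
        System.out.println("cblas_dgemm -> "
            + versionOf("U cblas_dgemm", "cblas_dgemm").orElse("(unversioned)"));
    }
}
```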
The nm output above shows that the library is indeed linked correctly: dgemm_ resolves to the libnvblas.so.11 version. I then ran jBLAS's built-in benchmark by simply running java -jar jblas.jar, where jblas.jar is the compiled library. Apparently GPU offload only occurs for large matrices: no GPU calls were made at n=10 or n=100, but nvblas.log logged GPU computation at n=1000. This was very confusing for me; I hope this helps anyone else struggling with this issue.
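The size effect can be put in numbers. A real GEMM with dimensions m, n, k performs roughly 2·m·n·k floating-point operations (one multiply and one add per inner-product term). The Java sketch below, a hypothetical helper, applies that count to the benchmark sizes (assuming square n×n operands) and to dgemm trace lines from nvblas.log. The largest dgemm in the question's log excerpt (m=60, n=28, k=31) comes to 104,160 operations, against 2,000 at n=10, 2,000,000 at n=100 and 2,000,000,000 at n=1000. It does not reproduce NVBLAS's actual CPU/GPU routing rule, which this thread does not describe; it only shows how far apart the sizes are.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: estimate real GEMM work as 2*m*n*k floating-point operations
// and apply it to the benchmark sizes and to dgemm/sgemm trace lines.
final class GemmWork {

    // Matches e.g. "[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=54, n=22, k=31".
    private static final Pattern GEMM_TRACE =
        Pattern.compile("[sd]gemm\\[\\w+\\]:.*\\bm=(\\d+), n=(\\d+), k=(\\d+)");

    static long flops(long m, long n, long k) {
        return 2L * m * n * k;
    }

    /** Operation estimate for a real GEMM trace line; empty for other routines. */
    static OptionalLong flopsFromTrace(String line) {
        Matcher mt = GEMM_TRACE.matcher(line);
        if (!mt.find()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(flops(
            Long.parseLong(mt.group(1)),
            Long.parseLong(mt.group(2)),
            Long.parseLong(mt.group(3))));
    }

    public static void main(String[] args) {
        // Benchmark sizes from the answer, assuming square n x n operands.
        for (int n : new int[] {10, 100, 1000}) {
            System.out.println("n=" + n + ": " + flops(n, n, n) + " flops");
        }
        // Largest dgemm call in the question's nvblas.log excerpt.
        String trace = "[NVBLAS] dgemm[cpu]: ta=T, tb=N, m=60, n=28, k=31";
        System.out.println(trace + " -> " + flopsFromTrace(trace).getAsLong() + " flops");
    }
}
```

The gap between roughly 10^5 operations for the logged calls and 2×10^9 at n=1000 is consistent with the observation that only the largest benchmark size showed GPU activity.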