为什么 linalg.solve 不使用所有可用的线程

huangapple go评论60阅读模式
英文:

Why does linalg.solve doesnt use all available threads

问题

我想知道为什么numpy的linalg.solve函数没有使用所有可用的线程来执行计算。

我正在使用它来解决多维系统,以一种方式解决问题,应该找到一个具有27个条目的向量,n.N.N次。因此,由于它必须多次执行相同类型的计算,我认为它会利用所有可用的线程,但实际上并没有发生这种情况。它只使用16个线程,即使我更改n,已经测试了n = 300和400。我还尝试手动指定线程,来自threadpoolctl,它只能减少运行线程的数量,但性能没有变化(不知道为什么)。

测试代码如下:

import numpy as np
from numpy import *
from timeit import default_timer as timer
import math;

import os, psutil
process = psutil.Process(os.getpid())

###### Main time loop ##########################################################

n = 400
N = 300

s = timer()

k_star = np.random.rand(27*n*N*N).reshape((27,n,N,N))
T = np.random.rand(27*27*n*N*N).reshape((27,27,n,N,N))

t1 = timer() - s

print("分配k_star和T的时间 = " + str(t1))

for time in range(0,1001):

    s = timer()

    fout = np.transpose(linalg.solve(np.transpose(T),np.transpose(k_star)))

    t = timer() - s

    print("计算fout解决系统的时间 = " + str(t))

    print(time)
    
    if time%1000 == 0:
        memory = process.memory_info().rss/1000000000
        print("RAM内存占用 = " + str(round(memory,2)) + " Gb") 

对于答案可能有用的是np.__config__.show()的输出如下:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL,AVX512_ICL

我将感激任何帮助来理解发生了什么,以及如何通过使用所有可用的线程来提高运行代码的性能。提前感谢您的帮助。

英文:

I would like to know why linalg.solve from numpy isnt using all available threads to do its calculus.

I'm using it to solve for a multidimensional system, in a way that it should solve to find a vector with 27 entries n.N.N times. So as it has to do the same type of calculation many times I though that it would take advantage of all available threads but it isnt what is actually happening. It is only using 16 threads, even if I change n, have tested for n = 300 and 400. I also tryed to specify the threads manually from threadpoolctl, it only worked to decrease the number of running threads, but didnt saw difference on the performance though (don't have idea why).

The test code is

import numpy as np
from numpy import *
from timeit import default_timer as timer
import math;

import os, psutil
process = psutil.Process(os.getpid())

###### Main time loop ##########################################################

n = 400
N = 300

s = timer()

k_star = np.random.rand(27*n*N*N).reshape((27,n,N,N))
T = np.random.rand(27*27*n*N*N).reshape((27,27,n,N,N))

t1 = timer() - s

print("time to allocate k_star and T = " + str(t1))



for time in range(0,1001):

    s = timer()

    fout = np.transpose(linalg.solve(np.transpose(T),np.transpose(k_star)))

    t = timer() - s

    print("time to calculate fout solving the system = " + str(t))


    print(time)
    
    if time%1000 == 0:
        memory = process.memory_info().rss/1000000000
        print("RAM Memory occupied = " + str(round(memory,2)) + " Gb") 

May be useful to the answer, the output from np.__config__.show() is

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_KNL,AVX512_KNM,AVX512_CNL,AVX512_ICL

I would appreciate any help to understand what is happening and how I can increase the performance of the running code by using all available threads. Thanks in advance.

答案1

得分: 1

你使用了带有最后一维中数量相当少的项目的4D/5D矩阵。问题在于,Numpy当前的实现在这种情况下效率不高

实际上,linalg.solve 内部调用 solve(以及关于用例的相似 solve1),它将ND数组线性化,然后调用 dgesv 多次在顺序循环中执行。您使用的是能够使用多个线程并行计算 dgesv 的OpenBLAS实现。但是,dgesv 是LAPACK原语,而不是真正的BLAS原语。OpenBLAS主要是BLAS实现,尽管它实现了一些LAPACK原语。它不是专门为高效计算LAPACK原语而设计的。实际上,只有少数库可以做到这一点(这实际上很难做到)。问题在于Numpy调用了大量的LAPACK函数,每个 dgesv 调用的持续时间都很短。因此,OpenBLAS无法充分利用多核处理器,因为它需要花费相当长的时间(即至少几微秒)来与其他核心共享工作并等待它们完成工作。实际上,核心数越多,并行开销越大。毕竟,如果你要组织一个与100个朋友的派对,通常比与1个朋友的派对组织时间更长,如果派对持续时间为10分钟,组织时间将远大于派对时间。

对于 n = 40N = 30,这是OpenBLAS线程在我的i5-9600KF核心上的调度(只有0.5毫秒的部分,以便我们可以看到发生了什么):

为什么 linalg.solve 不使用所有可用的线程

每个彩色切片代表一个OpenBLAS线程的执行。黑色切片是空闲时间。大部分时间都是空闲的,位于核心1上的主线程比其他线程做更多的工作,因为计算工作不平衡

如果您希望在您的用例中进行快速计算,最好只使用一个OpenBLAS线程,而不是尝试像OpenBLAS默认情况下那样并行化 dgesv。尽管Numpy在理论上可以做到这一点,但Numpy目前还不具备使用多线程的功能。主要的并行函数主要是BLAS/LAPACK库的函数,而不是直接Numpy。

英文:

You use 4D/5D matrices with a pretty small amount of items in the last dimension. The thing is the current implementation of Numpy is inefficient in this case.

Indeed, linalg.solve internally calls solve (and the similar solve1 regarding the use-case) which linearize the ND array so to call dgesv many time in a sequential loop. You use the OpenBLAS implementation which is able to compute dgesv in parallel with multiple threads. However, dgesv is a LAPACK primitive and not really a BLAS one. OpenBLAS is mainly a BLAS implementation though it implement some LAPACK primitives. It is not meant to compute LAPACK primitive very efficiently. In fact only few library can do so (this is actually hard to do). Still, the problem here is that Numpy make a lot of LAPACK calls and each dgesv call only last for a very short time. Thus OpenBLAS cannot fully benefit from multicore processors since it takes a significant time (ie. at least few microseconds) to share the work with other cores and wait for them to do the work. In fact, the higher the number of core the higher the parallel overhead. After all, if you want to organize a party with 100 friends, it generally takes more time to do than with 1 friend, and if the party last for 10 minutes, the organization time will be much bigger than the party time.

With n = 40 and N = 30, here is the scheduling of the OpenBLAS threads on my i5-9600KF cores (only a portion lasting for 0.5 ms so we can see what is happening):

为什么 linalg.solve 不使用所有可用的线程

Each coloured slice represent the execution of an OpenBLAS thread. Black slices are IDLE time. Most of the time is IDLE and the main thread located on the core 1 do more work than others because the computational work is not well balanced.

If you want this computation to be fast in your use-case, this is better to use only 1 OpenBLAS thread and run different dgesv calls on multiple threads rather than trying to parallelize dgesv like OpenBLAS do by default. While Numpy could theoretically do that, Numpy is not meant to use multiple thread yet. The only parallel functions are mainly the one of BLAS/LAPACK libraries, not directly Numpy.

huangapple
  • 本文由 发表于 2023年4月7日 03:30:50
  • 转载请务必保留本文链接:https://go.coder-hub.com/75953127.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定