将Python脚本转换为在GPU(CUDA)上运行。

huangapple go评论76阅读模式
英文:

Converting a python script to be run on a GPU (CUDA)

问题

I'm trying to get the following code to run on my RTX 3080 instead of my CPU:

import decimal
import numpy as np
from multiprocessing import Pool

def can_root(x):
    for i in range(2, (x // 2) + 1):
        y = float(round(decimal.Decimal(x ** (1 / i)), 20))
        if y.is_integer():
            y = int(y)
            a = [i]
            while True:
                go_again = 0
                for p in range(2, (y // 2) + 1):
                    go_again = 0
                    z = round(decimal.Decimal(y ** (1 / p)), 4)
                    z = float(z)
                    if z.is_integer():
                        z = int(z)
                        y = z
                        a.append(p)
                        go_again = 1
                        break
                if go_again == 1:
                    continue
                break
            y = int(y)
            power = 1
            for value in a:
                power *= value
            return x, y, power
    return None

def main():
    data = []
    pool = Pool(32)
    for result in pool.map(can_root, range(100000000, 999999999)):
        if result is not None:
            data.append(result)
    pool.close()
    pool.join()

    np.savez_compressed('data.npz', dta=data, allow_pickle=False)  # for portability

    loadback = np.load('data.npz')['dta']
    print(loadback)

if __name__ == "__main__":
    main()

Even with 32 threads to run this, it would take years (I haven't done the math so don't quote me on that but it's a while). I'm hoping that it would run much faster on a GPU than a CPU based on the repetitive nature of the script. However, I've been having some trouble with converting it. I haven't had any experience with CUDA at all nor converting Python to CUDA so I'm walking in blind. I have tried enlisting the help of ChatGPT and Bard; however, the memes are right, 5 minutes of coding and 5 years of debugging. So far I've tried using this Py2CUDA; however, I can't find any documentation, and it keeps throwing a lot of errors with my code, and NUMBA. However, I haven't been able to integrate it that well with my code, but if someone can tell if it can and I just missed something great! There are 3 things I'm really looking for, and I hope someone can help.

  1. Will it actually perform better on a GPU?
  2. Can I just add some decorators to my code and with a good enough library poof it works on a GPU, or will I basically have to rewrite every line of the script?
  3. I haven't had much luck finding a good tutorial or guide regarding Python to CUDA, so if anyone knows a good one, that would be much appreciated.
英文:

I'm trying to get the following code to run on my RTX 3080 instead of my CPU:

import decimal
import numpy as np
from multiprocessing import Pool


def can_root(x):
    for i in range(2, (x // 2) + 1):
        y = float(round(decimal.Decimal(x ** (1 / i)), 20))
        if y.is_integer():
            y = int(y)
            a = [i]
            while True:
                go_again = 0
                for p in range(2, (y // 2) + 1):
                    go_again = 0
                    z = round(decimal.Decimal(y ** (1 / p)), 4)
                    z = float(z)
                    if z.is_integer():
                        z = int(z)
                        y = z
                        a.append(p)
                        go_again = 1
                        break
                if go_again == 1:
                    continue
                break
            y = int(y)
            power = 1
            for value in a:
                power *= value
            return x, y, power
    return None


def main():
    data = []
    pool = Pool(32)
    for result in pool.map(can_root, range(100000000, 999999999)):
        if result is not None:
            data.append(result)
    pool.close()
    pool.join()

    np.savez_compressed('data.npz', dta=data, allow_pickle=False)  # for portability

    loadback = np.load('data.npz')['dta']
    print(loadback)


if __name__ == "__main__":
    main()

Even with 32 threads to run this, it would take years (I haven't done the math so don't quote me on that but it's a while). I'm hoping that it would run much faster on a GPU than a CPU based on the repetitive nature of the script. However, I've been having some trouble with converting it. I haven't had any experience with CUDA at all nor converting Python to CUDA so I'm walking in blind. I have tried enlisting the help of Chatgpt and Bard however the memes are right, 5 minutes of coding and 5 years of debugging. So far I've tried using this Py2CUDA however I can't find any documentation and it keeps throwing a lot of errors with my code, and NUMBA However I haven't been able to integrate it that well with my code but if someone can tell if it can and I just missed something great! There are 3 things I'm really looking for and I hope someone can help.

  1. Will it actually perform better on a GPU?
  2. Can I just add some decorators to my code and with a good enough library poof it works on a GPU, or will I basically have to rewrite every line of the script
  3. I haven't had much luck finding a good tutorial or guide regarding Python to CUDA so if anyone knows a good one that would be much appreciated.

答案1

得分: 2

从你拥有的东西直接跳到使用GPU来加速代码是一个错误。第一步是实际优化算法。你的算法过于复杂,这使它变得很慢(见下面的数字)。

你的问题是找到 ab,其中 a^b = n。最小的 a 可以是2,最大的是 sqrt(n)(因为 b 不能小于2)。然后,你可以利用对数来重新表述问题,即找到哪个底数对数(即 a )的 n 返回一个整数结果。遍历所有可能的 a 值,你检查结果是否是整数(由于浮点运算,我编写了检查以基于一些设置的容差)。

import math

def can_root(n, tol=1e-10):
    for candidate in range(2, math.floor(math.sqrt(n))+1):
        b = math.log(n, candidate)
        if abs(b - round(b)) < tol:
            b = int(round(b))
            a = int(round(n**(1/b)))
            return n, a, b
    return None

results = []
for n in range(2, 100000):
    result = can_root(n)
    if result is not None:
        results.append(result)

这是一个固有的计算负担较重的问题,所以我的代码对非常大的数字仍然很慢。尽管如此,在上述范围(2-99999,包括在内)上进行测试,在我的机器上花费了 3.47 秒,而你的代码花费了 37.2 分钟

英文:

Jumping straight from what you have to using a GPU to speed up your code is a mistake. The first step is to actually optimize the algorithm. Your algorithm is overly complicated, which makes it slow (see the numbers below).

Your problem is to find a and b where a^b = n. The smallest a can be is 2 and the largest is sqrt(n) (since b cannot be smaller than 2). You can then make use of logarithms to rewrite the problem as looking for what base logarithm (i.e. a) of n returns an integer result. Looping through all the possible values of a, you check if the result is an integer (because of floating point arithmetic, I wrote the check to be based on some set tolerance).

import math

def can_root(n, tol=1e-10):
    for candidate in range(2, math.floor(math.sqrt(n))+1):
        b = math.log(n, candidate)
        if abs(b - round(b)) < tol:
            b = int(round(b))
            a = int(round(n**(1/b)))
            return n, a, b
    return None

results = []
for n in range(2, 100000):
    result = can_root(n)
    if result is not None:
        results.append(result)

This is an inherently computationally expensive problem, so my code is still slow for very large numbers. That said, testing it on the above range (2-99999, inclusive), took 3.47 seconds on my machine, compared to your code which took 37.2 minutes.

答案2

得分: 2

这是CUDA中使用@jared算法的1个解决方案,使用更好的算法的1个解决方案以及最佳解决方案。我总共实现了4个版本来展示更好的算法 > 更好的硬件(你要求的第二种)。

寻找res = [can_root(n) for n in range(0, 10^9]的时间成本为:

  • 您的原始代码:约7000年
  • @jared的答案:约40天
  • @jared的C++算法(can_root_cpu):约3.3天
  • @jared的CUDA算法(can_root):2080ti上约50秒,3080上可能更快
  • 更好的算法(can_root_cpu_sieve):创建埃拉托斯特尼筛的18秒,计算can_root的19秒 => 总共37秒
  • 最佳算法(不使用can_root):0.1秒。用于测试的代码实际上非常慢,需要2秒,因为它需要写入大小为B的数组,而不仅仅是收集所需的值。

@jared的算法的时间复杂度为O(N * sqrt(N))。对于N = 10^5,他的Python代码需要3.47秒。因此,对于N = 10^9,需要3.47秒 * (10^9 / 10^5) * sqrt(10^9 / 10^5) = 40天。您的代码的时间复杂度为O(N^2)

筛法算法的时间复杂度约为O(N * log(log(N)),而最佳解决方案的成本为O(sqrt(B - A) * log(log(sqrt(B - A)))。我将为您提供一个成本为O(sqrt(B) * log(log(sqrt(B)))的解决方案,因为这些思想类似。

对于GPU上的@jared算法,我们需要一些技巧:

  • 在游戏卡上,double == fp64非常慢。在2080ti上,FP32性能为13.45 TFLOP,而FP64性能为0.42 TFLOP,比率为1:32。
  • 因此,我们必须使用float。但它的精度较低,我们将得到许多错误答案(我已经测试过了)。
  • 因此,我们不仅要检查if abs(b - round(b)) < tol:,还要检查candiate^b == n 使用整数。然后它就是正确的。大多数情况下,abs(b - round(b)) >= tol,因此这个额外的检查不会增加明显的成本。
  • 如果n = a^b是一个偶数,a必须是一个偶数。当n是奇数时也是如此。因此,我们只需要循环遍历偶数或奇数。这节省了50%的时间成本。

我的改进的can_root_cpu_sieve使用以下思想:

  • N可以分解为形式为N = np.prod([prime[k] ^ expo[k] for k in range(K)])的素数数组长度K。例如,18 = 3^2 * 2^136 = 3^2 * 2^2
  • 如果a^b = N,则expo[k] % b == 0 for k in range(K)。同时我们只接受b > 1
  • b最大时,a最小 -> b = gcd(expo[:])a = np.prod([prime[k] ^ (expo[k] / b) for k in range(K)])
  • 要快速找到一个数的素数因子,我们需要初始化一个埃拉托斯特尼筛。在我的代码中,sieve[N] = N的最大素数因子。要分解,做类似于while (sieve[N] != 1) N /= sieve[N];

最佳解决方案也是最简单的。a^b = Nb >= 2N <= 10^9意味着a <= sqrt(10^9)

我们可以遍历所有可能的ab并计算a^b

for (int a = sqrt(B) + 1; a >= 2; a--) {
    int64_t n = a;
    for (int b = 2; b <= 30; b++) {
      n *= a;
      if (n >= B) break;
      ...
}

内循环平均耗时约为log(log(sqrt(B))),因此总时间成本为sqrt(B) * log(log(sqrt(B)))


下面的程序计算了res = [can_root(n) for n in range(A, B],同时使用CPU和GPU,并比较它们的结果以确保正确性。它还测量了运行时间。您可以将can_root_cpu_sieve替换为can_root_cpu,以确认所有4个版本都给出了相同的结果。

#include &lt;cuda_runtime.h&gt;
#include &lt;iostream&gt;
#include &lt;chrono&gt;
#include &lt;cmath&gt;
#include &lt;string&gt;
#include &lt;unordered_map&gt;
#include &lt;vector&gt;
#include &lt;algorithm&gt;
using std::cout;

class MyTimer {
    std::chrono::time_point&lt;std::chrono::system_clock&gt; start;

public:
    void startCounter() {
        start = std::chrono::system_clock::now();
    }

    int64_t getCounterNs() {
        return std::chrono::duration_cast&lt;std::chrono::nanoseconds&gt;(std::chrono::system_clock::now() - start).count();
    }

    int64_t getCounterMs() {
        return std::chrono::duration_cast&lt;std::chrono::milliseconds&gt;(std::chrono::system_clock::now() - start).count();
    }

    double getCounterMsPrecise()

<details>
<summary>英文:</summary>

Here&#39;s 1 solution in CUDA using @jared algorithm, 1 solution using a better algorithm, and the best solution. I implemented **4 versions** total to show better algorithm &gt; better hardware (you asked for the 2nd).

Time cost to find `res = [can_root(n) for n in range(0, 10^9]` are:

- Your original code: ~7000 years
- @jared answer: ~40 days
- @jared algo using C++ (`can_root_cpu`): ~3.3 days
- @jared algo using CUDA (`can_root`): **50 seconds on 2080ti, probably much faster on 3080**
- Better algorithm (`can_root_cpu_sieve`): **18 seconds for creating Sieve of Eratosthenes, 19 seconds for can_root -&gt; 37s total**
- Best algorithm (don&#39;t use `can_root`): **0.1 second**. The code used for testing is actually really slow and takes 2 seconds because it requires writing to an array size `B`, instead of just collecting the required values.

The algorithm by @jared has cost `O(N * sqrt(N))`. With `N = 10^5`, his Python code takes `3.47 second`. So with `N = 10^9`, it&#39;ll take `3.47 second * (10^9 / 10^5) * sqrt(10^9 / 10^5) = 40 days`. Your code has time complexity `O(N^2)`. 

Sieve algo has time complexity around `O(N * log(log(N))`, and the best solution has cost `O(sqrt(B - A) * log(log(sqrt(B - A))`. I&#39;ll give you a solution with cost `O(sqrt(B) * log(log(sqrt(B)))`, since the ideas are similar.

For @jared algorithm on GPU, we need a few tricks:

- `double == fp64` is EXTREMELY slow on gaming cards. On 2080ti, FP32 performance is  13.45 TFLOP; while FP64 performance is 0.42 TFLOP -&gt; 1:32 ratio
- So, we have to use `float`. But it has low precision, and we&#39;ll get a lot of wrong answers (I&#39;ve tested) with this algorithm.
- So instead of just checking `if abs(b - round(b)) &lt; tol:`, we also check `candiate^b == n` **using integers**. Then it&#39;ll be correct. Most of the time `abs(b - round(b)) &gt;= tol`, so this extra check doesn&#39;t add noticeable cost. 
- If `n = a^b` is an even number, `a` must be an even number. Similar when `n` is odd. So, we only need to loop over either even or odd numbers. This save 50% of the time cost.

-------------------
My improved `can_root_cpu_sieve` uses the following ideas:
- `N` can be factored into array of prime numbers length K with the form: `N = np.prod([prime[k] ^ expo[k] for k in range(K)]`. For example, `18 = 3^2 * 2^1`, `36 = 3^2 * 2^2`.
- If `a^b = N`, then `expo[k] % b == 0 for k in range(K)`. Also we only accept `b &gt; 1`
- `a` will be smallest when `b` is largest -&gt; `b = gcd(expo[:])`, and `a = np.prod([prime[k] ^ (expo[k] / b) for k in range(K)]`
- To quickly find prime factors of a number, we need to initialize a Sieve of Eratosthenes. In my code, `sieve[N] = largest prime factor of N`. To factor, do something like `while (sieve[N] != 1) N /= sieve[N];`

-----------------------

The best solution is also the simplest. `a^b = N`, `b &gt;= 2`, `N &lt;= 10^9` means `a &lt;= sqrt(10^9)`.

We can just loop over all possible `a` and `b` and compute `a^b`.

for (int a = sqrt(B) + 1; a >= 2; a--) {
int64_t n = a;
for (int b = 2; b <= 30; b++) {
n *= a;
if (n >= B) break;
...
}

The inner loop cost around `log(log(sqrt(B)))` on average, so the total time cost is `sqrt(B) * log(log(sqrt(B)))`
-----------------------
The program below computes `res = [can_root(n) for n in range(A, B]` using both CPU and GPU, and compares their results to make sure it&#39;s correct. It also measures run time. You can replace `can_root_cpu_sieve` with `can_root_cpu` to confirm that all 4 versions give the same results.

#include <cuda_runtime.h>
#include <iostream>
#include <chrono>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>
#include <algorithm>
using std::cout;

class MyTimer {
std::chrono::time_point<std::chrono::system_clock> start;

public:
void startCounter() {
start = std::chrono::system_clock::now();
}

int64_t getCounterNs() {
return std::chrono::duration_cast&lt;std::chrono::nanoseconds&gt;(std::chrono::system_clock::now() - start).count();
}
int64_t getCounterMs() {
return std::chrono::duration_cast&lt;std::chrono::milliseconds&gt;(std::chrono::system_clock::now() - start).count();
}
double getCounterMsPrecise() {
return std::chrono::duration_cast&lt;std::chrono::nanoseconds&gt;(std::chrono::system_clock::now() - start).count()
/ 1000000.0;
}

};

host device
int intpow(int x, int n) {
int res = 1;
int mult = x;
while (n) {
if (n & 1) res *= mult;
mult = mult * mult;
n >>= 1;
}
return res;
}

void can_root_cpu(int *res, const int A, const int B, float eps_big = 1e-7, float eps_small = 1e-10)
{
for (int n = A; n < B; n++) {
int idx = 2 * (n - A);
res[idx] = 0;
res[idx + 1] = 0;

int lim = round(sqrt(n));
for (int candidate = 2; candidate &lt;= lim; candidate++) {
double b = log(n) / log(candidate);
double diff = fabs(b - round(b));
if (diff &lt; eps_small) {
res[idx + 1] = round(b);
res[idx] = candidate;
break;
} else if (diff &lt; eps_big) {
// in case the difference is small but not tiny, we check using int.
// This is because float might have precision issue
int bint = round(b);
if (intpow(candidate, bint) == n) {
res[idx + 1] = bint;
res[idx] = candidate;
break;
}
}
}

}
}

int gcd(int a, int b) {
while (b) {
int temp = b;
b = a % b;
a = temp;
}
return a;
}

void can_root_cpu_sieve(int* restrict res, const int A, const int B,
const int* restrict sieve,
float eps = 1e-10)
{
std::vector<std::pair<int,int>> factors;
factors.reserve(64);

for (int n = A; n < B; n++) {
int idx = 2 * (n - A);
res[idx] = 0;
res[idx + 1] = 0;

factors.clear();    
int N = n;
int prime_factor_gcd = 0;
while (N != 1) {
const int K = sieve[N];
int expo = 0;
if (K &gt; 1) {
while (N % K == 0) {
N /= K;
expo++;
}
} else {
prime_factor_gcd = 1;
break;
}      
if (prime_factor_gcd == 0) prime_factor_gcd = expo;
else prime_factor_gcd = gcd(prime_factor_gcd, expo);
if (prime_factor_gcd == 1) break;
factors.emplace_back(K, expo);
}
if (prime_factor_gcd &lt;= 1) continue;
int base = 1;
for (const auto &amp;data : factors)
base *= intpow(data.first, data.second / prime_factor_gcd);
res[idx] = base;
res[idx + 1] = prime_factor_gcd;        

}
}

void gen_can_root(int* restrict res, const int A, const int B) {
int lim = round(sqrt(B)) + 1;

for (int a = lim; a >= 2; a--) {
int64_t n = a;
for (int b = 2; b <= 30; b++) {
n *= a;
if (n >= B) break;

  res[2 * (n - A)] = a;
res[2 * (n - A) + 1] = b;
}

}
}

//--------------------

global
void can_root(int *res, const int A, const int B, float eps = 1e-4)
{
const int start = blockIdx.x * blockDim.x + threadIdx.x;
const int stride = blockDim.x * gridDim.x;

for (int n = A + start; n < B; n += stride) {
int idx = 2 * (n - A);
res[idx] = 0;
res[idx + 1] = 0;

int lim = roundf(sqrtf(n));
const int start_candidate = (n % 2 == 0) ? 2 : 3;
for (int candidate = start_candidate; candidate &lt;= lim; candidate += 2) {
float b = logf(n) / logf(candidate);
if (fabsf(b - roundf(b)) &lt; eps) {
int bint = lroundf(b);
if (intpow(candidate, bint) == n) {
res[idx + 1] = bint;
res[idx] = candidate;
break;
}
}
}

}
}

int main(int argc, char* argv[])
{
int A = 2;
int B = 1'000'000;
bool test_sieve = 0;

if (argc == 2) {
B = std::stoi(argv1);
}
if (argc >= 3) {
A = std::stoi(argv1);
B = std::stoi(argv2);
}
if (argc >= 4) {
test_sieve = 1;
}

//--------------
MyTimer timer;
int* res0;
int* res1;

timer.startCounter();
cudaMallocManaged(&res0, (B - A) * 2 * sizeof(int));
res1 = new int[(B - A) * 2 * sizeof(int)];
cudaMemsetAsync(res0, 0, (B - A) * 2 * sizeof(int), 0);
memset(res1, 0, (B - A) * 2 * sizeof(int));
cout << "Allocate memory = " << timer.getCounterMsPrecise() << "\n";

int* sieve;
int sieve_size = B;
if (test_sieve) {
timer.startCounter();
sieve = new int[sieve_size];
for (int i = 0; i < sieve_size; i++) sieve[i] = 1;
sieve[0] = 0;
sieve1 = 1;

for (int i = 2; i &lt; sieve_size; i++) {
if (sieve[i] &gt; 1) continue;
// Normally it&#39;s &quot;j = i * i&quot; because it&#39;s faster.
// But &quot;j = 2 * i&quot; will give sorted prime factors
for (int j = 2 * i; j &lt; sieve_size; j += i) {
sieve[j] = i;
}
}
cout &lt;&lt; &quot;sieve cost = &quot; &lt;&lt; timer.getCounterMsPrecise() &lt;&lt; &quot;\n&quot;;

}

int ntest = 5;
int wrong = 0;
double total_cost2 = {0};
for (int t = 0; t <= ntest; t++) {
cudaDeviceSynchronize();
timer.startCounter();
can_root<<<1024,512>>>(res0, A, B);
cudaDeviceSynchronize();
double cost0 = timer.getCounterMsPrecise();
total_cost[0] += cost0;

timer.startCounter();
//can_root_cpu(res1, A, B);
if (test_sieve) can_root_cpu_sieve(res1, A, B, sieve);
else gen_can_root(res1, A, B);
double cost1 = timer.getCounterMsPrecise();
total_cost[1] += cost1;
cout &lt;&lt; &quot;cost = &quot; &lt;&lt; cost0 &lt;&lt; &quot; &quot; &lt;&lt; cost1 &lt;&lt; &quot;\n&quot;;
for (int n = A; n &lt; B; n++) {
int idx = 2 * (n - A);
if (res0[idx] != res1[idx] || res0[idx + 1] != res1[idx + 1]) {
cout &lt;&lt; &quot;ERROR &quot; &lt;&lt; n &lt;&lt; &quot; &quot; &lt;&lt; res0[idx] &lt;&lt; &quot; &quot; &lt;&lt; res0[idx + 1] &lt;&lt; &quot; &quot; &lt;&lt; res1[idx] &lt;&lt; &quot; &quot; &lt;&lt; res1[idx + 1] &lt;&lt; std::endl;
wrong++;
if (wrong &gt;= 10) exit(1);
}
}
cudaMemPrefetchAsync(res0, (B - A) * 2 * sizeof(int), 0, 0);

}

if (wrong == 0) {
cout << "NO ERROR" << std::endl;
}

return 0;
}


Use the script `./run.sh A B test_sieve`

nvcc -o main can_root.cu -O3 -std=c++17
./main $1 $2 $3

such as

./run.sh 0 1000000 0 # compare GPU vs optimal solution, B = 10^6
./run.sh 2 1000000 1 # compare GPU vs sieve solution, A = 2, B = 10^6
./run.sh 0 1000000000 0 # GPU vs optimal, B = 10^9


**Note:** So we have reduced the time cost from 7000 years to ~37 seconds, just by changing the algorithm (and language). And reduced it to ~0.1 second, by rephrasing the problem. Using GPU isn&#39;t enough to make up for the difference in big-O time cost.
It&#39;s possible to optimize this further, but I&#39;ll leave that as an exercise.
</details>

huangapple
  • 本文由 发表于 2023年6月25日 21:36:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/76550677.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定