torch fft with a GPU is much slower than fft with CPU

Question

I'm running the following simple code on a powerful server with several Nvidia RTX A5000/6000 GPUs and CUDA 11.8. For some reason, the FFT on the GPU is much slower than on the CPU (200-800 times slower). Does anyone have an idea why that might be? I tried different GPUs, but the results stay roughly the same.

    import sigpy as sp
    import torch
    import time

    arr = sp.shepp_logan((256, 256))
    device = "cpu"
    arr = torch.from_numpy(arr).to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = toc - tic
    device = "cuda:5"
    arr = arr.to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    gpu_time = toc - tic
    print(f"CPU time: {cpu_time}, GPU time: {gpu_time} ratio: {gpu_time / cpu_time}")

Thanks!

Answer 1

Score: 1

Okay, after digging a little deeper: timing a single call is not the right way to compare compute time. For a fair comparison, the measurement needs to be averaged over many runs. After doing that, I see that the GPU version is indeed faster (more noticeably for larger inputs). So there seems to be some "warm-up" time for the GPU, though I didn't expect such a big difference for a single test point. I'd love to hear if anyone has an explanation for why this happens!

    import numpy as np
    import time
    import torch

    IM_SIZE = 512
    BATCH_SIZE = 8
    N_TEST = 10000  # number of timed iterations to average over
    RAND = 100      # number of distinct random images to cycle through

    arrs = np.random.randn(RAND, IM_SIZE, IM_SIZE)
    arrs = torch.from_numpy(arrs)

    # CPU: average over N_TEST batched 2D FFTs
    device = "cpu"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = (toc - tic) / N_TEST

    # GPU: same loop; the timing includes the host-to-device copy
    device = "cuda:5"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    # CUDA kernels run asynchronously: wait for them before stopping the timer
    torch.cuda.synchronize(device)
    toc = time.perf_counter()
    gpu_time = (toc - tic) / N_TEST
    print(f"CPU time: {cpu_time * 1000} ms, GPU time: {gpu_time * 1000} ms ratio: {gpu_time / cpu_time}")
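
A likely explanation for the warm-up, offered as a hedged sketch and not as a measured diagnosis of the setup above: PyTorch initializes CUDA lazily, so the first GPU calls pay one-off costs such as context creation and cuFFT plan setup. In addition, CUDA kernels are launched asynchronously, so a timer can stop before the GPU has finished unless `torch.cuda.synchronize()` is called. The helper `time_fft2` and its parameters `n_warmup` and `n_iter` are names made up for this example. It uses plain `"cuda"` instead of `"cuda:5"` and skips the GPU part when no GPU is available.

```python
import time
import torch


def time_fft2(x, n_warmup=3, n_iter=20):
    """Return the mean time in seconds of one torch.fft.fft2 call on x's device."""
    is_cuda = x.device.type == "cuda"

    # Warm-up: run a few untimed calls so one-off setup costs are excluded.
    for _ in range(n_warmup):
        torch.fft.fft2(x, dim=(-2, -1))
    if is_cuda:
        # Make sure the warm-up work has finished before starting the timer.
        torch.cuda.synchronize(x.device)

    tic = time.perf_counter()
    for _ in range(n_iter):
        torch.fft.fft2(x, dim=(-2, -1))
    if is_cuda:
        # Kernel launches are asynchronous: wait for the GPU to finish.
        torch.cuda.synchronize(x.device)
    toc = time.perf_counter()
    return (toc - tic) / n_iter


x = torch.randn(8, 512, 512, dtype=torch.float64)
cpu_time = time_fft2(x)
print(f"CPU time: {cpu_time * 1000:.3f} ms")

if torch.cuda.is_available():
    # Copy once, outside the timed region, so only the FFT is measured.
    gpu_time = time_fft2(x.to("cuda"))
    print(f"GPU time: {gpu_time * 1000:.3f} ms ratio: {gpu_time / cpu_time:.3f}")
```

Unlike the loop above, this keeps the host-to-device copy out of the timed region, so it measures the FFT alone. If the transfer is part of the real workload, it should be timed as well.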

huangapple
  • Published on June 9, 2023, 06:26:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76436084.html