torch FFT on a GPU is much slower than FFT on the CPU
Question
I'm running the following simple code on a powerful server with several Nvidia RTX A5000/6000 GPUs and CUDA 11.8. For some reason, the FFT on the GPU is much slower than on the CPU (200-800 times slower). Does anyone have an idea why that might be? I tried different GPUs, but the results stay approximately the same.
import sigpy as sp
import torch
import time
arr = sp.shepp_logan((256, 256))
device = "cpu"
arr = torch.from_numpy(arr).to(device)
tic = time.perf_counter()
res = torch.fft.fft2(arr, dim=(-2, -1))
toc = time.perf_counter()
cpu_time = toc - tic
device = "cuda:5"
arr = arr.to(device)
tic = time.perf_counter()
res = torch.fft.fft2(arr, dim=(-2, -1))
toc = time.perf_counter()
gpu_time = toc - tic
print(f"CPU time: {cpu_time}, GPU time: {gpu_time} ratio: {gpu_time / cpu_time}")
Thanks!
Answer 1
Score: 1
Okay, after digging a little deeper: timing a single call is not the right way to compare compute time, and for a fair comparison we need to average over many runs. After doing that, I see that the GPU version is indeed faster (more noticeably for larger inputs). So there seems to be some "warm-up" time for the GPU (though I didn't expect such a big difference for a single test point). I'd love to hear if anyone has an explanation for why this is happening!
import numpy as np
import time
import torch

IM_SIZE = 512
BATCH_SIZE = 8
N_TEST = 10000
RAND = 100

# Pool of random images to cycle through
arrs = np.random.randn(RAND, IM_SIZE, IM_SIZE)
arrs = torch.from_numpy(arrs)

# CPU: average the run time over N_TEST iterations
device = "cpu"
tic = time.perf_counter()
for i in range(N_TEST):
    arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
    arr = arr.to(device)
    res = torch.fft.fft2(arr, dim=(-2, -1))
toc = time.perf_counter()
cpu_time = (toc - tic) / N_TEST

# GPU: same loop; note that the timing includes the host-to-device copy
device = "cuda:5"
tic = time.perf_counter()
for i in range(N_TEST):
    arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
    arr = arr.to(device)
    res = torch.fft.fft2(arr, dim=(-2, -1))
# CUDA work is queued asynchronously; wait for it to finish before stopping the clock
torch.cuda.synchronize(device)
toc = time.perf_counter()
gpu_time = (toc - tic) / N_TEST

print(f"CPU time: {cpu_time * 1000} ms, GPU time: {gpu_time * 1000} ms ratio: {gpu_time / cpu_time}")
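On the "warm-up" question, one likely explanation (not confirmed in this thread) is that the first FFT call on a CUDA device pays one-time setup costs, such as library initialization and cuFFT plan creation, and that CUDA kernels launch asynchronously, so the clock has to be read after torch.cuda.synchronize(). Below is a minimal sketch that times the first call separately from the steady state. The names benchmark and compare_fft, the tensor shape, and the iteration counts are made up for illustration and are not from the original post.

```python
import time


def benchmark(fn, sync=None, n_warmup=10, n_iter=100):
    """Return (first_call_seconds, steady_state_seconds_per_call) for fn().

    sync is an optional zero-argument callable that blocks until queued
    device work has finished (e.g. torch.cuda.synchronize for CUDA).
    """
    if sync is None:
        sync = lambda: None  # nothing to wait for on the CPU

    # Time the very first call on its own: it may include one-time setup.
    sync()
    tic = time.perf_counter()
    fn()
    sync()
    first = time.perf_counter() - tic

    # Untimed warm-up calls so caches and plans are in place.
    for _ in range(n_warmup):
        fn()
    sync()

    # Steady state: average over n_iter calls, waiting for the device at the end.
    tic = time.perf_counter()
    for _ in range(n_iter):
        fn()
    sync()
    steady = (time.perf_counter() - tic) / n_iter
    return first, steady


def compare_fft(shape=(8, 512, 512)):
    """Hypothetical driver: compare torch.fft.fft2 on the CPU and, if present, on CUDA."""
    import torch  # imported here so benchmark() itself has no torch dependency

    x = torch.randn(*shape)
    first, steady = benchmark(lambda: torch.fft.fft2(x, dim=(-2, -1)))
    print(f"cpu:  first call {first * 1e3:.3f} ms, steady state {steady * 1e3:.3f} ms")

    if torch.cuda.is_available():
        xg = x.to("cuda")  # copy once, outside the timed region
        first, steady = benchmark(
            lambda: torch.fft.fft2(xg, dim=(-2, -1)),
            sync=torch.cuda.synchronize,
        )
        print(f"cuda: first call {first * 1e3:.3f} ms, steady state {steady * 1e3:.3f} ms")
```

Calling compare_fft() prints both numbers per device. If the first-call time on CUDA is far larger than the steady-state time, that would be consistent with the one-time setup explanation. Unlike the loops above, this sketch keeps the host-to-device copy outside the timed region, so it measures only the FFT itself.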
</details>