torch fft with a GPU is much slower than fft with CPU

Question

I'm running the following simple code on a powerful server with several Nvidia RTX A5000/6000 GPUs and CUDA 11.8. For some reason, the FFT on the GPU is much slower than on the CPU (200-800 times slower). Does anyone have an idea why that might be? I tried different GPUs, but the results stay approximately the same.

    import sigpy as sp
    import torch
    import time

    # 256x256 Shepp-Logan phantom used as the test image
    arr = sp.shepp_logan((256, 256))

    # Time a single 2D FFT on the CPU
    device = "cpu"
    arr = torch.from_numpy(arr).to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = toc - tic

    # Time a single 2D FFT on the GPU
    device = "cuda:5"
    arr = arr.to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    gpu_time = toc - tic

    print(f"CPU time: {cpu_time}, GPU time: {gpu_time} ratio: {gpu_time / cpu_time}")

Thanks!

Answer 1

Score: 1

Okay, after digging a little deeper: timing a single call is not the right way to compare compute time. For a fair comparison, the timing needs to be averaged over many runs. After doing that, I see that the GPU version is indeed faster (more noticeably for larger inputs). So there seems to be some "warm-up" time for the GPU, though I didn't expect such a big difference for a single test point. I'd love to hear if anyone has an explanation for why this is happening!

    import numpy as np
    import time
    import torch

    IM_SIZE = 512      # image height and width
    BATCH_SIZE = 8     # images per FFT call
    N_TEST = 10000     # number of timed iterations per device
    RAND = 100         # number of distinct random images

    arrs = np.random.randn(RAND, IM_SIZE, IM_SIZE)
    arrs = torch.from_numpy(arrs)

    # Average time per iteration on the CPU
    device = "cpu"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = (toc - tic) / N_TEST

    # Average time per iteration on the GPU (includes the host-to-device copy)
    device = "cuda:5"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    gpu_time = (toc - tic) / N_TEST

    print(f"CPU time: {cpu_time * 1000} ms, GPU time: {gpu_time * 1000} ms ratio: {gpu_time / cpu_time}")
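
As a possible follow-up to the averaging above, here is a minimal sketch that reports the first call separately from the steady-state average. It rests on two assumptions that are not stated in the original post: that the first GPU call pays one-time setup costs (for example cuFFT plan creation), and that, because CUDA kernels are launched asynchronously, the device should be synchronized before the clock is read. The helper name `time_fft2`, its parameters, and the device `cuda:0` are made up for illustration.

```python
import time

import torch


def time_fft2(arr, n_warmup=10, n_iter=100):
    """Return (first_call_seconds, steady_state_seconds) for fft2 on arr's device.

    Hypothetical helper for illustration, not part of the original post.
    """
    on_gpu = arr.is_cuda

    def sync():
        # CUDA kernels run asynchronously with respect to the host, so wait
        # for pending work before reading the clock. No-op for CPU tensors.
        if on_gpu:
            torch.cuda.synchronize(arr.device)

    # First call: may include one-time setup (assumed, e.g. cuFFT plan creation).
    sync()
    tic = time.perf_counter()
    torch.fft.fft2(arr, dim=(-2, -1))
    sync()
    first_call = time.perf_counter() - tic

    # Warm-up iterations, excluded from the measurement.
    for _ in range(n_warmup):
        torch.fft.fft2(arr, dim=(-2, -1))
    sync()

    # Steady state: average over n_iter calls, synchronizing once at the end.
    tic = time.perf_counter()
    for _ in range(n_iter):
        torch.fft.fft2(arr, dim=(-2, -1))
    sync()
    steady_state = (time.perf_counter() - tic) / n_iter
    return first_call, steady_state


if __name__ == "__main__":
    x = torch.randn(8, 512, 512, dtype=torch.float64)
    # "cuda:0" is an assumption; use whichever GPU index applies.
    devices = ["cpu"] + (["cuda:0"] if torch.cuda.is_available() else [])
    for device in devices:
        first, steady = time_fft2(x.to(device))
        print(f"{device}: first call {first * 1e3:.3f} ms, "
              f"steady state {steady * 1e3:.3f} ms")
```

Unlike the loop above, this keeps the input on the device, so the host-to-device copy is not part of the measurement. PyTorch also ships `torch.utils.benchmark.Timer`, which takes care of warm-up and CUDA synchronization and may be a more convenient choice.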

huangapple
  • Posted on June 9, 2023, 06:26:09
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76436084.html