torch FFT with a GPU is much slower than FFT with the CPU

Question

I'm running the following simple code on a powerful server with several Nvidia RTX A5000/6000 GPUs and CUDA 11.8. For some reason, the FFT on the GPU is much slower than on the CPU (200-800 times slower). Does anyone have an idea why that might be? I tried different GPUs, but the results stay roughly the same.

    import time

    import sigpy as sp
    import torch

    # 256x256 Shepp-Logan phantom as the test image
    arr = sp.shepp_logan((256, 256))

    # Time a single FFT on the CPU
    device = "cpu"
    arr = torch.from_numpy(arr).to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = toc - tic

    # Time a single FFT on the GPU (the first CUDA FFT call in this process)
    device = "cuda:5"
    arr = arr.to(device)
    tic = time.perf_counter()
    res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    gpu_time = toc - tic

    print(f"CPU time: {cpu_time}, GPU time: {gpu_time} ratio: {gpu_time / cpu_time}")

Thanks!

Answer 1

Score: 1

Okay, so digging a little deeper: timing a single call is not the right way to compare compute time, and for a fair comparison we need to average over many runs. After doing that, I see that the GPU version is indeed faster (more noticeably for larger inputs). So there seems to be some "warm-up" time for the GPU (though I didn't expect such a big difference for a single test point). I'd love to hear if anyone has an explanation for why this happens!

    import time

    import numpy as np
    import torch

    IM_SIZE = 512     # image side length
    BATCH_SIZE = 8    # images per FFT call
    N_TEST = 10000    # timed iterations per device
    RAND = 100        # number of distinct random images

    arrs = np.random.randn(RAND, IM_SIZE, IM_SIZE)
    arrs = torch.from_numpy(arrs)

    # CPU: average over N_TEST iterations (tiling and .to() are part of the timing)
    device = "cpu"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    toc = time.perf_counter()
    cpu_time = (toc - tic) / N_TEST

    # GPU: same loop, so any one-time start-up cost is spread over N_TEST iterations
    device = "cuda:5"
    tic = time.perf_counter()
    for i in range(N_TEST):
        arr = torch.tile(arrs[i % RAND], [BATCH_SIZE, 1, 1])
        arr = arr.to(device)
        res = torch.fft.fft2(arr, dim=(-2, -1))
    # CUDA calls are asynchronous: wait for queued work before stopping the clock
    torch.cuda.synchronize(torch.device(device))
    toc = time.perf_counter()
    gpu_time = (toc - tic) / N_TEST

    print(f"CPU time: {cpu_time * 1000} ms, GPU time: {gpu_time * 1000} ms ratio: {gpu_time / cpu_time}")
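A likely explanation for the "warm-up", offered as a sketch and not a definitive diagnosis: the first CUDA FFT in a process pays one-time costs (loading cuFFT and building an FFT plan for that shape, which PyTorch then caches), and CUDA calls return before the GPU has finished, so the clock should only be stopped after `torch.cuda.synchronize()`. The helper names `time_op` and `compare_fft2` below are made up for illustration, and the warm-up and iteration counts are arbitrary assumptions.

```python
import time
from statistics import median


def time_op(fn, sync=None, warmup=5, iters=50):
    """Return the median wall-clock time of fn() in seconds.

    sync is an optional callable that blocks until queued work is done
    (for CUDA this would wrap torch.cuda.synchronize).
    """
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()                      # untimed: absorbs one-time setup cost
    sync()
    samples = []
    for _ in range(iters):
        tic = time.perf_counter()
        fn()
        sync()                    # wait until the work has actually finished
        samples.append(time.perf_counter() - tic)
    return median(samples)


def compare_fft2(size=256, device="cuda"):
    """Time torch.fft.fft2 on the CPU and on a CUDA device.

    Hypothetical helper: needs PyTorch and a CUDA-capable GPU.
    """
    import torch  # imported here so time_op stays usable without PyTorch

    x_cpu = torch.randn(size, size, dtype=torch.float64)
    x_gpu = x_cpu.to(device)      # transfer happens outside the timed region
    dev = torch.device(device)
    cpu = time_op(lambda: torch.fft.fft2(x_cpu))
    gpu = time_op(lambda: torch.fft.fft2(x_gpu),
                  sync=lambda: torch.cuda.synchronize(dev))
    return cpu, gpu
```

On a machine with PyTorch and a GPU, `cpu, gpu = compare_fft2(256, "cuda:5")` would give per-call medians with the start-up cost excluded. The median is used instead of the mean so that a few slow outliers don't dominate. `torch.utils.benchmark.Timer` is a ready-made alternative that also handles warm-up and CUDA synchronization.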

