2023-08-09 11:37:31

Why does conv2d yield different results with different batch sizes?

Question

I feed a conv2d layer the same data with different batch sizes (the larger batch is built by concatenating copies of the input along the batch dimension):

import torch
import torch.nn as nn

a = torch.rand(1, 512, 16, 16)  # (1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0) # (3, 512, 16, 16)
a, b = a.cuda(), b.cuda()
net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
net = net.cuda()
ay = net(a)
by = net(b)
print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0]))

However, the results differ:

ay[0], by[0] diff 3.5762786865234375e-06
ay[0], by[0] allclose False

This was tested on Linux + V100 + torch 1.9.0 + cu111, but the same problem shows up on many other configurations. Any clue why? Or am I simply misunderstanding how conv2d should work?

I ran into this when validating my training-set results with batch size 1: the error was significantly different from the one I had recorded during training, so I investigated and found that the conv2d layer causes the discrepancy. If I understand conv2d correctly, this should not happen.

Answer 1

Score: 2

As far as I know, the problem is not specific to conv2d operations; it is due to limited floating-point precision, whose effect can vary with the operation and the architecture. This is a known issue; see e.g. this discussion on the PyTorch forum.
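The effect of limited precision can be seen without a GPU. The sketch below is plain Python, not PyTorch: it adds the same numbers in two different orders and gets slightly different totals, because floating-point addition is not associative. The helper `sum_in_order` and the random "products" are illustrative assumptions; the count 512 * 3 * 3 is only meant to mirror how many terms feed one output element of the convolution in the question. Whether a given backend really changes its accumulation order with the batch size is an assumption here, not something this snippet proves.

```python
import math
import random


def sum_in_order(values):
    """Accumulate values one by one, in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same three numbers, different grouping, different result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False

# One conv output element is a sum of 512 * 3 * 3 products (plus the bias).
random.seed(0)
products = [random.random() for _ in range(512 * 3 * 3)]
forward = sum_in_order(products)
backward = sum_in_order(reversed(products))

# The two totals agree to many digits but need not be bit-identical.
print("absolute difference:", abs(forward - backward))
print("close within rel_tol=1e-9:", math.isclose(forward, backward, rel_tol=1e-9))
```

This is why comparing outputs with a tolerance is usually more meaningful than expecting exact equality.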

The GPU computation you are running probably uses single-precision floats; if you switch to double precision, the discrepancy should shrink:

torch.set_default_tensor_type(torch.DoubleTensor)

meaning that:

print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))

should print:

> ay[0], by[0] allclose True

At least, that is what I see when testing on Linux with an A100.
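For reference, here is a minimal sketch of the same experiment run once in float32 and once in float64, without changing the global default tensor type. The helper name `max_batch_diff`, the fixed seed, and the CPU fallback are assumptions added for illustration; they are not part of the original question. Newer PyTorch releases also document `torch.set_default_dtype(torch.float64)` as the preferred way to change the default, so treat the per-tensor `dtype=` approach here as one option among several.

```python
import torch
import torch.nn as nn


def max_batch_diff(dtype, device):
    """Repeat the question's experiment in `dtype`; return max |ay[0] - by[0]|."""
    torch.manual_seed(0)
    a = torch.rand(1, 512, 16, 16, dtype=dtype, device=device)
    b = torch.cat([a, a, a], dim=0)  # same sample repeated three times
    net = nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1)
    net = net.to(device=device, dtype=dtype)
    with torch.no_grad():
        ay, by = net(a), net(b)
    return torch.max(torch.abs(ay[0] - by[0])).item()


# Assumption: fall back to CPU when no GPU is present. The size of the
# float32 difference (it may even be zero) depends on the backend.
device = "cuda" if torch.cuda.is_available() else "cpu"
diff32 = max_batch_diff(torch.float32, device)
diff64 = max_batch_diff(torch.float64, device)
print("float32 max diff:", diff32)
print("float64 max diff:", diff64)
```

The exact numbers will vary by hardware and library version; the point is only that the float64 difference should sit far below the `atol=1e-6` used above.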
