Why does conv2d yield different results with different batch sizes?

Question

I feed conv2d with the same data but a different batch size (the same sample stacked along the batch dimension):

import torch
import torch.nn as nn

a = torch.rand(1, 512, 16, 16)   # a single sample: (1, 512, 16, 16)
b = torch.cat([a, a, a], dim=0)  # the same sample repeated 3 times: (3, 512, 16, 16)

a, b = a.cuda(), b.cuda()

net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
net = net.cuda()

ay = net(a)
by = net(b)

print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0]))

The results, however, differ:

ay[0], by[0] diff 3.5762786865234375e-06
ay[0], by[0] allclose False

This was tested on Linux + V100 + torch 1.9.0 + cu111, but so far the same behavior has shown up on many other configurations as well. Any clue why? Or am I simply misunderstanding how conv2d is supposed to work?

I ran into this while validating my training-set results with batch size 1: the error differed noticeably from what I had recorded during training, so I traced it down and found that the conv2d layer was the cause. If I understand conv2d correctly, this should not happen.


Answer 1

Score: 2

As far as I know, the problem is not specific to the conv2d operation; it is due to the limited precision of floating-point arithmetic, which can vary depending on the operations and the architecture. This is a known issue; see e.g. this discussion on the PyTorch forum.
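
To get an intuition for why the same numbers can come out slightly different, note that float32 arithmetic is neither exact nor associative, so carrying out a reduction in a different order (which is presumably what happens when the backend picks a different convolution algorithm or accumulation scheme for a different batch size) can shift the last few bits. A rough illustration with ordinary summation, not the actual conv2d kernel (sizes chosen arbitrarily):

import torch

x = torch.rand(1 << 16)

# float32 accumulates a small rounding error relative to a float64 reference sum
print(abs(x.sum().item() - x.double().sum().item()))

# splitting the same reduction differently can also change the float32 result slightly
print(abs((x[:1 << 15].sum() + x[1 << 15:].sum()).item() - x.sum().item()))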

The GPU computation you are currently running most likely uses single-precision floats; if you set it to double precision, the discrepancy should shrink:

torch.set_default_tensor_type(torch.DoubleTensor)

meaning that:

print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))

should print:

ay[0], by[0] allclose True

At least that is what I see when testing on Linux with an A100.
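
For reference, a minimal sketch of the full rerun in double precision (assuming the default-dtype switch happens before the tensors and the layer are created, and using the same atol=1e-6 tolerance as above):

import torch
import torch.nn as nn

# switch the default dtype to float64 *before* creating tensors and layers
torch.set_default_tensor_type(torch.DoubleTensor)

a = torch.rand(1, 512, 16, 16).cuda()
b = torch.cat([a, a, a], dim=0).cuda()

net = nn.Conv2d(512, 1024, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)).cuda()

ay = net(a)
by = net(b)

print('ay[0], by[0] max diff', torch.max(torch.abs(ay[0] - by[0])).item())
print('ay[0], by[0] allclose', torch.allclose(ay[0], by[0], atol=1e-6))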

