Loss function giving nan in pytorch

Question

In PyTorch, I have a loss function that includes 1/x plus a few other terms. The last layer of my neural net is a sigmoid, so the values will be between 0 and 1.

Some value fed to 1/x must be getting really small at some point, because my loss has become this:

loss: 11.047459  [729600/235474375]
loss: 9.348356  [731200/235474375]
loss: 7.184393  [732800/235474375]
loss: 8.699876  [734400/235474375]
loss: 7.178806  [736000/235474375]
loss: 8.090066  [737600/235474375]
loss: 12.415799  [739200/235474375]
loss: 10.422441  [740800/235474375]
loss: 8.335846  [742400/235474375]
loss:     nan  [744000/235474375]
loss:     nan  [745600/235474375]
loss:     nan  [747200/235474375]
loss:     nan  [748800/235474375]
loss:     nan  [750400/235474375]

I'm wondering if there's any way to "rewind" when a nan is hit, or to define the loss function so that it never happens? Thanks!

Answer 1

Score: 3

Clip your loss to fall within a reasonable range to prevent gradient explosion (i.e., continually climbing out of the local neighborhood, as in BrockBrown's answer).

I'd recommend something like this:

    epsilon = 1e-01
    loss = 1/(x + epsilon)  # maximum value bounded at 1/epsilon = 10

or:

    eta = 5  # maximum loss value
    loss = torch.clamp(loss, max=eta)

As a third option, you can simply check for nan values and skip backpropagation in those cases. This won't help with stability; it just discards outlier loss values, so you still may not get good convergence depending on when these values occur.

    if loss.isnan().sum() == 0:  # no nan values
        loss.backward()
        optimizer.step()

I'd also note that if your loss is 1/x and x is bounded between 0 and 1, you can never achieve a loss below 1; generally your loss should be 0 for a perfect output. Combining this fact with the previous ideas:

    loss = 1/(epsilon + x) - 1/(1 + epsilon)

When x = 1 the loss is 0, and when x = 0 the loss is approximately 1/epsilon. You can adjust epsilon to keep the loss within a desired range for stability.
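Putting these ideas together, here is a minimal, self-contained sketch of the stabilized loss. The function name stable_inverse_loss is my own placeholder; epsilon and eta reuse the example values from above:

    import torch

    def stable_inverse_loss(x, epsilon=1e-1, eta=5.0):
        # 1/x-style loss with an epsilon offset, shifted so a perfect
        # output (x = 1) gives 0, then clamped to a maximum of eta.
        loss = 1 / (x + epsilon) - 1 / (1 + epsilon)
        return torch.clamp(loss, max=eta)

    # Quick check at the extremes of the sigmoid's output range:
    print(stable_inverse_loss(torch.tensor(1.0)))  # tensor(0.)  -> perfect output
    print(stable_inverse_loss(torch.tensor(0.0)))  # tensor(5.)  -> ~9.09 before clamping

Note that clamping zeroes the gradient wherever the loss sits at the cap, so eta is a trade-off between stability and how much signal you keep from very bad outputs.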

Answer 2

Score: 2

Your loss is jumping all over the place instead of steadily decreasing. Have you tried decreasing your learning rate? It looks like it's jumping across the minimum, bouncing back and forth. This can happen if the learning rate is too high.

To answer your question about rewinding: ideally you shouldn't have to rewind; the loss should be steadily decreasing. You may also want to look into learning rate schedulers.
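For instance, ReduceLROnPlateau drops the learning rate automatically when the loss stops improving. Here is a minimal sketch, where the model, data, and hyperparameter values are placeholders purely for illustration:

    import torch

    # Placeholder model, optimizer, and data purely for illustration.
    model = torch.nn.Sequential(torch.nn.Linear(10, 1), torch.nn.Sigmoid())
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    inputs, targets = torch.randn(64, 10), torch.rand(64, 1)

    # Cut the learning rate by 10x when the loss hasn't improved for 5 epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=5
    )

    for epoch in range(50):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())  # the scheduler reacts to the observed loss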
