How to scale a gradient norm in Keras

Question

In the pseudocode for MuZero, they do the following:

hidden_state = tf.scale_gradient(hidden_state, 0.5)

From this question about what this means, I learned that this was likely a gradient norm scaling.

How can I do a gradient norm scaling (clipping the gradient norm to a particular length) on a hidden state in Keras? Later on they also do the same scaling on a loss value:

loss += tf.scale_gradient(l, gradient_scale)

This site says that I should use the clipnorm parameter on the optimizer. But I don't think that will work, because I'm scaling the gradients before using the optimizer. (And especially since I'm scaling different things to different lengths.)
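For reference, a minimal sketch of that clipnorm suggestion (assuming TF 2.x Keras optimizers; the learning rate is just a placeholder):

import tensorflow as tf

# clipnorm makes the optimizer clip each variable's gradient so that its
# L2 norm is at most 1.0, with the same limit for every trainable variable.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

This applies one uniform limit at the optimizer level, which is why it does not fit the per-tensor scaling described above.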

Here is the particular code in question from the paper, in case it is helpful. (Note that scale_gradient is not an actual Tensorflow function. See the previously linked question if you are confused, as I was.)

def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      # THIS LINE HERE
      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      # AND AGAIN HERE
      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)

(Note that this question is different from this one which is asking about multiplying the gradient by a value, not clipping the gradient to a particular magnitude.)
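For illustration only: one way to express "clip the gradient flowing back through a particular tensor" in TensorFlow 2.x is an identity wrapper built on tf.custom_gradient. The helper name clip_gradient_norm is hypothetical; it is not part of TensorFlow or of the MuZero code.

import tensorflow as tf

def clip_gradient_norm(tensor, max_norm):
  """Identity on the forward pass; on the backward pass, clips the L2 norm
  of the incoming gradient to max_norm before passing it on."""
  @tf.custom_gradient
  def _wrap(x):
    def grad(dy):
      return tf.clip_by_norm(dy, max_norm)
    return tf.identity(x), grad
  return _wrap(tensor)

# Hypothetical use at the call sites in the pseudocode above:
# hidden_state = clip_gradient_norm(hidden_state, 0.5)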

Answer 1

Score: 1

You can use the MaxNorm constraint presented here.

It's very simple and straightforward. Import it with: from keras.constraints import MaxNorm

If you want to apply it to weights, pass kernel_constraint = MaxNorm(max_value=2, axis=0) when you define a Keras layer (see the linked page for details on axis).

You can also use bias_constraint = ...
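A minimal sketch of that layer-level usage (the layer width and activation are placeholders, not something from the question):

from tensorflow.keras import layers
from tensorflow.keras.constraints import MaxNorm

# Hypothetical layer: each kernel column and the bias vector are kept
# at an L2 norm of at most 2 after every weight update.
dense = layers.Dense(
    64,
    activation="relu",
    kernel_constraint=MaxNorm(max_value=2, axis=0),
    bias_constraint=MaxNorm(max_value=2),
)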

If you want to apply it to any other tensor, you can simply call it with a tensor:

normalizer = MaxNorm(max_value=2, axis=0)
normalized_tensor = normalizer(original_tensor)

And you can see the source code is pretty simple:

def __call__(self, w):
    norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
    desired = K.clip(norms, 0, self.max_value)
    return w * (desired / (K.epsilon() + norms))
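As a self-contained illustration of calling it on an arbitrary tensor (the shape and max_value below are placeholders):

import tensorflow as tf
from tensorflow.keras.constraints import MaxNorm

# Placeholder tensor standing in for e.g. a batch of hidden states.
original_tensor = tf.random.normal([8, 16])

normalizer = MaxNorm(max_value=2.0, axis=0)
normalized_tensor = normalizer(original_tensor)

# Columns whose L2 norm exceeded 2.0 are rescaled down to roughly 2.0;
# columns already under the limit are left essentially unchanged.
print(tf.norm(normalized_tensor, axis=0))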
