How to scale a gradient norm in Keras
Question
In the pseudocode for MuZero, they do the following:
hidden_state = tf.scale_gradient(hidden_state, 0.5)
From this question about what that line means, I learned that it is likely a form of gradient norm scaling.
How can I apply gradient norm scaling (clipping the gradient norm to a particular length) to a hidden state in Keras? Later on, they also apply the same scaling to a loss value:
loss += tf.scale_gradient(l, gradient_scale)
This site says that I should use the clipnorm parameter on the optimizer. But I don't think that will work, because I'm scaling the gradients before they reach the optimizer (and especially since I'm scaling different things to different lengths).
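For concreteness, the optimizer-level clipping that site describes looks something like the sketch below (a per-gradient norm limit applied uniformly to every variable, which is exactly why it doesn't fit my per-tensor case):

from tensorflow import keras

# Clips each weight gradient so its norm is at most 1.0 before the update
# step -- one uniform setting for all variables, not per-tensor.
optimizer = keras.optimizers.SGD(clipnorm=1.0)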
Here is the particular code in question from the paper, in case it is helpful. (Note that scale_gradient is not an actual Tensorflow function. See the previously linked question if you are confused, as I was.)
def update_weights(optimizer: tf.train.Optimizer, network: Network, batch,
                   weight_decay: float):
  loss = 0
  for image, actions, targets in batch:
    # Initial step, from the real observation.
    value, reward, policy_logits, hidden_state = network.initial_inference(
        image)
    predictions = [(1.0, value, reward, policy_logits)]

    # Recurrent steps, from action and previous hidden state.
    for action in actions:
      value, reward, policy_logits, hidden_state = network.recurrent_inference(
          hidden_state, action)
      predictions.append((1.0 / len(actions), value, reward, policy_logits))

      # THIS LINE HERE
      hidden_state = tf.scale_gradient(hidden_state, 0.5)

    for prediction, target in zip(predictions, targets):
      gradient_scale, value, reward, policy_logits = prediction
      target_value, target_reward, target_policy = target

      l = (
          scalar_loss(value, target_value) +
          scalar_loss(reward, target_reward) +
          tf.nn.softmax_cross_entropy_with_logits(
              logits=policy_logits, labels=target_policy))

      # AND AGAIN HERE
      loss += tf.scale_gradient(l, gradient_scale)

  for weights in network.get_weights():
    loss += weight_decay * tf.nn.l2_loss(weights)

  optimizer.minimize(loss)
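For reference, a definition of scale_gradient that appears in public MuZero reimplementations is the tf.stop_gradient identity trick sketched below (my assumption, since the paper never defines the function; note it scales the gradient rather than clipping it):

import tensorflow as tf

def scale_gradient(tensor, scale):
  # Forward pass: tensor * scale + tensor * (1 - scale) == tensor (identity).
  # Backward pass: the gradient flows only through the first term, so it is
  # multiplied by `scale`.
  return tensor * scale + tf.stop_gradient(tensor) * (1 - scale)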
(Note that this question is different from this one, which asks about multiplying the gradient by a value rather than clipping it to a particular magnitude.)
Answer 1
Score: 1
You can use the MaxNorm constraint presented here.
It's very simple and straightforward. Import it with from keras.constraints import MaxNorm.
If you want to apply it to weights, you pass kernel_constraint = MaxNorm(max_value=2, axis=0) when you define a Keras layer (read the linked page for details on axis).
You can also use bias_constraint = ...
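A minimal sketch of wiring the constraint into a layer (the layer size and max_value here are arbitrary examples of my own):

from keras.constraints import MaxNorm
from keras.layers import Dense

# Re-normalizes each column of the kernel (axis=0) to a norm of at most 2
# after every weight update; the bias vector gets the same treatment.
layer = Dense(64,
              kernel_constraint=MaxNorm(max_value=2, axis=0),
              bias_constraint=MaxNorm(max_value=2))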
If you want to apply it to any other tensor, you can simply call it with a tensor:
normalizer = MaxNorm(max_value=2, axis=0)
normalized_tensor = normalizer(original_tensor)
And you can see the source code is pretty simple:

def __call__(self, w):
    norms = K.sqrt(K.sum(K.square(w), axis=self.axis, keepdims=True))
    desired = K.clip(norms, 0, self.max_value)
    return w * (desired / (K.epsilon() + norms))