In a DQN for Q-learning, how should I apply high gamma values during experience replay?

Question

I'm using PyTorch to implement a Q-learning approach to a card game, where the rewards come only at the end of the hand, when a score is calculated. I am using experience replay with high gammas (0.5-0.95) to train the network.

My question is about how to apply the discounted rewards to the replay memory. It seems that the correct discounted reward depends on understanding, at some point, the temporal sequence of state transitions and rewards, and applying the discount recursively backwards from the terminal state.

Yet most algorithms seem to apply the gamma to a randomly selected batch of transitions from the replay memory, which would seem to de-correlate them temporally and make the calculation of discounted rewards problematic. In these algorithms the discount seems to be applied to a forward pass on the "next_state", although it can be hard to interpret.

My approach has been to calculate the discounted rewards when the terminal state has been reached, and apply them directly to the replay memory's reward values at that time. I do not reference the gamma at replay time, since it has already been factored in.
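
Roughly, that amounts to something like the following sketch (the helper name and the list-based replay memory here are just placeholders, not my actual code):

```python
# Rough sketch: when the hand (episode) ends, walk its transitions backwards,
# replace each stored reward with its discounted return, and only then push
# the transitions into the replay memory. All names are placeholders.
def push_episode_with_returns(episode, replay_memory, gamma=0.9):
    """episode: list of (state, action, reward, next_state) tuples in temporal order."""
    g = 0.0
    discounted = []
    for state, action, reward, next_state in reversed(episode):
        g = reward + gamma * g  # discounted return from this step onward
        discounted.append((state, action, g, next_state))
    replay_memory.extend(reversed(discounted))  # restore temporal order before storing
```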

This makes sense to me, but it is not what I see, for example, in the PyTorch "Reinforcement Learning (DQN) Tutorial". Can someone explain how the time-decorrelation in random batches is managed for high-gamma Q-Learning?

Answer 1

Score: 0

Imagine you are playing a simple game where you move around a grid and collect coins. You're facing a common challenge in reinforcement learning: the rewards come late, and it's hard to know which actions were good or bad.
In Q-Learning, you want to know how good it is to take a certain move (action) at a certain spot (state) on the grid. We call this the Q-value, and you calculate it with this formula:

    Q(state, action) = reward + gamma * max_next_action Q(next_state, next_action)

The Q-value is the immediate reward plus the discounted best Q-value you can get from the next state.
You save each transition (state, action, reward, next_state) in a memory. During training, you randomly pick a batch of these transitions to update the Q-values, which helps avoid focusing too much on the most recent moves. Although the transitions are picked randomly, the sequence of rewards is still taken into account: each transition carries its own next_state, so the prediction of future rewards enters through the gamma * max_next_action Q(next_state, next_action) part of the formula.
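
To make that concrete, here is a rough sketch of such a batched update (a generic illustration, not the exact tutorial code; `policy_net`, `target_net`, and the batch tensors are assumed placeholders):

```python
import torch
import torch.nn.functional as F

def dqn_update(policy_net, target_net, optimizer, batch, gamma=0.95):
    # The batch comes from a *random* sample of the replay memory, so it is
    # temporally shuffled; `dones` is a 0/1 float tensor marking terminal transitions.
    states, actions, rewards, next_states, dones = batch

    # Q(state, action) for the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: reward + gamma * max_next_action Q(next_state, next_action),
    # with no future term for terminal transitions
    with torch.no_grad():
        next_q = target_net(next_states).max(1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the discount is applied per transition, using that transition's own stored next_state, so the batch never needs to be in temporal order.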
Your approach of waiting until the end of the game to compute the discounted returns is a bit different. It's closer to a Monte Carlo method, where you only update the value estimates at the end of each episode: you play the whole game first, then decide how good the moves were. This can work, but it may be less effective when the games are long. The traditional Q-Learning method, on the other hand, updates the Q-values as you play the game.

Keep in mind that in standard Q-Learning you don't need to calculate and store discounted rewards manually. Instead, the discounting of future rewards is handled inside the Q-value update, and even when you train on random batches of transitions, future rewards are still taken into account through that update. This is how Q-Learning manages the temporal decorrelation of random batches even with a high gamma.
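
As a toy illustration of that last point (a made-up example, not from the card game above): take a three-step chain where only the final transition gives a reward, and replay its transitions in purely random order; the bootstrapped update still propagates the reward backwards:

```python
import random

gamma, alpha = 0.9, 0.5
# (state, reward, next_state, done) for a chain s0 -> s1 -> s2 -> terminal
transitions = [
    (0, 0.0, 1, False),
    (1, 0.0, 2, False),
    (2, 1.0, None, True),  # the only reward arrives at the end
]
Q = [0.0, 0.0, 0.0]  # one action per state, so one Q-value per state

for _ in range(500):
    s, r, s_next, done = random.choice(transitions)  # random replay, no temporal order
    target = r if done else r + gamma * Q[s_next]    # bootstrap from the stored next_state
    Q[s] += alpha * (target - Q[s])

print(Q)  # converges towards [0.81, 0.9, 1.0], i.e. [gamma**2, gamma, 1.0]
```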
