Reinforcement Learning Policy

Question

I have really searched thoroughly for why the sum in the circle does not equal 1. The rationale is that, being in a specific state s, the sum of the probabilities of taking all the available actions (while in state s) is 1.

So my question is: why does this sum not equal 1?

Answer 1

Score: 1


A valid explanation depends entirely on your policy function. I will use epsilon-greedy for this thread.

Epsilon-greedy is a semi-greedy algorithm that adds exploration controlled by a parameter epsilon. In pseudocode, epsilon-greedy looks something like this:

import random

def egreedy(s, epsilon, actions, greedy_action):
    """s is the state you are currently in"""
    theta = random.random()  # random float in [0, 1)
    if theta < epsilon:
        # explore: pick a random action that is not the greedy action
        return random.choice([a for a in actions if a != greedy_action])
    return greedy_action  # exploit
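To see the policy in action, here is a small self-contained sketch (the 5-element action set, epsilon of 0.1, and choice of action 0 as the greedy action are all hypothetical) that samples from an epsilon-greedy policy many times and tallies the empirical action frequencies:

```python
import random
from collections import Counter

EPSILON = 0.1
ACTIONS = list(range(5))   # hypothetical action set
GREEDY = 0                 # assume action 0 is the greedy action

def egreedy(epsilon, actions, greedy_action):
    """One epsilon-greedy draw: explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice([a for a in actions if a != greedy_action])
    return greedy_action

random.seed(0)  # fixed seed for reproducibility
n = 100_000
counts = Counter(egreedy(EPSILON, ACTIONS, GREEDY) for _ in range(n))
freqs = {a: counts[a] / n for a in ACTIONS}
# freqs[GREEDY] comes out close to 0.9; every other frequency is close to 0.025
```

The empirical frequencies match the proportions discussed below, and by construction they sum to 1, since every draw picks exactly one action.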

To put it simply, if your epsilon is 0.1, there is a 0.9 proportion of picking the greedy action and a 0.1/(#a-1) proportion for each other action, where #a denotes the number of available actions. For example, if there are 5 actions, then the following applies:
0.9 proportion for the greedy action
0.1/(5-1) = 0.025 proportion for each of the other 4 actions.
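Writing these proportions out explicitly shows that they form a valid probability distribution (a minimal sketch; the 5 actions with action 0 as the greedy one are just the example above):

```python
EPSILON = 0.1
N_ACTIONS = 5
GREEDY = 0  # index of the greedy action (hypothetical)

# epsilon-greedy probabilities: 1 - epsilon for the greedy action,
# epsilon / (N_ACTIONS - 1) for each of the other actions
probs = [1 - EPSILON if a == GREEDY else EPSILON / (N_ACTIONS - 1)
         for a in range(N_ACTIONS)]

print(probs)  # [0.9, 0.025, 0.025, 0.025, 0.025]
# 0.9 + 4 * 0.025 = 1.0, up to floating-point rounding
```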

Then, to get back to your question: these proportions add up to 1, since all of them combined yield 1. This is only logical, because you will always pick exactly one action, and thus the proportions over all actions must sum to 1.

Now, I used epsilon-greedy here, but another policy might yield values that do not sum to 1. This happens when the proportions are not normalized. Something like Softmax solves this: Softmax applies normalization such that the proportions of a set of values add up to 1. The formula is easy to look up and apply, so I will leave it out of this thread.
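As an illustration of that normalization, here is a minimal Softmax implementation (the preference values below are made up) that turns arbitrary scores into proportions summing to 1:

```python
import math

def softmax(values):
    """Normalize a list of raw scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

prefs = [2.0, 1.0, 0.5, -1.0]  # hypothetical action preferences
probs = softmax(prefs)
# every entry is positive, and the entries sum to 1 (up to rounding)
```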

To conclude, sum_a pi(a|s) should add up to 1. If it does not, the values are not normalized; use something like Softmax to fix this.

huangapple
  • Published on 2023-05-07 21:24:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76194194.html