Reinforcement Learning Policy
Question
I have searched thoroughly but cannot work out why the sum in the circle isn't equal to 1.
My reasoning is that, in a specific state s, the probabilities of taking all the available actions (while in state s) sum to 1.
So my question is: why is this sum not equal to one?
Answer 1
Score: 1
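A valid explanation depends entirely on your policy function. I will use epsilon-greedy for this thread.
Epsilon-greedy is a semi-greedy algorithm that adds exploration through a parameter epsilon. In Python, epsilon-greedy looks something like this:

    import random

    def egreedy(s, q_values, epsilon=0.1):
        """s is the current state; q_values[s] is an assumed list of action values (>= 2 actions)."""
        actions = range(len(q_values[s]))
        greedy = max(actions, key=lambda a: q_values[s][a])
        # With probability epsilon, explore: pick a random action that is not the greedy one.
        if random.random() < epsilon:
            return random.choice([a for a in actions if a != greedy])
        # Otherwise exploit: pick the greedy action.
        return greedy
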
Put simply, with an epsilon of 0.1 the greedy action is picked with probability 0.9, and each other action with probability 0.1/(#a-1), where #a denotes the number of available actions. For example, with 5 actions:
0.9 for the greedy action
0.1/(5-1) = 0.025 for each other action.
To get back to your question: these add up to 1 (0.9 + 4 × 0.025 = 1). That is only logical, since you always pick exactly one action, so the probabilities of all actions must add up to 1.
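As a quick check of the arithmetic above, here is a minimal sketch that builds the full distribution pi(a|s) for this epsilon-greedy variant and confirms that it sums to 1. The function name `egreedy_probs` and the choice of action 2 as the greedy action are assumptions made only for illustration.

```python
import math

def egreedy_probs(n_actions, greedy_action, epsilon):
    """Return pi(a|s) for every action when exploration never picks the greedy action."""
    # Each non-greedy action receives an equal share of the exploration mass epsilon.
    other = epsilon / (n_actions - 1)
    probs = [other] * n_actions
    # The greedy action keeps the remaining 1 - epsilon.
    probs[greedy_action] = 1.0 - epsilon
    return probs

# Hypothetical setting: 5 actions, action 2 is greedy, epsilon = 0.1.
probs = egreedy_probs(n_actions=5, greedy_action=2, epsilon=0.1)
print(probs)
print(math.isclose(sum(probs), 1.0))
```

The sum is compared with `math.isclose` rather than `==` because floating-point rounding can leave the total a hair away from 1.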
Now, I used epsilon-greedy here, but another policy might produce values that do not lie in [0, 1] or do not sum to 1. That happens when the values have not been normalized. Something like softmax solves this: it normalizes a set of values so that they add up to 1. The formula is easy to look up and apply, so I will leave it out of this thread.
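For readers who want to see the normalization in action, the following is a minimal sketch of softmax over a set of unnormalized action preferences. The `scores` values are made up for illustration, and subtracting the maximum is a common numerical-stability trick rather than something the answer prescribes.

```python
import math

def softmax(preferences):
    """Map arbitrary real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    m = max(preferences)
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical, unnormalized action preferences for a single state s.
scores = [2.0, -1.0, 0.5, 3.0, 0.0]
pi = softmax(scores)
print(pi)
print(math.isclose(sum(pi), 1.0))
```

Every output lies strictly between 0 and 1, and the action with the highest score keeps the highest probability, so the result can be used directly as pi(a|s).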
To conclude, sum_a pi(a|s) should add up to 1. If it does not, the values are not normalized; use something like softmax to fix this.