Reinforcement Learning Policy

Question

I have really searched thoroughly for why the sum in the circle does not equal 1. The rationale is that, being in a specific state s, the sum of the probabilities of taking all the available actions (while in state s) is 1.

So my question is: why does this sum not equal 1?

Answer 1

Score: 1


A valid explanation depends entirely on your policy function. I will use epsilon-greedy for this thread.

Epsilon-greedy is a semi-greedy algorithm that adds exploration controlled by a parameter epsilon. In pseudocode, epsilon-greedy looks something like this:

import random

def egreedy(s, epsilon, actions, greedy_action):
    """s is the state you are currently in"""
    theta = random.random()  # random float in [0, 1)
    if theta < epsilon:
        # explore: pick a random action that is not the greedy action
        return random.choice([a for a in actions if a != greedy_action])
    return greedy_action  # exploit
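To see the policy in action, here is a small self-contained sketch (the 5-element action set, epsilon of 0.1, and choice of action 0 as the greedy action are all hypothetical) that samples from an epsilon-greedy policy many times and tallies the empirical action frequencies:

```python
import random
from collections import Counter

EPSILON = 0.1
ACTIONS = list(range(5))   # hypothetical action set
GREEDY = 0                 # assume action 0 is the greedy action

def egreedy(epsilon, actions, greedy_action):
    """One epsilon-greedy draw: explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice([a for a in actions if a != greedy_action])
    return greedy_action

random.seed(0)  # fixed seed for reproducibility
n = 100_000
counts = Counter(egreedy(EPSILON, ACTIONS, GREEDY) for _ in range(n))
freqs = {a: counts[a] / n for a in ACTIONS}
# freqs[GREEDY] comes out close to 0.9; every other frequency is close to 0.025
```

The empirical frequencies match the proportions discussed below, and by construction they sum to 1, since every draw picks exactly one action.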

To put it simply, if your epsilon is 0.1, there is a 0.9 proportion of picking the greedy action and a 0.1/(#a-1) proportion for each other action, where #a denotes the number of available actions. For example, if there are 5 actions, then the following applies:
0.9 proportion for the greedy action
0.1/(5-1) = 0.025 proportion for each of the other 4 actions.
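Writing these proportions out explicitly shows that they form a valid probability distribution (a minimal sketch; the 5 actions with action 0 as the greedy one are just the example above):

```python
EPSILON = 0.1
N_ACTIONS = 5
GREEDY = 0  # index of the greedy action (hypothetical)

# epsilon-greedy probabilities: 1 - epsilon for the greedy action,
# epsilon / (N_ACTIONS - 1) for each of the other actions
probs = [1 - EPSILON if a == GREEDY else EPSILON / (N_ACTIONS - 1)
         for a in range(N_ACTIONS)]

print(probs)  # [0.9, 0.025, 0.025, 0.025, 0.025]
# 0.9 + 4 * 0.025 = 1.0, up to floating-point rounding
```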

Then, to get back to your question: these proportions add up to 1, since all of them combined yield 1. This is only logical, because you will always pick exactly one action, and thus the proportions over all actions must sum to 1.

Now, I used epsilon-greedy here, but another policy might yield values that do not sum to 1. This happens when the proportions are not normalized. Something like Softmax solves this: Softmax applies normalization such that the proportions of a set of values add up to 1. The formula is easy to look up and apply, so I will leave it out of this thread.
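As an illustration of that normalization, here is a minimal Softmax implementation (the preference values below are made up) that turns arbitrary scores into proportions summing to 1:

```python
import math

def softmax(values):
    """Normalize a list of raw scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

prefs = [2.0, 1.0, 0.5, -1.0]  # hypothetical action preferences
probs = softmax(prefs)
# every entry is positive, and the entries sum to 1 (up to rounding)
```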

To conclude, sum_a pi(a|s) should add up to 1. If it does not, the values are not normalized; use something like Softmax to fix this.

huangapple
  • Published on 2023-05-07 21:24:01
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76194194.html