Why is my DQN agent's training so inefficient?

Question

I am trying to train an agent to play tic-tac-toe perfectly as the second player (the first player moves randomly), using the DQN agent from tf-agents, but training is extremely slow.

After 100,000 training steps the model has not improved its results at all.

I understand that the model will not be fully trained within 100,000 steps, but some progress should definitely have shown up in that time.

To be honest, I don't fully understand what's wrong with my training code:

import numpy as np
import tensorflow as tf

from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common
from tensorflow.keras.optimizers import Adam

# Graphic and RandomTicTacToeEnvironment are defined in the linked Colab notebook.

LOG_PERIOD = 1000
PRINT_PERIOD = 100
LEARNING_RATE = 0.001
NUM_ITERATIONS = 100_000

graph = Graphic(LOG_PERIOD)
tf_env = TFPyEnvironment(RandomTicTacToeEnvironment())

q_net = QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(100,)
)

train_step_counter = tf.Variable(0)

agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=0.1,
    train_step_counter=train_step_counter
)
agent.initialize()

eval_policy = agent.policy
collect_policy = agent.collect_policy

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=1000
)

collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)

collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)

initial_collect_policy = random_tf_policy.RandomTFPolicy(
    tf_env.time_step_spec(),
    tf_env.action_spec()
)

dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=8,
    num_steps=2,
    single_deterministic_pass=False
).prefetch(3)

iterator = iter(dataset)

dynamic_step_driver.DynamicStepDriver(
    tf_env,
    initial_collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)

time_step = tf_env.reset()

for _ in np.arange(NUM_ITERATIONS+1):
    time_step, _ = collect_driver.run(time_step)
    experience, _ = next(iterator)

    step = agent.train_step_counter.numpy()
    train_loss = agent.train(experience).loss

    if step % PRINT_PERIOD == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    for reward in tf.reshape(experience.reward, [-1]):
        graph.check(step, reward)

What am I doing wrong?

Full project with the environment code: https://colab.research.google.com/drive/1myp2aRAd03PP2RoPq1L9rxuaxHcnJf_U?usp=sharing

Answer 1

Score: 1

Most likely it's your epsilon value, though that's not necessarily the only problem.

  1. First of all, don't fixate on 100,000 steps; the agent could certainly learn in far fewer, especially in such a small state space (tic-tac-toe has at most 3^9 ≈ 20,000 board configurations).
  2. Use an epsilon schedule that decays much more slowly and see what happens. Test a range of values and check whether learning improves and fewer steps are needed (a sketch is shown after this list).
  3. Try digging into the built-in DqnAgent yourself and see which other hyperparameters (for example gamma, target_update_period or n_step_update) you could tweak to fit your problem.
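To make points 2 and 3 concrete, here is a minimal sketch of a slowly decaying epsilon, adapted from the agent construction in the question. It assumes the installed tf-agents version accepts a zero-argument callable for epsilon_greedy (it is forwarded to the collect policy's EpsilonGreedyPolicy); the decay horizon, end value, gamma and target_update_period below are placeholder values to experiment with, not settings verified for this environment.

EPSILON_DECAY_STEPS = 20_000  # hypothetical decay horizon; tune for your setup

train_step_counter = tf.Variable(0)

# Epsilon decays linearly from 1.0 to 0.05 over EPSILON_DECAY_STEPS train steps,
# instead of staying fixed at 0.1.
epsilon_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0,       # start fully exploratory
    decay_steps=EPSILON_DECAY_STEPS,
    end_learning_rate=0.05)          # keep a little exploration afterwards

agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    # evaluated when the collect policy samples actions
    epsilon_greedy=lambda: epsilon_schedule(train_step_counter),
    gamma=0.99,                # discount factor, another knob worth sweeping
    target_update_period=100,  # refresh the target network less frequently
    train_step_counter=train_step_counter
)
agent.initialize()

If epsilon_greedy has to be a plain float in your tf-agents version, an alternative is to build the collect policy yourself with tf_agents.policies.epsilon_greedy_policy.EpsilonGreedyPolicy(agent.policy, epsilon=lambda: epsilon_schedule(train_step_counter)) and pass that policy to the collect driver instead of agent.collect_policy.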