Why is my DQN agent's training so inefficient?


Question


I am trying to train an agent to play tic-tac-toe perfectly as the second player (the first player moves randomly) with the DQN agent from tf-agents, but my training is extremely slow.

After 100,000 training steps, the model has not improved its results in any way.

I understand that the model should not be expected to be fully trained within 100,000 steps, but some improvement should definitely have shown up in that time.

To be honest, I don't fully understand what's wrong with my training code. Here it is:

import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

LOG_PERIOD = 1000
PRINT_PERIOD = 100
LEARNING_RATE = 0.001
NUM_ITERATIONS = 100_000

# Graphic and RandomTicTacToeEnvironment are defined in the linked Colab notebook.
graph = Graphic(LOG_PERIOD)
tf_env = TFPyEnvironment(RandomTicTacToeEnvironment())
 
# Q-network with a single fully connected hidden layer of 100 units.
q_net = QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(100,)
)
 
train_step_counter = tf.Variable(0)
 
agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=0.1,
    train_step_counter=train_step_counter
)
agent.initialize()
 
eval_policy = agent.policy
collect_policy = agent.collect_policy

# Replay buffer with uniform sampling and a capacity of 1000 time steps.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=1000
)
 
# Driver that collects 10 environment steps per call with the agent's
# epsilon-greedy collect policy and writes them to the replay buffer.
collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)
 
# Compile the collection and training steps with tf.function for speed.
collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)
 
# Random policy intended to seed the replay buffer before training starts.
initial_collect_policy = random_tf_policy.RandomTFPolicy(
    tf_env.time_step_spec(),
    tf_env.action_spec()
)
 
# Sample mini-batches of 8 two-step trajectories from the replay buffer.
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=8,
    num_steps=2,
    single_deterministic_pass=False
).prefetch(3)
     
iterator = iter(dataset)
 
# Initial-collection driver built around the random policy; note that it is
# only constructed here and its run() is never called.
dynamic_step_driver.DynamicStepDriver(
    tf_env,
    initial_collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)
 
time_step = tf_env.reset()

# Main loop: collect 10 steps, sample a batch, and take one gradient step.
for _ in np.arange(NUM_ITERATIONS+1):
    time_step, _ = collect_driver.run(time_step)
    experience, _ = next(iterator)
 
    step = agent.train_step_counter.numpy()
    train_loss = agent.train(experience).loss

    if step % PRINT_PERIOD == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))
 
    for reward in tf.reshape(experience.reward, [-1]):
        graph.check(step, reward)

What am I doing wrong?

Full project with code of env: https://colab.research.google.com/drive/1myp2aRAd03PP2RoPq1L9rxuaxHcnJf_U?usp=sharing

Answer 1

Score: 1


Most likely it's your epsilon value. I'm not saying it's the only problem.

  1. First of all, don't fixate on the 100,000 steps; the agent could definitely train in fewer than that, especially in such a small, finite space.
  2. Introduce a better epsilon schedule, one that decays much more slowly, and see what happens. Test a range of values and see whether learning improves and fewer steps are needed (a sketch follows this list).
  3. Try breaking down that built-in DQN agent yourself and see which other hyperparameters you might tweak to fit your problem.
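
As a rough sketch of point 2, reusing the variables from the question's code and assuming a tf-agents version whose epsilon-greedy collect policy accepts a callable epsilon (DqnAgent forwards epsilon_greedy to it); the 50_000-step decay horizon and the 0.01 floor are placeholder values for illustration, not tuned settings:

# Replace the fixed epsilon_greedy=0.1 with a slowly decaying schedule.
# In eager mode, tf.compat.v1.train.polynomial_decay returns a no-argument
# callable that re-reads train_step_counter every time it is evaluated.
decaying_epsilon = tf.compat.v1.train.polynomial_decay(
    1.0,                     # start fully random
    train_step_counter,      # incremented by agent.train()
    50_000,                  # placeholder decay horizon
    end_learning_rate=0.01   # placeholder exploration floor
)

agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=decaying_epsilon,  # callable instead of a fixed 0.1
    gamma=0.99,                       # point 3: other knobs worth tuning
    target_update_period=100,
    train_step_counter=train_step_counter
)
agent.initialize()

This way the agent explores heavily early on, while the replay buffer is still full of random games, and gradually shifts to exploiting its Q-values as training progresses.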
