Why is my DQN-agent's training so inefficient?
Question
I am trying to train an agent to play tic-tac-toe perfectly as the second player (the first player moves randomly) with the DQN agent from tf-agents, but my training is extremely slow.
Over 100_000 steps, the model did not improve its results in any way.
I understand that the model should not be fully trained within 100,000 steps, but some results should definitely have appeared during this period.
To be honest, I don't fully understand what's wrong with my learning code...
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

# Graphic and RandomTicTacToeEnvironment are defined in the linked notebook.

LOG_PERIOD = 1000
PRINT_PERIOD = 100
LEARNING_RATE = 0.001
NUM_ITERATIONS = 100_000

graph = Graphic(LOG_PERIOD)
tf_env = TFPyEnvironment(RandomTicTacToeEnvironment())

# Q-network with a single fully connected hidden layer of 100 units
q_net = QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(100,)
)

train_step_counter = tf.Variable(0)
agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=0.1,
    train_step_counter=train_step_counter
)
agent.initialize()

eval_policy = agent.policy
collect_policy = agent.collect_policy

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=1000
)

# Collects 10 environment steps per call with the agent's collect policy
collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)

collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)

initial_collect_policy = random_tf_policy.RandomTFPolicy(
    tf_env.time_step_spec(),
    tf_env.action_spec()
)

# Sample mini-batches of 8 two-step trajectories from the replay buffer
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=8,
    num_steps=2,
    single_deterministic_pass=False
).prefetch(3)
iterator = iter(dataset)

# Driver built with the random policy for initial collection
dynamic_step_driver.DynamicStepDriver(
    tf_env,
    initial_collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=10
)

time_step = tf_env.reset()
for _ in np.arange(NUM_ITERATIONS + 1):
    time_step, _ = collect_driver.run(time_step)
    experience, _ = next(iterator)
    step = agent.train_step_counter.numpy()
    train_loss = agent.train(experience).loss
    if step % PRINT_PERIOD == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))
    for reward in tf.reshape(experience.reward, [-1]):
        graph.check(step, reward)
What am I doing wrong?
Full project with code of env: https://colab.research.google.com/drive/1myp2aRAd03PP2RoPq1L9rxuaxHcnJf_U?usp=sharing
Answer 1
Score: 1
Most likely it's your epsilon value, though that's not necessarily the only problem.
- First of all, don't fixate on the 100,000 steps; it could definitely train in fewer than that, especially in such a small, finite space.
- Introduce a better epsilon schedule, one that decays much more slowly, and see what happens. Test a range of values and see whether learning improves and fewer steps are needed (a sketch follows this list).
- Try breaking down that built-in DQN agent yourself and see which other hyperparameters you might tweak to fit your problem.
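For illustration, here is a minimal sketch of what a decaying epsilon could look like in this setup. It reuses the names from the question (tf_env, q_net, train_step_counter, NUM_ITERATIONS, LEARNING_RATE) and assumes a tf-agents version whose DqnAgent accepts a callable for epsilon_greedy; the schedule endpoints (1.0 down to 0.05 over half the iterations) and the extra hyperparameters gamma and target_update_period are illustrative guesses, not tuned values.
# Sketch only: epsilon decays from 1.0 to 0.05 as train_step_counter grows.
epsilon_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1.0,        # starting exploration rate
    decay_steps=NUM_ITERATIONS // 2,  # reach the final value halfway through training
    end_learning_rate=0.05)           # final exploration rate

agent = DqnAgent(
    time_step_spec=tf_env.time_step_spec(),
    action_spec=tf_env.action_spec(),
    q_network=q_net,
    optimizer=Adam(learning_rate=LEARNING_RATE),
    td_errors_loss_fn=common.element_wise_squared_loss,
    # passing a callable lets epsilon track the current train step
    epsilon_greedy=lambda: epsilon_schedule(train_step_counter),
    # other knobs worth experimenting with (illustrative values, not tuned)
    gamma=0.99,
    target_update_period=100,
    train_step_counter=train_step_counter)
Because the schedule is driven by the same train_step_counter that agent.train increments, exploration shrinks automatically as training progresses instead of staying at a flat 0.1.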