Vectorized GYM environments: how to block automatic environment reset on done = True
Question
When I run a "single" environment in GYM, there is no reset once done = True is reached.
When I use the vectorized environments, though, the reset values are returned as the next_state values immediately.
Is there a way to block that automatic reset behavior in the vectorized environments, or is there any other way to record the un-reset next_state value?
SINGLE ENV CODE:
import gym

env = gym.make("CartPole-v1")
current_state = env.reset()
for i in range(50):
    next_state, reward, done, info = env.step(1)
    # env_vect here is presumably a vector env left over from an earlier run (see the vectorized snippet below)
    print(current_state / next_state, done, current_state, next_state, env_vect.observations)
    current_state = next_state
SINGLE ENV RESULTS:
[0.9371 0.1632 0.9866 0.0424] False [ 0.0114 0.0381 -0.0195 -0.0132] [ 0.0121 0.2335 -0.0198 -0.312 ] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.7218 0.5444 0.7603 0.5108] False [ 0.0121 0.2335 -0.0198 -0.312 ] [ 0.0168 0.4289 -0.026 -0.6109] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.6618 0.6869 0.6806 0.6701] False [ 0.0168 0.4289 -0.026 -0.6109] [ 0.0254 0.6244 -0.0383 -0.9116] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.6701 0.7614 0.6772 0.7496] False [ 0.0254 0.6244 -0.0383 -0.9116] [ 0.0379 0.82 -0.0565 -1.2161] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.6977 0.8072 0.699 0.797 ] False [ 0.0379 0.82 -0.0565 -1.2161] [ 0.0543 1.0158 -0.0808 -1.5259] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.7276 0.8383 0.7259 0.8281] False [ 0.0543 1.0158 -0.0808 -1.5259] [ 0.0746 1.2118 -0.1113 -1.8427] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.7547 0.8607 0.7513 0.85 ] False [ 0.0746 1.2118 -0.1113 -1.8427] [ 0.0988 1.408 -0.1482 -2.1678] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.7782 0.8777 0.7736 0.8663] False [ 0.0988 1.408 -0.1482 -2.1678] [ 0.127 1.6042 -0.1915 -2.5023] [[-0.0429 0.194 -0.0462 -0.2908]]
[0.7983 0.8911 0.7928 0.8789] True [ 0.127 1.6042 -0.1915 -2.5023] [ 0.159 1.8003 -0.2416 -2.8471] [[-0.0429 0.194 -0.0462 -0.2908]]
When done is [True], the ratio is in line with the other ratios and the current values are not reset.
VECTORIZED ENV CODE:
from copy import deepcopy

nn = 1
#env_vect = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1").env for _ in range(nn)])
env_vect = gym.vector.make('CartPole-v1', num_envs=nn)
current_state = env_vect.reset()
print("current_state", current_state)
#print("self.env.state", env_vect.state)
print("self.env.state", env_vect.observations)
for i in range(50):
    next_state, reward, done, info = env_vect.step([1 for i in range(nn)])
    print(current_state / next_state, done, current_state, next_state, env_vect.observations)
    current_state = deepcopy(next_state)
VECTORIZED ENV RESULTS:
[[1.0269 0.1803 0.9242 0.1417]] [False] [[-0.0327 0.043 -0.0119 -0.0489]] [[-0.0319 0.2382 -0.0129 -0.3454]] [[-0.0319 0.2382 -0.0129 -0.3454]]
[[1.1757 0.5495 0.6516 0.5379]] [False] [[-0.0319 0.2382 -0.0129 -0.3454]] [[-0.0271 0.4335 -0.0198 -0.6421]] [[-0.0271 0.4335 -0.0198 -0.6421]]
[[1.4702 0.6893 0.6069 0.6824]] [False] [[-0.0271 0.4335 -0.0198 -0.6421]] [[-0.0184 0.6289 -0.0327 -0.941 ]] [[-0.0184 0.6289 -0.0327 -0.941 ]]
[[3.1457 0.7628 0.6345 0.7566]] [False] [[-0.0184 0.6289 -0.0327 -0.941 ]] [[-0.0059 0.8245 -0.0515 -1.2437]] [[-0.0059 0.8245 -0.0515 -1.2437]]
[[-0.5516 0.8081 0.6743 0.8013]] [False] [[-0.0059 0.8245 -0.0515 -1.2437]] [[ 0.0106 1.0202 -0.0764 -1.5521]] [[ 0.0106 1.0202 -0.0764 -1.5521]]
[[0.3425 0.8389 0.711 0.8311]] [False] [[ 0.0106 1.0202 -0.0764 -1.5521]] [[ 0.031 1.2162 -0.1074 -1.8676]] [[ 0.031 1.2162 -0.1074 -1.8676]]
[[0.5606 0.8611 0.742 0.8522]] [False] [[ 0.031 1.2162 -0.1074 -1.8676]] [[ 0.0554 1.4123 -0.1448 -2.1916]] [[ 0.0554 1.4123 -0.1448 -2.1916]]
[[0.6621 0.878 0.7676 0.8679]] [False] [[ 0.0554 1.4123 -0.1448 -2.1916]] [[ 0.0836 1.6085 -0.1886 -2.5252]] [[ 0.0836 1.6085 -0.1886 -2.5252]]
[[ -27.6983 -180.4272 4.2333 51.6735]] [ True] [[ 0.0836 1.6085 -0.1886 -2.5252]] [[-0.003 -0.0089 -0.0445 -0.0489]] [[-0.003 -0.0089 -0.0445 -0.0489]]
When done is [True], the ratio jumps and the current values are reset.
Answer 1
Score: 1
The solution is a feature of GYM: when done = True, the vectorized environment inserts a "terminal_observation" key into the "info" dictionary. You can extract it and swap your next_state observation with it.
You cannot tell that it does this just by looking at the environment's step() function. I added my own key to info when done = True, and when I printed the results I saw the "terminal_observation" entries right next to them.
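A minimal sketch of that approach, assuming an older Gym release (the 4-tuple step API used in the question) in which the vector env returns one info dict per sub-environment and stores the pre-reset observation under the "terminal_observation" key (newer gym/gymnasium releases rename it to "final_observation" and change the info layout):

from copy import deepcopy
import gym

nn = 1
env_vect = gym.vector.make('CartPole-v1', num_envs=nn)
current_state = env_vect.reset()
for i in range(50):
    next_state, reward, done, info = env_vect.step([1 for _ in range(nn)])
    # copy so the auto-reset observations can be overwritten per environment
    true_next_state = deepcopy(next_state)
    for env_idx, d in enumerate(done):
        if d:
            # swap in the real terminal observation saved by the vector env
            true_next_state[env_idx] = info[env_idx]["terminal_observation"]
    print(current_state / true_next_state, done, current_state, true_next_state)
    current_state = deepcopy(next_state)

With this swap, the ratio printed at done = True stays in line with the single-environment run, while current_state still picks up the auto-reset observation so the next episode continues normally.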