Vectorized Gym environments: how to block the automatic environment reset when done = True


Question


When I run a "single" environment in Gym, there is no reset once done = True is reached.

When I use the vectorized environments, however, the reset values are returned as the next_state values immediately.

Is there a way to block that automatic reset behavior in the vectorized environments, or is there any other way to record the un-reset next_state value?

SINGLE ENV CODE:

import gym

env = gym.make("CartPole-v1")
current_state = env.reset()


for i in range(50):
    next_state, reward, done, info = env.step(1)
    # env_vect.observations refers to the vectorized env created further below;
    # it is printed only for comparison and stays frozen while this single env steps.
    print(current_state / next_state, done, current_state, next_state, env_vect.observations)
    current_state = next_state

SINGLE ENV RESULTS:


[0.9371 0.1632 0.9866 0.0424] False [ 0.0114 0.0381 -0.0195 -0.0132] [ 0.0121 0.2335 -0.0198 -0.312 ] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.7218 0.5444 0.7603 0.5108] False [ 0.0121 0.2335 -0.0198 -0.312 ] [ 0.0168 0.4289 -0.026 -0.6109] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.6618 0.6869 0.6806 0.6701] False [ 0.0168 0.4289 -0.026 -0.6109] [ 0.0254 0.6244 -0.0383 -0.9116] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.6701 0.7614 0.6772 0.7496] False [ 0.0254 0.6244 -0.0383 -0.9116] [ 0.0379 0.82 -0.0565 -1.2161] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.6977 0.8072 0.699 0.797 ] False [ 0.0379 0.82 -0.0565 -1.2161] [ 0.0543 1.0158 -0.0808 -1.5259] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.7276 0.8383 0.7259 0.8281] False [ 0.0543 1.0158 -0.0808 -1.5259] [ 0.0746 1.2118 -0.1113 -1.8427] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.7547 0.8607 0.7513 0.85 ] False [ 0.0746 1.2118 -0.1113 -1.8427] [ 0.0988 1.408 -0.1482 -2.1678] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.7782 0.8777 0.7736 0.8663] False [ 0.0988 1.408 -0.1482 -2.1678] [ 0.127 1.6042 -0.1915 -2.5023] [[-0.0429 0.194 -0.0462 -0.2908]]

[0.7983 0.8911 0.7928 0.8789] True [ 0.127 1.6042 -0.1915 -2.5023] [ 0.159 1.8003 -0.2416 -2.8471] [[-0.0429 0.194 -0.0462 -0.2908]]
When done is [True], the ratio is in line with the other ratios and the current values are not reset.


VECTORIZED ENV CODE:

from copy import deepcopy

nn = 1

#env_vect = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1").env for _ in range(nn)])
env_vect = gym.vector.make('CartPole-v1', num_envs=nn)

current_state = env_vect.reset()

print("current_state", current_state)
#print("self.env.state", env_vect.state)
print("self.env.state", env_vect.observations)

for i in range(50):
    next_state, reward, done, info = env_vect.step([1 for i in range(nn)])
    print(current_state / next_state, done, current_state, next_state, env_vect.observations)
    current_state = deepcopy(next_state)

VECTORIZED ENV RESULTS:


[[1.0269 0.1803 0.9242 0.1417]] [False] [[-0.0327 0.043 -0.0119 -0.0489]] [[-0.0319 0.2382 -0.0129 -0.3454]] [[-0.0319 0.2382 -0.0129 -0.3454]]

[[1.1757 0.5495 0.6516 0.5379]] [False] [[-0.0319 0.2382 -0.0129 -0.3454]] [[-0.0271 0.4335 -0.0198 -0.6421]] [[-0.0271 0.4335 -0.0198 -0.6421]]

[[1.4702 0.6893 0.6069 0.6824]] [False] [[-0.0271 0.4335 -0.0198 -0.6421]] [[-0.0184 0.6289 -0.0327 -0.941 ]] [[-0.0184 0.6289 -0.0327 -0.941 ]]

[[3.1457 0.7628 0.6345 0.7566]] [False] [[-0.0184 0.6289 -0.0327 -0.941 ]] [[-0.0059 0.8245 -0.0515 -1.2437]] [[-0.0059 0.8245 -0.0515 -1.2437]]

[[-0.5516 0.8081 0.6743 0.8013]] [False] [[-0.0059 0.8245 -0.0515 -1.2437]] [[ 0.0106 1.0202 -0.0764 -1.5521]] [[ 0.0106 1.0202 -0.0764 -1.5521]]

[[0.3425 0.8389 0.711 0.8311]] [False] [[ 0.0106 1.0202 -0.0764 -1.5521]] [[ 0.031 1.2162 -0.1074 -1.8676]] [[ 0.031 1.2162 -0.1074 -1.8676]]

[[0.5606 0.8611 0.742 0.8522]] [False] [[ 0.031 1.2162 -0.1074 -1.8676]] [[ 0.0554 1.4123 -0.1448 -2.1916]] [[ 0.0554 1.4123 -0.1448 -2.1916]]

[[0.6621 0.878 0.7676 0.8679]] [False] [[ 0.0554 1.4123 -0.1448 -2.1916]] [[ 0.0836 1.6085 -0.1886 -2.5252]] [[ 0.0836 1.6085 -0.1886 -2.5252]]

[[ -27.6983 -180.4272 4.2333 51.6735]] [ True] [[ 0.0836 1.6085 -0.1886 -2.5252]] [[-0.003 -0.0089 -0.0445 -0.0489]] [[-0.003 -0.0089 -0.0445 -0.0489]]

When done is [True], the ratio is high and the current values have already been reset.


Answer 1

Score: 1


The solution is a feature of Gym that inserts a ["terminal_observation"] key into the "info" dictionary when "done = True". You can extract it and swap your next_state observation with it.

You cannot tell that Gym does this just by looking at the environment's step() function. I added my own key to info when "done = True", and when I printed the results I saw the ["terminal_observation"] entries right next to them.
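
For illustration, here is a minimal sketch of that extraction, assuming an older Gym release in which the vectorized step() returns four values and info is a sequence of per-env dicts (newer Gym/Gymnasium releases use a different info layout and may name the key "final_observation" instead):

from copy import deepcopy

import gym
import numpy as np

nn = 1
env_vect = gym.vector.make("CartPole-v1", num_envs=nn)
current_state = env_vect.reset()

for i in range(50):
    next_state, reward, done, info = env_vect.step([1 for _ in range(nn)])
    # Copy next_state, then overwrite any auto-reset entries with the
    # pre-reset observation that Gym stashed in the info dictionary.
    true_next_state = np.array(next_state, copy=True)
    for idx in range(nn):
        if done[idx]:
            true_next_state[idx] = info[idx]["terminal_observation"]
    # Record (current_state, true_next_state) as the transition;
    # next_state already belongs to the freshly reset episode.
    current_state = deepcopy(next_state)

This does not block the auto-reset itself, but it recovers the un-reset next_state, so the printed ratio stays in line with the other steps.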
