Summary
In recent highway-env releases (>=1.6, after the self.steps -> sub-step counting change), parking-v0 episodes effectively run 500 policy steps instead of the 100 the canonical RL-Zoo / SB3 recipes were tuned for.
Where the 5x comes from
Default parking-v0 config:
- duration: 100
- policy_frequency: 5
- truncation: _is_truncated -> self.time >= duration
self.time advances by 1/policy_frequency = 0.2 per policy step, so:
steps_per_episode = duration / (1/policy_frequency) = 100 / 0.2 = 500
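The arithmetic above can be sketched as a small helper. This is only an illustration: `steps_per_episode` is a hypothetical function, not part of highway-env, and it assumes truncation fires as soon as `self.time >= duration`, ignoring any floating-point drift from repeatedly adding `1/policy_frequency`.

```python
import math


def steps_per_episode(duration, policy_frequency):
    """Number of policy steps before truncation (hypothetical helper).

    Assumes self.time grows by 1/policy_frequency per policy step and the
    episode is truncated once self.time >= duration.
    """
    return math.ceil(duration * policy_frequency)


default_steps = steps_per_episode(duration=100, policy_frequency=5)
proposed_steps = steps_per_episode(duration=20, policy_frequency=5)
print(default_steps, proposed_steps)
```

With the default config this gives 500 steps; with `duration: 20` it gives 100.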
SB3 / SBX count every policy step (one env.step() call, i.e. 1/policy_frequency simulated seconds) as one timestep, so from their side an episode now runs 500 timesteps. The published RL-Zoo recipes (rl-baselines3-zoo/hyperparams) for parking-v0 still assume 100:
parking-v0:
  learning_starts: 100
  replay_buffer_class: HerReplayBuffer
  replay_buffer_kwargs:
    max_episode_length: 100
So out-of-the-box on modern highway-env:
- learning_starts (100) << actual_episode_length (500)
- HER max_episode_length (100) << actual_episode_length (500)
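A minimal sketch of the same comparison as a check one could run before training. `find_horizon_mismatches` and the `recipe` dict are hypothetical names that mirror the YAML keys quoted above; neither comes from SB3 or RL-Zoo. It assumes a setting is only a problem when it is strictly smaller than the real episode length.

```python
def find_horizon_mismatches(hyperparams, episode_length):
    """List step-count settings that are smaller than the real episode length."""
    mismatches = []
    # learning_starts below one full episode means sampling may begin
    # before the first episode has finished.
    if hyperparams.get("learning_starts", 0) < episode_length:
        mismatches.append("learning_starts")
    # HER's max_episode_length must cover a full episode as well.
    her_kwargs = hyperparams.get("replay_buffer_kwargs", {})
    max_len = her_kwargs.get("max_episode_length")
    if max_len is not None and max_len < episode_length:
        mismatches.append("max_episode_length")
    return mismatches


# Values copied from the recipe excerpt in this issue.
recipe = {
    "learning_starts": 100,
    "replay_buffer_kwargs": {"max_episode_length": 100},
}
print(find_horizon_mismatches(recipe, episode_length=500))
print(find_horizon_mismatches(recipe, episode_length=100))
```

Both settings are flagged for 500-step episodes, and neither is flagged for 100-step episodes.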
This produces e.g. RuntimeError: Unable to sample before the end of the first episode. That is exactly the report in DLR-RM/rl-baselines3-zoo#433, where the maintainer's suggestion was a per-user workaround (raise learning_starts above the maximum env timesteps), not a fix to the underlying mismatch.
What we'd like to understand
Setting duration: 20 locally restores 100-step episodes (20 / 0.2 = 100) and every published RL-Zoo / SB3 hyperparameter (learning_starts=100, HER max_episode_length=100, n_timesteps=1e5) becomes consistent again without further changes.
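For reference, a sketch of how the local override could be applied. It assumes that `import highway_env` registers `parking-v0` with Gymnasium (some releases may need an explicit registration call) and that the unwrapped env exposes `configure()`, with the new config taking effect on the next `reset()`. `duration_for_target_steps` and `make_parking_env` are hypothetical helper names.

```python
def duration_for_target_steps(target_steps, policy_frequency):
    """Simulated seconds needed for `target_steps` policy steps."""
    return target_steps / policy_frequency


def make_parking_env(target_steps=100, policy_frequency=5):
    """Build parking-v0 with a duration that yields `target_steps` steps.

    Imports are local so this file can be loaded without highway-env.
    """
    import gymnasium as gym
    import highway_env  # noqa: F401  (assumed to register parking-v0 on import)

    env = gym.make("parking-v0")
    # Assumption: configure() updates the env config dict in place.
    env.unwrapped.configure({
        "duration": duration_for_target_steps(target_steps, policy_frequency),
        "policy_frequency": policy_frequency,
    })
    env.reset()  # apply the updated config
    return env


print(duration_for_target_steps(100, 5))
```

Calling `make_parking_env()` with the defaults should correspond to the `duration: 20` setting described above.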
Two concrete questions:
- Why was the episode horizon effectively increased 5x (duration reinterpreted from "policy steps" to "seconds of simulated time")? Is this intentional?
- Is duration: 20 an acceptable canonical fix to restore consistency with the documented RL-Zoo recipe, or is there a reason maintainers prefer to keep duration: 100 and instead update every dependent hyperparameter?
Thanks!