parking-v0: duration semantics yield 500-step episodes — RL-Zoo / SB3 default hyperparameters silently break #674

@Deepakgthomas

Summary

In recent highway-env releases (>=1.6, after the self.steps -> sub-step counting change), parking-v0 episodes effectively run 500 policy steps instead of the 100 the canonical RL-Zoo / SB3 recipes were tuned for.

Where the 5x comes from

Default parking-v0 config:

  • duration: 100
  • policy_frequency: 5
  • truncation: _is_truncated -> self.time >= duration

self.time advances by 1/policy_frequency = 0.2 per policy step, so:

steps_per_episode = duration / (1/policy_frequency) = 100 / 0.2 = 500
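To make the arithmetic easy to check, here is a minimal sketch that replays the truncation rule described above. It is not highway-env code: `steps_per_episode` is a hypothetical helper, and it uses exact fractions instead of floats so that rounding does not blur the count.

```python
from fractions import Fraction


def steps_per_episode(duration, policy_frequency):
    """Count policy steps until time >= duration (hypothetical helper).

    Assumes, as described above, that time advances by 1/policy_frequency
    per policy step and that the episode is truncated once time >= duration.
    """
    dt = Fraction(1, policy_frequency)  # simulated seconds per policy step
    time = Fraction(0)
    steps = 0
    while time < duration:
        time += dt
        steps += 1
    return steps


print(steps_per_episode(100, 5))  # 500, the current default
print(steps_per_episode(20, 5))   # 100, what the RL-Zoo recipe assumes
```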

SB3 / SBX count each policy step (1/policy_frequency simulated seconds) as one timestep, so from their side an episode now runs 500 timesteps. The published RL-Zoo recipes (rl-baselines3-zoo/hyperparams) for parking-v0, however, still assume 100:

parking-v0:
  learning_starts: 100
  replay_buffer_class: HerReplayBuffer
  replay_buffer_kwargs:
    max_episode_length: 100

So out-of-the-box on modern highway-env:

  • learning_starts (100) << actual_episode_length (500)
  • HER max_episode_length (100) << actual_episode_length (500)

This produces, for example, RuntimeError: Unable to sample before the end of the first episode. That is exactly the report in DLR-RM/rl-baselines3-zoo#433, where the maintainer's suggestion was a per-user workaround (raise learning_starts above the maximum env timesteps), not a fix for the underlying mismatch.
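The two mismatches can be caught before training with a small pre-flight check. This is only a sketch: `find_recipe_mismatches` is a hypothetical helper, not part of SB3 or RL-Zoo, and the recipe dict simply mirrors the YAML quoted above.

```python
def find_recipe_mismatches(hyperparams, episode_length):
    """Return messages for settings smaller than the real episode length.

    Hypothetical helper: it only compares numbers and does not touch SB3.
    """
    problems = []

    learning_starts = hyperparams.get("learning_starts")
    if learning_starts is not None and learning_starts < episode_length:
        problems.append(
            f"learning_starts ({learning_starts}) < episode length ({episode_length})"
        )

    her_kwargs = hyperparams.get("replay_buffer_kwargs", {})
    max_len = her_kwargs.get("max_episode_length")
    if max_len is not None and max_len < episode_length:
        problems.append(
            f"HER max_episode_length ({max_len}) < episode length ({episode_length})"
        )

    return problems


# Recipe values copied from the published parking-v0 hyperparameters.
recipe = {
    "learning_starts": 100,
    "replay_buffer_class": "HerReplayBuffer",
    "replay_buffer_kwargs": {"max_episode_length": 100},
}

for problem in find_recipe_mismatches(recipe, episode_length=500):
    print(problem)
```

With an episode length of 100 the same recipe yields no messages, which matches the setup the hyperparameters were tuned for.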

What we'd like to understand

Setting duration: 20 locally restores 100-step episodes (20 / 0.2 = 100) and every published RL-Zoo / SB3 hyperparameter (learning_starts=100, HER max_episode_length=100, n_timesteps=1e5) becomes consistent again without further changes.
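The value 20 follows from inverting the relation between duration and step count. A minimal sketch, where `duration_for_steps` is a hypothetical helper; how the resulting override is passed to the environment depends on the installed highway-env version and is not shown here.

```python
def duration_for_steps(target_steps, policy_frequency):
    """Simulated seconds needed for target_steps policy steps (hypothetical helper)."""
    return target_steps / policy_frequency


# Override that would restore the 100-step horizon at policy_frequency = 5.
config_override = {"duration": duration_for_steps(100, 5)}
print(config_override)  # {'duration': 20.0}
```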

Two concrete questions:

  1. Why was the episode horizon effectively increased 5x (duration reinterpreted from "policy steps" to "seconds of simulated time")? Is this intentional?
  2. Is duration: 20 an acceptable canonical fix to restore consistency with the documented RL-Zoo recipe, or is there a reason maintainers prefer to keep duration: 100 and instead update every dependent hyperparameter?

Thanks!
