parking-v0: `duration` semantics yield 500-step episodes — RL-Zoo / SB3 default hyperparameters silently break

### Summary
In recent highway-env releases (>=1.6, after the `self.steps` -> sub-step counting change), `parking-v0` episodes effectively run **500 policy steps** instead of the **100** the canonical RL-Zoo / SB3 recipes were tuned for.

### Where the 5x comes from

Default `parking-v0` config:
- `duration: 100`
- `policy_frequency: 5`
- truncation: `_is_truncated -> self.time >= duration`

`self.time` advances by `1/policy_frequency = 0.2` per policy step, so:

`steps_per_episode = duration / (1/policy_frequency) = 100 / 0.2 = 500`
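The arithmetic above can be sketched as a small helper. This is only an illustration of the truncation rule as described in this issue (`time >= duration`, with `time` advancing by `1/policy_frequency` per policy step). The function name `steps_per_episode` is hypothetical and not part of highway-env, and the sketch ignores any floating-point accumulation in the real environment clock.

```python
import math


def steps_per_episode(duration: float, policy_frequency: float) -> int:
    """Policy steps until truncation, assuming time advances by
    1/policy_frequency per policy step and truncation fires when
    time >= duration (the rule described in this issue)."""
    # After n policy steps, time = n / policy_frequency.
    # The smallest n with n / policy_frequency >= duration is
    # ceil(duration * policy_frequency).
    return math.ceil(duration * policy_frequency)


default_steps = steps_per_episode(duration=100, policy_frequency=5)
shortened_steps = steps_per_episode(duration=20, policy_frequency=5)
print(default_steps, shortened_steps)
```

With the default config this gives 500, and with `duration: 20` it gives 100, matching the numbers quoted in this issue.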

SB3 / SBX count one timestep per `env.step()` call, i.e. one per policy step, so from their side an episode now runs 500 timesteps. The published RL-Zoo recipes ([`rl-baselines3-zoo/hyperparams`](https://github.com/DLR-RM/rl-baselines3-zoo/tree/master/hyperparams)) for `parking-v0`, however, still assume 100:

```yaml
parking-v0:
  learning_starts: 100
  replay_buffer_class: HerReplayBuffer
  replay_buffer_kwargs:
    max_episode_length: 100
```

So out-of-the-box on modern highway-env:
- `learning_starts (100) << actual_episode_length (500)`
- HER `max_episode_length (100) << actual_episode_length (500)`

This produces e.g. `RuntimeError: Unable to sample before the end of the first episode.`, which is exactly the report in [DLR-RM/rl-baselines3-zoo#433](https://github.com/DLR-RM/rl-baselines3-zoo/issues/433). There, the maintainer's suggestion was a per-user workaround (*raise `learning_starts` above the maximum env timesteps*), not a fix to the underlying mismatch.
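The mismatch can be made explicit with a small consistency check. This is a sketch under assumptions: `check_hyperparams` is a hypothetical helper, not part of SB3 or RL-Zoo, and the exact boundary conditions (`<` rather than `<=`) are an assumption based on the observation in this issue that 100-step episodes work with the published values.

```python
def check_hyperparams(
    episode_length: int,
    learning_starts: int,
    her_max_episode_length: int,
) -> list[str]:
    """Return human-readable warnings for settings that are smaller
    than the actual episode length (hypothetical helper)."""
    problems = []
    if learning_starts < episode_length:
        # Training would begin before the first episode has finished.
        problems.append(
            f"learning_starts ({learning_starts}) < episode length ({episode_length})"
        )
    if her_max_episode_length < episode_length:
        # The HER buffer would be sized for shorter episodes than occur.
        problems.append(
            f"HER max_episode_length ({her_max_episode_length}) "
            f"< episode length ({episode_length})"
        )
    return problems


# Published recipe values against the 500-step and 100-step horizons.
modern = check_hyperparams(500, learning_starts=100, her_max_episode_length=100)
legacy = check_hyperparams(100, learning_starts=100, her_max_episode_length=100)
for line in modern:
    print(line)
```

Under these assumptions the 500-step horizon triggers both warnings, while the 100-step horizon triggers none.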

### What we'd like to understand

Setting `duration: 20` locally restores 100-step episodes (`20 / 0.2 = 100`) and every published RL-Zoo / SB3 hyperparameter (`learning_starts=100`, HER `max_episode_length=100`, `n_timesteps=1e5`) becomes consistent again without further changes.
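To confirm the horizon empirically, a rollout counter along these lines could be used. It is a minimal sketch assuming the Gymnasium-style API (`reset()` returning `(obs, info)`, `step()` returning a 5-tuple). `ClockOnlyEnv` is a hypothetical stand-in that models only the truncation rule described in this issue, not the real parking environment, so that the snippet runs without highway-env installed. On the real environment, episodes may also end early through termination (for example on success), so a measured length is an upper bound only when no termination occurs.

```python
def measure_episode_length(env, max_steps: int = 10_000) -> int:
    """Count env.step() calls until the episode ends (Gymnasium-style API)."""
    env.reset()
    for n in range(1, max_steps + 1):
        _, _, terminated, truncated, _ = env.step(env.action_space.sample())
        if terminated or truncated:
            return n
    return max_steps


class _DummySpace:
    def sample(self):
        return 0


class ClockOnlyEnv:
    """Hypothetical stub: truncates when steps / policy_frequency >= duration."""

    def __init__(self, duration: float, policy_frequency: int):
        self.duration = duration
        self.policy_frequency = policy_frequency
        self.action_space = _DummySpace()
        self.steps = 0

    def reset(self):
        self.steps = 0
        return None, {}

    def step(self, action):
        self.steps += 1
        # Derive time from the integer step count to avoid float drift.
        time = self.steps / self.policy_frequency
        truncated = time >= self.duration
        return None, 0.0, False, truncated, {}


default_len = measure_episode_length(ClockOnlyEnv(duration=100, policy_frequency=5))
short_len = measure_episode_length(ClockOnlyEnv(duration=20, policy_frequency=5))
print(default_len, short_len)
```

Replacing the stub with the actual `parking-v0` environment would show whether a given highway-env version really produces the 500-step horizon.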

**Two concrete questions:**

1. **Why was the episode horizon effectively increased 5x (`duration` reinterpreted from "policy steps" to "seconds of simulated time")?** Was this change intentional?
2. **Is `duration: 20` an acceptable canonical fix** to restore consistency with the documented RL-Zoo recipe, or is there a reason maintainers prefer to keep `duration: 100` and instead update every dependent hyperparameter?

Thanks!
