Hello, and many thanks for Brax! I'm using Brax with MJX for motor-control learning with RL. I'm trying to implement a variant of PPO, and I'm working my way through Brax's version to understand it. My previous experience is with non-accelerated, non-massively-parallel implementations of PPO, so I was hoping to get some additional clarity on how experience collection is conducted, and in particular how episode management is handled. In …

Am I missing some obvious way environments are reset on their own within an epoch?
Hey Balint,
Answers to your questions:
Envs are reset automatically whenever `done == True`: https://github.com/google/brax/blob/main/brax/envs/wrappers/training.py#L151

So that means that in a big batch of `num_envs` env States, you may see different sim times, for example, if one of the envs in the batch terminated early.
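If it helps, here is a rough sketch of what that auto-reset amounts to (simplified and not the actual wrapper code; `env_step` and the `first_obs` cache key are illustrative stand-ins): wherever an env in the batch reports `done`, its pipeline state and observation are overwritten with the cached initial ones, while the other envs keep stepping.

```python
import jax
import jax.numpy as jnp

def _where_done(done, reset_value, current_value):
  """Pick the cached reset value for envs that are done, else keep the current one."""
  # Broadcast the per-env done flag (shape [num_envs]) against higher-rank leaves.
  done = jnp.reshape(done, [done.shape[0]] + [1] * (reset_value.ndim - 1))
  return jnp.where(done, reset_value, current_value)

def auto_reset_step(env_step, state, action):
  """Step a batch of envs, then splice cached initial states into finished envs."""
  next_state = env_step(state, action)
  pipeline_state = jax.tree_util.tree_map(
      lambda reset, cur: _where_done(next_state.done, reset, cur),
      state.info['first_pipeline_state'], next_state.pipeline_state)
  obs = _where_done(next_state.done, state.info['first_obs'], next_state.obs)
  return next_state.replace(pipeline_state=pipeline_state, obs=obs)
```

Because the replacement happens per env inside the batch, each env keeps rolling without any host-side bookkeeping, which is exactly why the sim times in a batch can drift apart.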
Think of `state.info['first_pipeline_state']` (https://github.com/google/brax/blob/main/brax/training/agents/ppo/train.py#L220) as a cached pool of initial states. The cache size is `num_envs` - if you think this pool is too small of an initial set of states for RL to explore from, y…

Feel free to follow up with any other questions or close if this all checks out to you.
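To make the "cached pool" idea concrete, here is a hedged sketch of how a pool of `num_envs` initial states can be built at the start of training (simplified; `reset_fn` and the key handling are illustrative stand-ins, not the exact train.py code):

```python
import jax

def make_initial_pool(reset_fn, rng, num_envs):
  """Reset num_envs environments in parallel to build the initial-state pool."""
  key_envs = jax.random.split(rng, num_envs)   # one PRNG key per env
  env_state = reset_fn(key_envs)               # batched State with leading dim num_envs
  # The cached pipeline states live in state.info; auto-reset later re-uses
  # exactly these num_envs states whenever an env in the batch finishes.
  env_state.info['first_pipeline_state'] = env_state.pipeline_state
  return env_state
```

Since the pool is fixed at `num_envs` states, every auto-reset during a rollout restarts an env from one of those same cached states rather than sampling a fresh one.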