Hello, and many thanks for Brax! I'm using Brax with MJX for motor-control learning with RL. I'm trying to implement a variant of PPO, and I'm working my way through Brax's version to understand it. My previous experience is with non-accelerated, non-massively-parallel implementations of PPO, so I was hoping to get some additional clarity on how experience collection is conducted, and in particular how episode management is handled. In …

Am I missing some obvious way environments are reset on their own within an epoch?
Hey Balint,
Answers to your questions:
Envs are reset automatically whenever `done == True`: https://github.com/google/brax/blob/main/brax/envs/wrappers/training.py#L151

So that means that in a big batch of `num_envs` env States, you may see different sim times, for example, if one of the envs in the batch terminated early.
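If it helps, here is a rough sketch of what that auto-reset amounts to (simplified and not the actual wrapper code; `env_step` and the `first_obs` cache key are illustrative stand-ins): wherever an env in the batch reports `done`, its pipeline state and observation are overwritten with the cached initial ones, while the other envs keep stepping.

```python
import jax
import jax.numpy as jnp

def _where_done(done, reset_value, current_value):
  """Pick the cached reset value for envs that are done, else keep the current one."""
  # Broadcast the per-env done flag (shape [num_envs]) against higher-rank leaves.
  done = jnp.reshape(done, [done.shape[0]] + [1] * (reset_value.ndim - 1))
  return jnp.where(done, reset_value, current_value)

def auto_reset_step(env_step, state, action):
  """Step a batch of envs, then splice cached initial states into finished envs."""
  next_state = env_step(state, action)
  pipeline_state = jax.tree_util.tree_map(
      lambda reset, cur: _where_done(next_state.done, reset, cur),
      state.info['first_pipeline_state'], next_state.pipeline_state)
  obs = _where_done(next_state.done, state.info['first_obs'], next_state.obs)
  return next_state.replace(pipeline_state=pipeline_state, obs=obs)
```

Because the replacement happens per env inside the batch, each env keeps rolling without any host-side bookkeeping, which is exactly why the sim times in a batch can drift apart.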
Think of `state.info['first_pipeline_state']` (https://github.com/google/brax/blob/main/brax/training/agents/ppo/train.py#L220) as a cached pool of initial states. The cache size is `num_envs` - if you think this pool is too small of an initial set of states for RL to explore from, y…

Feel free to follow up with any other questions or close if this all checks out to you.
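To make the "cached pool" idea concrete, here is a hedged sketch of how a pool of `num_envs` initial states can be built at the start of training (simplified; `reset_fn` and the key handling are illustrative stand-ins, not the exact train.py code):

```python
import jax

def make_initial_pool(reset_fn, rng, num_envs):
  """Reset num_envs environments in parallel to build the initial-state pool."""
  key_envs = jax.random.split(rng, num_envs)   # one PRNG key per env
  env_state = reset_fn(key_envs)               # batched State with leading dim num_envs
  # The cached pipeline states live in state.info; auto-reset later re-uses
  # exactly these num_envs states whenever an env in the batch finishes.
  env_state.info['first_pipeline_state'] = env_state.pipeline_state
  return env_state
```

Since the pool is fixed at `num_envs` states, every auto-reset during a rollout restarts an env from one of those same cached states rather than sampling a fresh one.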