[RFC] A fault management module for token-level recovering in rollout phase #4355
Li-Yongwen
started this conversation in
RFC
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
RFC_REQ
[RFC] A fault management module for token-level recovering in rollout phase
Motivation
In current RL training, the rollout phase typically accounts for 80% of the total training time. If a failure occurs during training, especially during the rollout phase which is often the stage with the highest probability of failure, we have to roll back to the last ckpt and manually resume training, which means we have to rollout all over again. However, if we could implement token-level rollout data saving and automatic training recovery, we would be able to resume rollout at the token-level point where it was interrupted, It will significantly reduce the losses caused by failures and avoid manually resume training.
Proposed Design
Design Overview
We continuously save tokens to a ray queue during the actor rollout phase and parse them to a ray tokens_dict in the main process. Ray tokens_dict will automatically save its data to disk. After training resumes, we can load the data from tokens_dict or disk to continue actor rollout.
Code Example
Tokens Saving
Note
For hardware failures, it can also implement automatic fault recovery and node isolation through the deployment of Kubernetes combined with Volcano.
Beta Was this translation helpful? Give feedback.
All reactions