[RFC] A fault management module for token-level recovering in rollout phase #4355

Li-Yongwen · 2025-11-29T09:57:04Z

RFC_REQ

[RFC] A fault management module for token-level recovering in rollout phase

Motivation

In current RL training, the rollout phase typically accounts for 80% of the total training time, and it is often the stage with the highest probability of failure. When a failure occurs, we have to roll back to the last checkpoint and resume training manually, which means redoing the entire rollout. If we implement token-level saving of rollout data together with automatic training recovery, rollout can resume from the token at which it was interrupted. This significantly reduces the loss caused by failures and removes the need to resume training manually.

Proposed Design

Design Overview

Use case

config	value
algorithm.adv_estimator	grpo
actor_rollout_ref.actor.strategy	megatron
actor_rollout_ref.rollout.name	vllm
actor_rollout_ref.rollout.mode	async
trainer.save_freq	1

Token-level rollout data saving and recovery:

During the actor rollout phase, we continuously put newly generated tokens into a Ray queue, and the main process parses them into a Ray tokens_dict. The tokens_dict automatically saves its data to disk. After training resumes, we load the data from the tokens_dict or from disk and continue the actor rollout.
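The save-and-resume flow described above could be sketched roughly as follows. This is a minimal, Ray-free illustration and not the RFC's implementation: `TokenStore`, `resume_request`, the JSON snapshot format, and the string sample ids are all hypothetical names and choices made for this example. It also assumes that a request is resumed by resubmitting the original prompt plus the tokens already generated, with the generation budget reduced accordingly, which the RFC does not specify.

```python
import json
import os
import tempfile


class TokenStore:
    """Hypothetical stand-in for tokens_dict: sample id -> generated token ids."""

    def __init__(self, path):
        self.path = path
        self.tokens = {}    # sample id (str) -> token ids generated so far
        self.finished = {}  # sample id (str) -> True once generation completed

    def append(self, sample_id, new_token_ids, finished=False):
        # Accumulate the newly generated tokens for one sample.
        self.tokens.setdefault(sample_id, []).extend(new_token_ids)
        self.finished[sample_id] = self.finished.get(sample_id, False) or finished

    def save(self):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated snapshot behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump({"tokens": self.tokens, "finished": self.finished}, f)
        os.replace(tmp_path, self.path)

    @classmethod
    def load(cls, path):
        # Restore the last snapshot, or start empty if none exists.
        store = cls(path)
        if os.path.exists(path):
            with open(path) as f:
                data = json.load(f)
            store.tokens = data["tokens"]
            store.finished = data["finished"]
        return store


def resume_request(store, sample_id, prompt_ids, max_tokens):
    """Return (prompt to resubmit, remaining budget), or None if already finished."""
    if store.finished.get(sample_id, False):
        return None
    done = store.tokens.get(sample_id, [])
    return prompt_ids + done, max_tokens - len(done)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "rollout_tokens.json")
    store = TokenStore(path)
    store.append("sample-0", [11, 12, 13])
    store.append("sample-1", [21], finished=True)
    store.save()

    # Simulate a restart: reload from disk and decide what is left to generate.
    restored = TokenStore.load(path)
    print(resume_request(restored, "sample-0", [1, 2], max_tokens=8))
    print(resume_request(restored, "sample-1", [1, 2], max_tokens=8))
```

In this sketch, a sample that was interrupted after three tokens is resubmitted with those tokens appended to its prompt, while a sample that had already finished is skipped.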

Automatic training recovery:

Code Example

Tokens Saving

# Put tokens into the tokens queue.
# Excerpt of a replacement for AsyncLLM._run_output_handler; the regular
# output-processing logic is omitted here.
import asyncio

from vllm.v1.engine.async_llm import AsyncLLM


def _run_output_handler(self, tokens_queue=None):
    # Bind the engine core locally so the task does not reference `self`.
    engine_core = self.engine_core

    async def output_handler(q):
        while True:
            outputs = await engine_core.get_output_async()
            if q is not None:
                req_info = {}
                for output in outputs.outputs:
                    req_info[output.request_id] = {
                        'new_tokens_ids': output.new_token_ids,
                        'finished': output.finished,
                    }
                # Send one message per engine step, after all requests are collected.
                if req_info:
                    await q.put_async(req_info)

    self.output_handler = asyncio.create_task(output_handler(tokens_queue))

# Get tokens from the tokens queue.
import ray


class FaultMgr:
    @classmethod
    def catch_rollout_tokens(cls):
        @ray.remote(num_cpus=1)
        def run(q, td):
            # Long-lived Ray task that drains the tokens queue.
            while True:
                req_info = q.get()
                print(f"[fault manager] catch tokens {req_info}")
                if isinstance(req_info, tuple):
                    # (request_id, global_id) pair: record the mapping.
                    request_id, global_id = req_info
                    cls.request_global_id_map[request_id] = global_id
                elif isinstance(req_info, dict):
                    # Per-request token updates: parse them into tokens_dict.
                    cls._parse_req_tokens(req_info, td)

        run.remote(cls.tokens_queue, cls.tokens_dict)

Note

For hardware failures, automatic fault recovery and node isolation can also be implemented by deploying on Kubernetes combined with Volcano.
