Fault Management Module for Worker Group Recovery in Training-Inference Separation Systems
Motivation
To effectively address the prevalent long-tail problem in mainstream post-training scenarios, VERL proposes a One-Step Off-Policy strategy. This approach asynchronously generates samples required for the next training iteration during model training, enabling efficient synergy between training and data generation.
However, distributed cluster training remains susceptible to software or hardware failures, which can lead to abrupt training interruptions and undermine the overall stability and reliability of the system. To enhance fault tolerance in large-scale training environments, this RFC proposes a comprehensive fault recovery framework tailored for reinforcement learning systems under a train-inference separation architecture. The solution enables rapid rescheduling and high-efficiency recovery after failures, ensuring continuous and robust training performance.
The figure above illustrates our approach to rapidly restarting the Rollout Worker Group when it fails during normal training. By leveraging weight synchronization and other key steps, the system ensures seamless resumption of training, maintaining workflow continuity and overall system reliability.
Proposed Design
Design Overview
During the re-scheduling process of the Worker Group, fault detection serves as a critical prerequisite for ensuring system reliability. To this end, the system must perform real-time monitoring of the operational status of both synchronous and asynchronous tasks, focusing on identifying anomalies in both types of tasks.
Furthermore, the system supports dynamic monitoring of fault recovery retry attempts. Users can flexibly configure the maximum number of retries based on specific scenarios, thereby preventing infinite retry loops that could lead to unnecessary resource consumption.
When a fault is detected, the system automatically triggers the re-scheduling process. This mechanism significantly enhances the system's ability to recover from transient or localized failures, improving the overall stability and resilience of the distributed training environment.
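As a rough illustration of this monitoring and retry-budget behaviour (not the actual verl implementation), the sketch below polls a worker group and stops recovering once a user-configured limit is reached; `check_worker_group`, `trigger_reschedule`, and `max_recovery_retries` are hypothetical names.

```python
import time


def monitor_worker_group(check_worker_group, trigger_reschedule,
                         max_recovery_retries: int = 3,
                         poll_interval_s: float = 30.0) -> None:
    """Poll a worker group's health and trigger rescheduling on failure.

    Hypothetical sketch: `check_worker_group` returns True while the group is
    healthy; `trigger_reschedule` performs one recovery attempt and returns
    True on success. A bounded retry budget prevents infinite recovery loops.
    """
    retries_used = 0
    while True:
        if check_worker_group():
            time.sleep(poll_interval_s)
            continue
        if retries_used >= max_recovery_retries:
            raise RuntimeError(
                f"worker group unrecoverable after {retries_used} retries")
        retries_used += 1
        recovered = trigger_reschedule()
        if not recovered:
            continue  # attempt failed; retry while budget remains
        # Recovery succeeded; the counter is kept so that repeated faults
        # within one run still respect the overall retry budget.
```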
We implemented fault recovery and reconstruction management by introducing a new FaultMgr class. Its fault-detection interfaces are:
- catch_rollout_fault
- catch_reward_fault
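A minimal skeleton of FaultMgr is sketched below; only the method names come from this RFC, while the signatures, attributes, and bodies are assumptions rather than the actual verl definitions.

```python
class FaultMgr:
    """Sketch of the fault-management entry points described in this RFC.

    Only the method names are taken from the RFC text; bodies and
    signatures are placeholders.
    """

    def __init__(self, max_task_retries: int = 3):
        self.max_task_retries = max_task_retries

    def catch_rollout_fault(self, exc: Exception) -> None:
        """Record and classify a failure raised by the rollout worker group."""
        raise NotImplementedError

    def catch_reward_fault(self, exc: Exception) -> None:
        """Record and classify a failure raised by the reward computation."""
        raise NotImplementedError

    def update_retry_options(self, **retry_options) -> None:
        """Configure Ray's task-level automatic rescheduling parameters."""
        raise NotImplementedError

    def rebuild_worker_group(self, role: str) -> None:
        """Proactively rebuild a failed worker group for the given role."""
        raise NotImplementedError
```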
When a Worker Group failure is detected, the system first attempts to leverage Ray's native automatic recovery mechanism to rapidly restore the failed group. If the group cannot be successfully reconstructed within the configured maximum number of retries (max_task_retries), the system triggers the proactive recovery process managed by FaultMgr, entering a finer-grained fault-management phase.
The specific recovery strategy is selected dynamically based on the fault scenario.
Through this hierarchical response and on-demand scheduling mechanism, the system achieves high recovery efficiency while maintaining optimal resource utilization and task consistency. This significantly enhances fault tolerance and system stability in large-scale distributed training environments.
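A minimal sketch of this hierarchical fallback, assuming the hypothetical FaultMgr interfaces sketched above: Ray's native retries (configured via max_restarts / max_task_retries) are exhausted first, and only a surviving RayActorError triggers the proactive FaultMgr path.

```python
import ray
from ray.exceptions import RayActorError


def call_with_recovery(actor, method_name: str, fault_mgr, *args):
    """Invoke an actor method, relying first on Ray's native retries.

    If the actor remains unrecoverable after Ray has exhausted its configured
    max_restarts / max_task_retries, fall back to FaultMgr's proactive
    rebuild. `catch_rollout_fault` and `rebuild_worker_group` refer to the
    hypothetical FaultMgr sketch above.
    """
    try:
        return ray.get(getattr(actor, method_name).remote(*args))
    except RayActorError as exc:
        fault_mgr.catch_rollout_fault(exc)
        fault_mgr.rebuild_worker_group(role="rollout")
        # The caller re-submits the pending work on the rebuilt group.
        raise
```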
Code Example
update_retry_options
Call the update_retry_options interface of FaultMgr to configure Ray's task-level automatic rescheduling parameters, enabling Ray's native task-rescheduling capability. Ray will then restart failed tasks with their original arguments at the time of failure.
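A hedged sketch of what this could look like using Ray's public actor options (max_restarts and max_task_retries are real Ray parameters; RolloutWorker and the wrapper function are illustrative stand-ins, not the verl implementation):

```python
import ray


@ray.remote
class RolloutWorker:
    def generate(self, prompts):
        ...  # placeholder for sample generation


def update_retry_options(max_restarts: int, max_task_retries: int):
    """Return an actor class configured with Ray's native retry options.

    max_restarts: how many times Ray may restart the actor process.
    max_task_retries: how many times Ray may re-run a failed actor task.
    """
    return RolloutWorker.options(
        max_restarts=max_restarts,
        max_task_retries=max_task_retries,
    )


# Usage: Ray transparently restarts the actor and replays failed tasks
# with their original arguments, up to the configured limits.
worker = update_retry_options(max_restarts=3, max_task_retries=3).remote()
```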
rebuild_worker_group
The worker group is rebuilt via rebuild_worker_group; for the detailed implementations of rebuild_resource_pool and sync_weight, please refer to the PR.
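At a high level, the flow could look like the sketch below; the resource-pool manager and sync_weight appear only as hypothetical arguments here, and the real implementations are in the PR.

```python
def rebuild_worker_group(role, resource_pool_mgr, healthy_groups, sync_weight):
    """High-level recovery flow for a failed worker group (sketch only).

    `resource_pool_mgr` and `sync_weight` stand in for the collaborators
    implemented in the PR (rebuild_resource_pool, weight synchronization);
    the call sequence is the point of this sketch, not the signatures.
    """
    resource_pool_mgr.release(role)                  # 1. clean up the failed pool
    new_group = resource_pool_mgr.rebuild(role)      # 2. rebuild by role type
    sync_weight(src_groups=healthy_groups, dst_group=new_group)  # 3. pull live weights
    return new_group
```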
When rebuilding the resource pool, first perform the cleanup, as sketched below.
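The actual cleanup function is in the PR; a plausible sketch using only public Ray APIs (ray.kill and remove_placement_group) might look like this, with the argument names being assumptions:

```python
import ray
from ray.util.placement_group import remove_placement_group


def cleanup_resource_pool(actors, placement_group):
    """Tear down a failed worker group's actors and release its bundles.

    `actors` is the list of (possibly dead) actor handles belonging to the
    failed group; `placement_group` is the Ray placement group that reserved
    its GPUs/CPUs. Both arguments are hypothetical, not verl's actual types.
    """
    for actor in actors:
        try:
            ray.kill(actor, no_restart=True)   # stop Ray from auto-restarting it
        except Exception:
            pass                               # the actor may already be gone
    remove_placement_group(placement_group)    # free the reserved resources
```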
After completing the cleanup, rebuild the resource pool according to role types, as sketched below.
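Again as an assumption-labelled sketch rather than the verl code, rebuilding by role could reserve a fresh placement group and respawn that role's workers onto it:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


def rebuild_resource_pool(role_cls, world_size: int, gpus_per_worker: int = 1):
    """Reserve fresh bundles and respawn `world_size` workers of one role.

    `role_cls` is a Ray actor class for the role being rebuilt (e.g. the
    rollout worker); the bundle layout is an assumption for illustration.
    """
    pg = placement_group(
        bundles=[{"GPU": gpus_per_worker, "CPU": 1} for _ in range(world_size)],
        strategy="PACK",
    )
    ray.get(pg.ready())  # block until the cluster can actually host the group

    workers = [
        role_cls.options(
            num_gpus=gpus_per_worker,
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i),
        ).remote()
        for i in range(world_size)
    ]
    return pg, workers
```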
After completing the resource pool reconstruction, dynamically fetch weights from the normally running groups, as sketched below.
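The sketch below illustrates the idea; get_state_dict and load_state_dict are hypothetical remote methods, not confirmed verl APIs:

```python
import ray


def sync_weight(src_worker, dst_workers):
    """Copy current model weights from one healthy worker to rebuilt workers.

    Assumes each worker actor exposes `get_state_dict` / `load_state_dict`
    remote methods (hypothetical names). The weights are fetched once and
    broadcast to every rebuilt worker so no stale checkpoint is needed.
    """
    state_dict_ref = src_worker.get_state_dict.remote()
    # Pass the ObjectRef directly so Ray ships the weights to each worker
    # without materialising them on the driver.
    ray.get([w.load_state_dict.remote(state_dict_ref) for w in dst_workers])
```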
Failure Handling: An Actor Example
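As a stand-in example using only documented Ray behaviour, the sketch below shows an actor whose process dies surfacing a RayActorError to the caller, which is where the proactive recovery would be hooked in; the crash simulation and class name are assumptions.

```python
import os

import ray
from ray.exceptions import RayActorError


@ray.remote(max_restarts=0)           # let the failure surface instead of auto-restarting
class FlakyRolloutWorker:
    def generate(self):
        os._exit(1)                    # simulate a hard crash of the worker process


def run_with_fault_handling():
    worker = FlakyRolloutWorker.remote()
    try:
        return ray.get(worker.generate.remote())
    except RayActorError:
        # The actor died and Ray will not restart it (max_restarts=0);
        # this is where FaultMgr's proactive rebuild would be triggered.
        return None


if __name__ == "__main__":
    ray.init()
    run_with_fault_handling()
```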
Code Implementation for Function Timeout Detection
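A minimal timeout-detection sketch with Ray's public API (ray.get with a timeout raising GetTimeoutError, plus ray.cancel); the task, threshold, and usage below are illustrative assumptions:

```python
import time

import ray
from ray.exceptions import GetTimeoutError


@ray.remote
def maybe_hanging_task(sleep_s: float) -> str:
    time.sleep(sleep_s)                # stand-in for a rollout / reward call
    return "done"


def call_with_timeout(sleep_s: float, timeout_s: float = 5.0) -> bool:
    """Return True if the task finishes within `timeout_s`, else flag a fault."""
    ref = maybe_hanging_task.remote(sleep_s)
    try:
        ray.get(ref, timeout=timeout_s)
        return True
    except GetTimeoutError:
        # Treat the task as hung: cancel it and hand the fault to FaultMgr.
        ray.cancel(ref, force=True)
        return False


if __name__ == "__main__":
    ray.init()
    assert call_with_timeout(sleep_s=1.0) is True
    assert call_with_timeout(sleep_s=60.0) is False
```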