Fault Management Module for Worker Group Recovery in Training-Inference Separation Systems
Motivation
To effectively address the prevalent long-tail problem in mainstream post-training scenarios, VERL proposes a One-Step Off-Policy strategy. This approach asynchronously generates samples required for the next training iteration during model training, enabling efficient synergy between training and data generation.
However, distributed cluster training remains susceptible to software or hardware failures, which can lead to abrupt training interruptions and undermine the overall stability and reliability of the system. To enhance fault tolerance in large-scale training environments, this RFC proposes a comprehensive fault recovery framework tailored for reinforcement learning systems under a train-inference separation architecture. The solution enables rapid rescheduling and high-efficiency recovery after failures, ensuring continuous and robust training performance.
The figure above illustrates our approach to rapidly restarting the Rollout Worker Group when it fails during normal training. By leveraging weight synchronization and other key steps, the system ensures seamless resumption of training, maintaining workflow continuity and overall system reliability.
Proposed Design
Design Overview
During the re-scheduling process of the Worker Group, fault detection serves as a critical prerequisite for ensuring system reliability. To this end, the system must perform real-time monitoring of the operational status of both synchronous and asynchronous tasks, focusing on identifying anomalies in both types of tasks.
Furthermore, the system supports dynamic monitoring of fault recovery retry attempts. Users can flexibly configure the maximum number of retries based on specific scenarios, thereby preventing infinite retry loops that could lead to unnecessary resource consumption.
When a fault is detected, the system automatically triggers the re-scheduling process. This mechanism significantly enhances the system's ability to recover from transient or localized failures, improving the overall stability and resilience of the distributed training environment.
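As a rough illustration of this monitoring and retry-budget behaviour (not the actual verl implementation), the sketch below polls a worker group and stops recovering once a user-configured limit is reached; `check_worker_group`, `trigger_reschedule`, and `max_recovery_retries` are hypothetical names.

```python
import time


def monitor_worker_group(check_worker_group, trigger_reschedule,
                         max_recovery_retries: int = 3,
                         poll_interval_s: float = 30.0) -> None:
    """Poll a worker group's health and trigger rescheduling on failure.

    Hypothetical sketch: `check_worker_group` returns True while the group is
    healthy; `trigger_reschedule` performs one recovery attempt and returns
    True on success. A bounded retry budget prevents infinite recovery loops.
    """
    retries_used = 0
    while True:
        if check_worker_group():
            time.sleep(poll_interval_s)
            continue
        if retries_used >= max_recovery_retries:
            raise RuntimeError(
                f"worker group unrecoverable after {retries_used} retries")
        retries_used += 1
        recovered = trigger_reschedule()
        if not recovered:
            continue  # attempt failed; retry while budget remains
        # Recovery succeeded; the counter is kept so that repeated faults
        # within one run still respect the overall retry budget.
```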
We implemented fault recovery and reconstruction management by introducing a new FaultMgr class. Its fault-detection interfaces are:
- catch_rollout_fault
- catch_reward_fault
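A minimal skeleton of FaultMgr is sketched below; only the method names come from this RFC, while the signatures, attributes, and bodies are assumptions rather than the actual verl definitions.

```python
class FaultMgr:
    """Sketch of the fault-management entry points described in this RFC.

    Only the method names are taken from the RFC text; bodies and
    signatures are placeholders.
    """

    def __init__(self, max_task_retries: int = 3):
        self.max_task_retries = max_task_retries

    def catch_rollout_fault(self, exc: Exception) -> None:
        """Record and classify a failure raised by the rollout worker group."""
        raise NotImplementedError

    def catch_reward_fault(self, exc: Exception) -> None:
        """Record and classify a failure raised by the reward computation."""
        raise NotImplementedError

    def update_retry_options(self, **retry_options) -> None:
        """Configure Ray's task-level automatic rescheduling parameters."""
        raise NotImplementedError

    def rebuild_worker_group(self, role: str) -> None:
        """Proactively rebuild a failed worker group for the given role."""
        raise NotImplementedError
```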
When a Worker Group failure is detected, the system first attempts to leverage Ray's native automatic recovery mechanism to rapidly restore the failed group. If the group cannot be successfully reconstructed within the configured maximum number of retries (max_task_retries), the system triggers the proactive recovery process managed by FaultMgr, entering a finer-grained fault-management phase.
The specific recovery strategy is selected dynamically based on the fault scenario.
Through this hierarchical response and on-demand scheduling mechanism, the system achieves high recovery efficiency while maintaining optimal resource utilization and task consistency. This significantly enhances fault tolerance and system stability in large-scale distributed training environments.
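A minimal sketch of this hierarchical fallback, assuming the hypothetical FaultMgr interfaces sketched above: Ray's native retries (configured via max_restarts / max_task_retries) are exhausted first, and only a surviving RayActorError triggers the proactive FaultMgr path.

```python
import ray
from ray.exceptions import RayActorError


def call_with_recovery(actor, method_name: str, fault_mgr, *args):
    """Invoke an actor method, relying first on Ray's native retries.

    If the actor remains unrecoverable after Ray has exhausted its configured
    max_restarts / max_task_retries, fall back to FaultMgr's proactive
    rebuild. `catch_rollout_fault` and `rebuild_worker_group` refer to the
    hypothetical FaultMgr sketch above.
    """
    try:
        return ray.get(getattr(actor, method_name).remote(*args))
    except RayActorError as exc:
        fault_mgr.catch_rollout_fault(exc)
        fault_mgr.rebuild_worker_group(role="rollout")
        # The caller re-submits the pending work on the rebuilt group.
        raise
```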
Code Example
update_retry_options
Call the update_retry_options interface of FaultMgr to configure Ray's task-level automatic rescheduling parameters, enabling Ray's native task-rescheduling capability. Ray will then restart failed tasks with their original arguments at the time of failure.
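A hedged sketch of what this could look like using Ray's public actor options (max_restarts and max_task_retries are real Ray parameters; RolloutWorker and the wrapper function are illustrative stand-ins, not the verl implementation):

```python
import ray


@ray.remote
class RolloutWorker:
    def generate(self, prompts):
        ...  # placeholder for sample generation


def update_retry_options(max_restarts: int, max_task_retries: int):
    """Return an actor class configured with Ray's native retry options.

    max_restarts: how many times Ray may restart the actor process.
    max_task_retries: how many times Ray may re-run a failed actor task.
    """
    return RolloutWorker.options(
        max_restarts=max_restarts,
        max_task_retries=max_task_retries,
    )


# Usage: Ray transparently restarts the actor and replays failed tasks
# with their original arguments, up to the configured limits.
worker = update_retry_options(max_restarts=3, max_task_retries=3).remote()
```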
rebuild_worker_group
The worker group is rebuilt via rebuild_worker_group; for the detailed implementations of rebuild_resource_pool and sync_weight, please refer to the PR.
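At a high level, the flow could look like the sketch below; the resource-pool manager and sync_weight appear only as hypothetical arguments here, and the real implementations are in the PR.

```python
def rebuild_worker_group(role, resource_pool_mgr, healthy_groups, sync_weight):
    """High-level recovery flow for a failed worker group (sketch only).

    `resource_pool_mgr` and `sync_weight` stand in for the collaborators
    implemented in the PR (rebuild_resource_pool, weight synchronization);
    the call sequence is the point of this sketch, not the signatures.
    """
    resource_pool_mgr.release(role)                  # 1. clean up the failed pool
    new_group = resource_pool_mgr.rebuild(role)      # 2. rebuild by role type
    sync_weight(src_groups=healthy_groups, dst_group=new_group)  # 3. pull live weights
    return new_group
```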
When rebuilding the resource pool, first perform the cleanup, as sketched below.
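The actual cleanup function is in the PR; a plausible sketch using only public Ray APIs (ray.kill and remove_placement_group) might look like this, with the argument names being assumptions:

```python
import ray
from ray.util.placement_group import remove_placement_group


def cleanup_resource_pool(actors, placement_group):
    """Tear down a failed worker group's actors and release its bundles.

    `actors` is the list of (possibly dead) actor handles belonging to the
    failed group; `placement_group` is the Ray placement group that reserved
    its GPUs/CPUs. Both arguments are hypothetical, not verl's actual types.
    """
    for actor in actors:
        try:
            ray.kill(actor, no_restart=True)   # stop Ray from auto-restarting it
        except Exception:
            pass                               # the actor may already be gone
    remove_placement_group(placement_group)    # free the reserved resources
```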
After completing the cleanup, rebuild the resource pool according to role types, as sketched below.
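Again as an assumption-labelled sketch rather than the verl code, rebuilding by role could reserve a fresh placement group and respawn that role's workers onto it:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


def rebuild_resource_pool(role_cls, world_size: int, gpus_per_worker: int = 1):
    """Reserve fresh bundles and respawn `world_size` workers of one role.

    `role_cls` is a Ray actor class for the role being rebuilt (e.g. the
    rollout worker); the bundle layout is an assumption for illustration.
    """
    pg = placement_group(
        bundles=[{"GPU": gpus_per_worker, "CPU": 1} for _ in range(world_size)],
        strategy="PACK",
    )
    ray.get(pg.ready())  # block until the cluster can actually host the group

    workers = [
        role_cls.options(
            num_gpus=gpus_per_worker,
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=pg, placement_group_bundle_index=i),
        ).remote()
        for i in range(world_size)
    ]
    return pg, workers
```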
After completing the resource pool reconstruction, dynamically fetch weights from the normally running groups, as sketched below.
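The sketch below illustrates the idea; get_state_dict and load_state_dict are hypothetical remote methods, not confirmed verl APIs:

```python
import ray


def sync_weight(src_worker, dst_workers):
    """Copy current model weights from one healthy worker to rebuilt workers.

    Assumes each worker actor exposes `get_state_dict` / `load_state_dict`
    remote methods (hypothetical names). The weights are fetched once and
    broadcast to every rebuilt worker so no stale checkpoint is needed.
    """
    state_dict_ref = src_worker.get_state_dict.remote()
    # Pass the ObjectRef directly so Ray ships the weights to each worker
    # without materialising them on the driver.
    ray.get([w.load_state_dict.remote(state_dict_ref) for w in dst_workers])
```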
Failure Handling: An Actor Example
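As a stand-in example using only documented Ray behaviour, the sketch below shows an actor whose process dies surfacing a RayActorError to the caller, which is where the proactive recovery would be hooked in; the crash simulation and class name are assumptions.

```python
import os

import ray
from ray.exceptions import RayActorError


@ray.remote(max_restarts=0)           # let the failure surface instead of auto-restarting
class FlakyRolloutWorker:
    def generate(self):
        os._exit(1)                    # simulate a hard crash of the worker process


def run_with_fault_handling():
    worker = FlakyRolloutWorker.remote()
    try:
        return ray.get(worker.generate.remote())
    except RayActorError:
        # The actor died and Ray will not restart it (max_restarts=0);
        # this is where FaultMgr's proactive rebuild would be triggered.
        return None


if __name__ == "__main__":
    ray.init()
    run_with_fault_handling()
```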
Code Implementation for Function Timeout Detection
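A minimal timeout-detection sketch with Ray's public API (ray.get with a timeout raising GetTimeoutError, plus ray.cancel); the task, threshold, and usage below are illustrative assumptions:

```python
import time

import ray
from ray.exceptions import GetTimeoutError


@ray.remote
def maybe_hanging_task(sleep_s: float) -> str:
    time.sleep(sleep_s)                # stand-in for a rollout / reward call
    return "done"


def call_with_timeout(sleep_s: float, timeout_s: float = 5.0) -> bool:
    """Return True if the task finishes within `timeout_s`, else flag a fault."""
    ref = maybe_hanging_task.remote(sleep_s)
    try:
        ray.get(ref, timeout=timeout_s)
        return True
    except GetTimeoutError:
        # Treat the task as hung: cancel it and hand the fault to FaultMgr.
        ray.cancel(ref, force=True)
        return False


if __name__ == "__main__":
    ray.init()
    assert call_with_timeout(sleep_s=1.0) is True
    assert call_with_timeout(sleep_s=60.0) is False
```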