Motivation.
vLLM-Omni is designed for multi-modality model inference using a disaggregated pipeline approach. Currently, the system faces challenges in scalable deployment:
- Bottlenecks: A single API server or single stage instance can become a bottleneck under high load.
- Resource Utilization: Different stages (e.g., AR Decoder vs. Diffusion) have vastly different computational requirements. A one-to-one mapping often leads to some stages being over-provisioned while others are starved.
- Scalability: To support production-grade environments, the system must handle multiple instances of specific stages (Data Parallelism) distributed across single or multiple nodes.
Proposed Change.
The proposed solution introduces a distributed scheduling and routing mechanism that supports multiple instances (DP ranks) for each stage in the Omni Pipeline.
1. Distributed Deployment Structure
The architecture supports deploying multiple instances of OmniStage across nodes.
- API Servers: Multiple API servers can be spawned behind a shared socket (using SO_REUSEPORT or similar mechanism) to handle incoming HTTP requests concurrently.
- Stage Instances: Each logical stage (e.g., Stage 0: Thinker, Stage 1: Diffusion) can have multiple physical workers (DP ranks). For example, Stage 0 might have 2 instances (DP rank 0, 1) and Stage 1 might have 4 instances across different GPUs/Nodes.
- Coordinator: An Omni DP Coordinator maintains the global state, collects metrics, and coordinates request waves if Mixture-of-Experts (MoE) or similar complex routing is required.
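As a sketch of the structure above, the stage/DP-rank topology the coordinator tracks could be modeled as follows (the `StageInstance`/`OmniTopology` names and fields are illustrative assumptions, not part of vLLM-Omni's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class StageInstance:
    """One physical worker (DP rank) of a logical stage. Illustrative only."""
    stage_id: int
    dp_rank: int
    node: str            # hostname of the node hosting this rank
    gpu_ids: list[int]   # GPUs assigned to this rank

@dataclass
class OmniTopology:
    """Global view held by the coordinator: stage_id -> its DP ranks."""
    stages: dict[int, list[StageInstance]] = field(default_factory=dict)

    def add_instance(self, inst: StageInstance) -> None:
        self.stages.setdefault(inst.stage_id, []).append(inst)

    def dp_size(self, stage_id: int) -> int:
        return len(self.stages.get(stage_id, []))

# Example from the text: Stage 0 (Thinker) with 2 ranks,
# Stage 1 (Diffusion) with 4 ranks on another node.
topo = OmniTopology()
topo.add_instance(StageInstance(0, 0, "node-a", [0, 1]))
topo.add_instance(StageInstance(0, 1, "node-a", [2, 3]))
for rank in range(4):
    topo.add_instance(StageInstance(1, rank, "node-b", [rank]))
```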
2. Scheduling and Routing
To utilize the DP instances effectively, the system requires a routing mechanism (potentially evolving into an Omni Pipeline module):
- Task Dispatch: Incoming requests at the API Server/AsyncOmni layer need to be dispatched to specific stage instances.
- Load Balancing: The router or coordination layer will select instances based on load balancing algorithms (e.g., Round Robin, Random).
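The dispatch step could look like the following minimal sketch (the `StageRouter` class and instance names are hypothetical; it only illustrates the Round Robin and Random policies named above):

```python
import itertools
import random

class StageRouter:
    """Select a DP instance of one stage for each incoming task. Illustrative only."""

    def __init__(self, instances, policy="round_robin"):
        self.instances = list(instances)
        self.policy = policy
        self._rr = itertools.cycle(self.instances)  # round-robin iterator

    def select(self):
        if self.policy == "round_robin":
            return next(self._rr)
        if self.policy == "random":
            return random.choice(self.instances)
        raise ValueError(f"unknown policy: {self.policy}")

# Stage 0 with two DP ranks: round robin alternates between them.
router = StageRouter(["stage0-dp0", "stage0-dp1"])
picks = [router.select() for _ in range(4)]
```

A production router would also consume the load metrics collected by the coordinator; this sketch covers only the stateless policies.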
3. Command Line Interface (CLI) Design
Deployment should be simplified using a CLI that extends vLLM's standard arguments while adding Omni-specific flags for stage management.
Example Single Node Command:

```shell
# Start API server and AR stage (Stage 0) with DP=2
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve <model> --omni \
  --api-server-count 4 \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 2 \
  --tensor-parallel-size 2

# Start Diffusion stage (Stage 1) with DP=2
CUDA_VISIBLE_DEVICES=4,5 vllm serve <model> --omni \
  --headless \
  --stage-id 1 \
  --model-stage diffusion \
  --data-parallel-size 2
```

Example Single Node Deployment Diagram
Example Multi-Node Command: This involves specifying `data-parallel-address`, `data-parallel-rpc-port`, and `data-parallel-start-rank` to coordinate instances across nodes.
Example Multi-Node Deployment Diagram
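As an illustration of such a multi-node launch, the commands below mirror the single-node example and vLLM's existing multi-node data-parallel flags; the exact flag combination for Omni stages is an assumption, not a confirmed interface:

```shell
# Node 0 (head, 10.0.0.1): API servers + Stage 0 (Thinker), local DP ranks 0-1
CUDA_VISIBLE_DEVICES=0,1 vllm serve <model> --omni \
  --api-server-count 4 \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.0.0.1 \
  --data-parallel-rpc-port 13345

# Node 1 (headless): Stage 0, local DP ranks 2-3, joining the head node
CUDA_VISIBLE_DEVICES=0,1 vllm serve <model> --omni \
  --headless \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.0.0.1 \
  --data-parallel-rpc-port 13345
```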

Implementation Plan
- Stage-based Independent Deployment: Implement CLI parameters to start individual stages independently.
  - Single Node, Multiprocessing. [Feature] Support Stage Based Deployment CLI #939
  - Single Node, Multiprocessing, with Unix Domain Socket.
  - Multiple Node, Multiprocessing.
  - Ray distributed runtime.
- Stage-wise Data Parallelism (Multiple Instances): Use multiple instances for each stage to maximize overall throughput. @chickeyton
- Configuration Simplification: Only the model's pipeline-related configurations may be added; the configuration is not meant to be modified by users. @lishunyang12
  - Configure the model pipeline for each model.
  - Use dataclass definitions instead of kwargs.
  - Restrict the use of OmegaConf to yaml_util.py, which is responsible for parsing the YAML config file and translating configs into the corresponding dataclasses.
- Scheduling Algorithm: Implement the logic to distribute tasks among multiple instances of a single stage. @chickeyton
- Pipeline Construction: Add OmniPipeline to construct the computation flow.
- Todo: RAS
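To illustrate the configuration-simplification items above, here is a minimal sketch of a dataclass-based stage config with YAML-to-dataclass translation confined to a `yaml_util.py`-style helper (all names and fields are illustrative; the real dataclasses are defined by the project):

```python
# yaml_util.py sketch: parsing lives here; the rest of the code only
# sees typed dataclasses, never raw dicts or OmegaConf objects.
from dataclasses import dataclass, fields

@dataclass
class StageConfig:
    """Hypothetical per-stage pipeline config. Illustrative only."""
    stage_id: int
    model_stage: str              # e.g. "thinker" or "diffusion"
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1

def stage_config_from_dict(raw: dict) -> StageConfig:
    """Translate a parsed YAML mapping into the dataclass, rejecting unknown keys."""
    allowed = {f.name for f in fields(StageConfig)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown pipeline config keys: {sorted(unknown)}")
    return StageConfig(**raw)

# A mapping as it would come out of the YAML parser:
cfg = stage_config_from_dict(
    {"stage_id": 1, "model_stage": "diffusion", "data_parallel_size": 2}
)
```

Rejecting unknown keys keeps the config surface fixed, matching the intent that users do not extend it ad hoc.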
Future Todos
- External Pipeline orchestration
Reference doc.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.