Motivation.
vLLM-Omni is designed for multi-modality model inference using a disaggregated pipeline approach. Currently, the system faces challenges in scalable deployment:
- Bottlenecks: A single API server or single stage instance can become a bottleneck under high load.
- Resource Utilization: Different stages (e.g., AR Decoder vs. Diffusion) have vastly different computational requirements. A one-to-one mapping often leads to some stages being over-provisioned while others are starved.
- Scalability: To support production-grade environments, the system must handle multiple instances of specific stages (Data Parallelism) distributed across single or multiple nodes.
Proposed Change.
The proposed solution introduces a distributed scheduling and routing mechanism that supports multiple instances (DP ranks) for each stage in the Omni Pipeline.
1. Distributed Deployment Structure
The architecture supports deploying multiple instances of OmniStage across nodes.
- API Servers: Multiple API servers can be spawned behind a shared socket (using SO_REUSEPORT or similar mechanism) to handle incoming HTTP requests concurrently.
- Stage Instances: Each logical stage (e.g., Stage 0: Thinker, Stage 1: Diffusion) can have multiple physical workers (DP ranks). For example, Stage 0 might have 2 instances (DP rank 0, 1) and Stage 1 might have 4 instances across different GPUs/Nodes.
- Coordinator: An Omni DP Coordinator maintains the global state, collects metrics, and coordinates request waves if Mixture-of-Experts (MoE) or similar complex routing is required.
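As a sketch of the structure above, the stage/DP-rank topology the coordinator tracks could be modeled as follows (the `StageInstance`/`OmniTopology` names and fields are illustrative assumptions, not part of vLLM-Omni's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class StageInstance:
    """One physical worker (DP rank) of a logical stage. Illustrative only."""
    stage_id: int
    dp_rank: int
    node: str            # hostname of the node hosting this rank
    gpu_ids: list[int]   # GPUs assigned to this rank

@dataclass
class OmniTopology:
    """Global view held by the coordinator: stage_id -> its DP ranks."""
    stages: dict[int, list[StageInstance]] = field(default_factory=dict)

    def add_instance(self, inst: StageInstance) -> None:
        self.stages.setdefault(inst.stage_id, []).append(inst)

    def dp_size(self, stage_id: int) -> int:
        return len(self.stages.get(stage_id, []))

# Example from the text: Stage 0 (Thinker) with 2 ranks,
# Stage 1 (Diffusion) with 4 ranks on another node.
topo = OmniTopology()
topo.add_instance(StageInstance(0, 0, "node-a", [0, 1]))
topo.add_instance(StageInstance(0, 1, "node-a", [2, 3]))
for rank in range(4):
    topo.add_instance(StageInstance(1, rank, "node-b", [rank]))
```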
2. Scheduling and Routing
To utilize the DP instances effectively, the system requires a routing mechanism (potentially evolving into an Omni Pipeline module):
- Task Dispatch: Incoming requests at the API Server/AsyncOmni layer need to be dispatched to specific stage instances.
- Load Balancing: The router or coordination layer will select instances based on load balancing algorithms (e.g., Round Robin, Random).
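The dispatch step could look like the following minimal sketch (the `StageRouter` class and instance names are hypothetical; it only illustrates the Round Robin and Random policies named above):

```python
import itertools
import random

class StageRouter:
    """Select a DP instance of one stage for each incoming task. Illustrative only."""

    def __init__(self, instances, policy="round_robin"):
        self.instances = list(instances)
        self.policy = policy
        self._rr = itertools.cycle(self.instances)  # round-robin iterator

    def select(self):
        if self.policy == "round_robin":
            return next(self._rr)
        if self.policy == "random":
            return random.choice(self.instances)
        raise ValueError(f"unknown policy: {self.policy}")

# Stage 0 with two DP ranks: round robin alternates between them.
router = StageRouter(["stage0-dp0", "stage0-dp1"])
picks = [router.select() for _ in range(4)]
```

A production router would also consume the load metrics collected by the coordinator; this sketch covers only the stateless policies.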
3. Command Line Interface (CLI) Design
Deployment should be simplified using a CLI that extends vLLM's standard arguments while adding Omni-specific flags for stage management.
Example Single Node Command:

```shell
# Start API server and AR stage (Stage 0) with DP=2
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve <model> --omni \
  --api-server-count 4 \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 2 \
  --tensor-parallel-size 2

# Start Diffusion stage (Stage 1) with DP=2
CUDA_VISIBLE_DEVICES=4,5 vllm serve <model> --omni \
  --headless \
  --stage-id 1 \
  --model-stage diffusion \
  --data-parallel-size 2
```

Example Single Node Deployment Diagram
Example Multi-Node Command: This involves specifying `data-parallel-address`, `data-parallel-rpc-port`, and `data-parallel-start-rank` to coordinate instances across nodes.
Example Multi-Node Deployment Diagram
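As an illustration of such a multi-node launch, the commands below mirror the single-node example and vLLM's existing multi-node data-parallel flags; the exact flag combination for Omni stages is an assumption, not a confirmed interface:

```shell
# Node 0 (head, 10.0.0.1): API servers + Stage 0 (Thinker), local DP ranks 0-1
CUDA_VISIBLE_DEVICES=0,1 vllm serve <model> --omni \
  --api-server-count 4 \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.0.0.1 \
  --data-parallel-rpc-port 13345

# Node 1 (headless): Stage 0, local DP ranks 2-3, joining the head node
CUDA_VISIBLE_DEVICES=0,1 vllm serve <model> --omni \
  --headless \
  --stage-id 0 \
  --model-stage thinker \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.0.0.1 \
  --data-parallel-rpc-port 13345
```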

Implementation Plan
- Stage-based Independent Deployment: Implement CLI parameters to start individual stages independently.
  - Single Node, Multiprocessing. [Feature] Support Stage Based Deployment CLI #939
  - Single Node, Multiprocessing, with Unix Domain Socket.
  - Multiple Node, Multiprocessing.
  - Ray distributed runtime.
- Stage-wise Data Parallelism (Multiple Instances): Use multiple instances for each stage to maximize overall throughput. @chickeyton
- Configuration Simplification: Only the model's pipeline-related configurations may be added; the configuration is not meant to be modified by users. @lishunyang12
  - Configure the model pipeline for each model.
  - Use dataclass definitions instead of kwargs.
  - Restrict the use of OmegaConf to yaml_util.py, which is responsible for parsing the YAML config file and translating configs into the corresponding dataclasses.
- Scheduling Algorithm: Implement the logic to distribute tasks among multiple instances of a single stage. @chickeyton
- Pipeline Construction: Add OmniPipeline to construct the computation flow.
- Todo: RAS
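To illustrate the configuration-simplification items above, here is a minimal sketch of a dataclass-based stage config with YAML-to-dataclass translation confined to a `yaml_util.py`-style helper (all names and fields are illustrative; the real dataclasses are defined by the project):

```python
# yaml_util.py sketch: parsing lives here; the rest of the code only
# sees typed dataclasses, never raw dicts or OmegaConf objects.
from dataclasses import dataclass, fields

@dataclass
class StageConfig:
    """Hypothetical per-stage pipeline config. Illustrative only."""
    stage_id: int
    model_stage: str              # e.g. "thinker" or "diffusion"
    data_parallel_size: int = 1
    tensor_parallel_size: int = 1

def stage_config_from_dict(raw: dict) -> StageConfig:
    """Translate a parsed YAML mapping into the dataclass, rejecting unknown keys."""
    allowed = {f.name for f in fields(StageConfig)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown pipeline config keys: {sorted(unknown)}")
    return StageConfig(**raw)

# A mapping as it would come out of the YAML parser:
cfg = stage_config_from_dict(
    {"stage_id": 1, "model_stage": "diffusion", "data_parallel_size": 2}
)
```

Rejecting unknown keys keeps the config surface fixed, matching the intent that users do not extend it ad hoc.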
Future Todos
- External Pipeline orchestration
Reference doc.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.