121 changes: 121 additions & 0 deletions examples/online_serving/text_to_video/README.md
@@ -0,0 +1,121 @@
# Text-To-Video

This example demonstrates how to deploy the Wan2.2 text-to-video model for online video generation using vLLM-Omni.

## Start Server

### Basic Start

```bash
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091
```
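
Once the server is starting up, you can wait for it to become ready before sending requests. A minimal readiness check, assuming the server exposes vLLM's standard `/health` endpoint on the same port:

```bash
# Poll until the server responds; -f makes curl treat HTTP errors as failures.
until curl -sf http://localhost:8091/health > /dev/null; do
    echo "Waiting for server to become ready..."
    sleep 5
done
echo "Server is ready."
```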

### Start with Parameters

Or use the startup script:

```bash
bash run_server.sh
```

The script allows overriding the following environment variables (see the usage example after the list):
- `MODEL` (default: `Wan-AI/Wan2.2-T2V-A14B-Diffusers`)
- `PORT` (default: `8091`)
- `BOUNDARY_RATIO` (default: `0.875`)
- `FLOW_SHIFT` (default: `5.0`)
- `CACHE_BACKEND` (default: `none`)
- `ENABLE_CACHE_DIT_SUMMARY` (default: `0`)
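
Variables can be overridden inline when invoking the script; anything you leave unset keeps the default listed above:

```bash
# Serve the default model on a different port, leaving the other defaults untouched.
MODEL=Wan-AI/Wan2.2-T2V-A14B-Diffusers PORT=8000 bash run_server.sh
```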

## API Calls

### Using curl

```bash
# Basic text-to-video generation
bash run_curl_text_to_video.sh

# Or execute directly
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
    "negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
    "height": 480,
    "width": 832,
    "num_frames": 33,
    "fps": 16,
    "num_inference_steps": 40,
    "guidance_scale": 4.0,
    "guidance_scale_2": 4.0,
    "boundary_ratio": 0.875,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > wan22_output.mp4
```
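
If you want to keep the raw JSON response (for example to inspect errors or re-extract the video later), save it to a file first and decode it in a second step. This sketch assumes the same endpoint and the response shape shown under Response Format below:

```bash
# Save the raw response, then decode the first video in the payload.
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cinematic view of a futuristic city at sunset", "num_frames": 33, "fps": 16}' \
  -o response.json
jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
```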

## Request Format

### Simple Text Generation

```json
{
"prompt": "A cinematic view of a futuristic city at sunset"
}
```

### Generation with Parameters

```json
{
"prompt": "A cinematic view of a futuristic city at sunset",
"negative_prompt": "low quality, blurry, static",
"width": 832,
"height": 480,
"num_frames": 33,
"fps": 16,
"num_inference_steps": 40,
"guidance_scale": 4.0,
"guidance_scale_2": 4.0,
"boundary_ratio": 0.875,
"flow_shift": 5.0,
"seed": 42
}
```
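
A request body like the one above can also be kept in a file and passed to curl with `-d @`, which avoids shell quoting issues; the file name here is arbitrary:

```bash
# Assumes the JSON body above is saved as request.json
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d @request.json \
  -o response.json
```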

## Generation Parameters

| Parameter | Type | Default | Description |
| --------------------- | ------ | ------- | ------------------------------------------------ |
| `prompt` | str | - | Text description of the desired video |
| `negative_prompt`     | str    | None    | Negative prompt describing content to avoid      |
| `n` | int | 1 | Number of videos to generate |
| `size` | str | None | Video size, e.g. `"832x480"` |
| `width` | int | None | Video width in pixels |
| `height` | int | None | Video height in pixels |
| `num_frames` | int | None | Number of frames to generate |
| `fps` | int | None | Frames per second for output video |
| `num_inference_steps` | int | None | Number of denoising steps |
| `guidance_scale` | float | None | CFG guidance scale (low-noise stage) |
| `guidance_scale_2` | float | None | CFG guidance scale (high-noise stage, Wan2.2) |
| `boundary_ratio` | float | None | Boundary split ratio for low/high DiT (Wan2.2) |
| `flow_shift` | float | None | Scheduler flow shift (Wan2.2) |
| `seed` | int | None | Random seed (reproducible) |
| `lora` | object | None | LoRA configuration |
| `extra_body` | object | None | Model-specific extra parameters |
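
Per the table, `size` is a shorthand for `width`/`height` and `n` requests several videos in one call. A minimal sketch combining the two (worth verifying against your deployment, since not every model honours every field):

```bash
# Request two videos using the size shorthand instead of explicit width/height.
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cinematic view of a futuristic city at sunset",
    "size": "832x480",
    "n": 2,
    "seed": 42
  }' -o response.json
```

Each generated video would then appear as its own entry in `data`, so `.data[1].b64_json` selects the second clip.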

## Response Format

```json
{
"created": 1234567890,
"data": [
{ "b64_json": "<base64-mp4>" }
]
}
```

## Extract Video

```bash
# Extract base64 from a saved JSON response and decode it to a video file
jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
```
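
On failure the endpoint returns an error body and a 4xx/5xx status instead of a `data` array, so decoding blindly produces an empty or broken file. A small guard, assuming the response was saved as `response.json`:

```bash
# Only decode if the expected field is present; otherwise print the error body.
if jq -e '.data[0].b64_json' response.json > /dev/null 2>&1; then
    jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
else
    echo "Generation failed:" >&2
    cat response.json >&2
fi
```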
23 changes: 23 additions & 0 deletions examples/online_serving/text_to_video/run_curl_text_to_video.sh
@@ -0,0 +1,23 @@
#!/bin/bash
# Wan2.2 text-to-video curl example

OUTPUT_PATH="wan22_output.mp4"

curl -X POST http://localhost:8091/v1/videos/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
"negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
"height": 480,
"width": 832,
"num_frames": 33,
"fps": 16,
"num_inference_steps": 40,
"guidance_scale": 4.0,
"guidance_scale_2": 4.0,
"boundary_ratio": 0.875,
"flow_shift": 5.0,
"seed": 42
}' | jq -r '.data[0].b64_json' | base64 -d > "${OUTPUT_PATH}"

echo "Saved video to ${OUTPUT_PATH}"
31 changes: 31 additions & 0 deletions examples/online_serving/text_to_video/run_server.sh
@@ -0,0 +1,31 @@
#!/bin/bash
# Wan2.2 online serving startup script

MODEL="${MODEL:-Wan-AI/Wan2.2-T2V-A14B-Diffusers}"
PORT="${PORT:-8091}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-0.875}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"

echo "Starting Wan2.2 server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Boundary ratio: $BOUNDARY_RATIO"
echo "Flow shift: $FLOW_SHIFT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
echo "Cache-DiT summary: enabled"
fi

CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi

vllm serve "$MODEL" --omni \
--port "$PORT" \
--boundary-ratio "$BOUNDARY_RATIO" \
--flow-shift "$FLOW_SHIFT" \
$CACHE_BACKEND_FLAG \
$(if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then echo "--enable-cache-dit-summary"; fi)
13 changes: 9 additions & 4 deletions vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
@@ -358,6 +358,9 @@ def forward(
        self._guidance_scale = guidance_low
        self._guidance_scale_2 = guidance_high

        # Prefer engine-configured boundary_ratio, but allow per-request fallback.
        boundary_ratio = self.boundary_ratio if self.boundary_ratio is not None else req.sampling_params.boundary_ratio

        # validate shapes
        self.check_inputs(
            prompt=prompt,
@@ -366,7 +369,8 @@
            width=width,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            guidance_scale_2=guidance_high if self.boundary_ratio is not None else None,
            guidance_scale_2=guidance_high if boundary_ratio is not None else None,
            boundary_ratio=boundary_ratio,
        )

        if num_frames % self.vae_scale_factor_temporal != 1:
@@ -407,8 +411,8 @@ def forward(
        timesteps = self.scheduler.timesteps
        self._num_timesteps = len(timesteps)
        boundary_timestep = None
        if self.boundary_ratio is not None:
            boundary_timestep = self.boundary_ratio * self.scheduler.config.num_train_timesteps
        if boundary_ratio is not None:
            boundary_timestep = boundary_ratio * self.scheduler.config.num_train_timesteps

        # Handle I2V mode when expand_timesteps=True and image is provided
        multi_modal_data = req.prompts[0].get("multi_modal_data", {}) if not isinstance(req.prompts[0], str) else None
@@ -695,6 +699,7 @@ def check_inputs(
        prompt_embeds=None,
        negative_prompt_embeds=None,
        guidance_scale_2=None,
        boundary_ratio=None,
    ):
        if height % 16 != 0 or width % 16 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")
@@ -721,5 +726,5 @@
        ):
            raise ValueError(f"`negative_prompt` has to be of type `str` or `list` but is {type(negative_prompt)}")

        if self.boundary_ratio is None and guidance_scale_2 is not None:
        if boundary_ratio is None and guidance_scale_2 is not None:
            raise ValueError("`guidance_scale_2` is only supported when `boundary_ratio` is set.")
51 changes: 51 additions & 0 deletions vllm_omni/entrypoints/openai/api_server.py
@@ -79,8 +79,13 @@
    ImageGenerationRequest,
    ImageGenerationResponse,
)
from vllm_omni.entrypoints.openai.protocol.videos import (
    VideoGenerationRequest,
    VideoGenerationResponse,
)
from vllm_omni.entrypoints.openai.serving_chat import OmniOpenAIServingChat
from vllm_omni.entrypoints.openai.serving_speech import OmniOpenAIServingSpeech
from vllm_omni.entrypoints.openai.serving_video import OmniOpenAIServingVideo
from vllm_omni.inputs.data import OmniDiffusionSamplingParams, OmniSamplingParams, OmniTextPrompt
from vllm_omni.lora.request import LoRARequest
from vllm_omni.lora.utils import stable_lora_int_id
@@ -373,6 +378,12 @@ async def omni_init_app_state(
            diffusion_engine=engine_client,  # type: ignore
            model_name=model_name,
        )
        diffusion_stage_configs = engine_client.stage_configs if hasattr(engine_client, "stage_configs") else None
        state.openai_serving_video = OmniOpenAIServingVideo.for_diffusion(
            diffusion_engine=engine_client,  # type: ignore
            model_name=model_name,
            stage_configs=diffusion_stage_configs,
        )

        state.enable_server_load_tracking = getattr(args, "enable_server_load_tracking", False)
        state.server_load_metrics = 0
@@ -655,6 +666,11 @@ async def omni_init_app_state(
        state.openai_serving_speech = OmniOpenAIServingSpeech(
            engine_client, state.openai_serving_models, request_logger=request_logger
        )
        state.openai_serving_video = OmniOpenAIServingVideo(
            engine_client,
            model_name=served_model_names[0] if served_model_names else None,
            stage_configs=state.stage_configs,
        )

        state.enable_server_load_tracking = args.enable_server_load_tracking
        state.server_load_metrics = 0
@@ -668,6 +684,10 @@ def Omnispeech(request: Request) -> OmniOpenAIServingSpeech | None:
    return request.app.state.openai_serving_speech


def Omnivideo(request: Request) -> OmniOpenAIServingVideo | None:
    return request.app.state.openai_serving_video


@router.post(
    "/v1/chat/completions",
    dependencies=[Depends(validate_json_request)],
@@ -1065,3 +1085,34 @@ async def generate_images(request: ImageGenerationRequest, raw_request: Request)
        raise HTTPException(
            status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value, detail=f"Image generation failed: {str(e)}"
        )


@router.post(
    "/v1/videos/generations",
    dependencies=[Depends(validate_json_request)],
    responses={
        HTTPStatus.OK.value: {"model": VideoGenerationResponse},
        HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
        HTTPStatus.SERVICE_UNAVAILABLE.value: {"model": ErrorResponse},
        HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
    },
)
async def generate_videos(request: VideoGenerationRequest, raw_request: Request) -> VideoGenerationResponse:
    """Generate videos from text prompts using diffusion models."""
    handler = Omnivideo(raw_request)
    if handler is None:
        raise HTTPException(
            status_code=HTTPStatus.SERVICE_UNAVAILABLE.value,
            detail="Video generation handler not initialized.",
        )
    logger.info("Video generation handler: %s", type(handler).__name__)
    try:
        return await handler.generate_videos(request, raw_request)
    except HTTPException:
        raise
    except Exception as e:
        logger.exception("Video generation failed: %s", e)
        raise HTTPException(
            status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value,
            detail=f"Video generation failed: {str(e)}",
        )
10 changes: 10 additions & 0 deletions vllm_omni/entrypoints/openai/protocol/__init__.py
@@ -8,11 +8,21 @@
    ImageGenerationResponse,
    ResponseFormat,
)
from vllm_omni.entrypoints.openai.protocol.videos import (
    VideoData,
    VideoGenerationRequest,
    VideoGenerationResponse,
    VideoResponseFormat,
)

__all__ = [
    "ImageData",
    "ImageGenerationRequest",
    "ImageGenerationResponse",
    "ResponseFormat",
    "VideoData",
    "VideoGenerationRequest",
    "VideoGenerationResponse",
    "VideoResponseFormat",
    "OmniChatCompletionStreamResponse",
]