121 changes: 121 additions & 0 deletions examples/online_serving/text_to_video/README.md
@@ -0,0 +1,121 @@
# Text-To-Video

This example demonstrates how to deploy the Wan2.2 text-to-video model for online video generation using vLLM-Omni.

## Start Server

### Basic Start

```bash
vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091
```
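
Once the server is starting up, you can wait for it to become ready before sending requests. A minimal readiness check, assuming the server exposes vLLM's standard `/health` endpoint on the same port:

```bash
# Poll until the server responds; -f makes curl treat HTTP errors as failures.
until curl -sf http://localhost:8091/health > /dev/null; do
    echo "Waiting for server to become ready..."
    sleep 5
done
echo "Server is ready."
```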

### Start with Parameters

Or use the startup script:

```bash
bash run_server.sh
```

The script allows overriding the following environment variables (see the usage example after the list):
- `MODEL` (default: `Wan-AI/Wan2.2-T2V-A14B-Diffusers`)
- `PORT` (default: `8091`)
- `BOUNDARY_RATIO` (default: `0.875`)
- `FLOW_SHIFT` (default: `5.0`)
- `CACHE_BACKEND` (default: `none`)
- `ENABLE_CACHE_DIT_SUMMARY` (default: `0`)
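
Variables can be overridden inline when invoking the script; anything you leave unset keeps the default listed above:

```bash
# Serve the default model on a different port, leaving the other defaults untouched.
MODEL=Wan-AI/Wan2.2-T2V-A14B-Diffusers PORT=8000 bash run_server.sh
```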

## API Calls

### Using curl

```bash
# Basic text-to-video generation
bash run_curl_text_to_video.sh

# Or execute directly
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
    "negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
    "height": 480,
    "width": 832,
    "num_frames": 33,
    "fps": 16,
    "num_inference_steps": 40,
    "guidance_scale": 4.0,
    "guidance_scale_2": 4.0,
    "boundary_ratio": 0.875,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > wan22_output.mp4
```
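
If you want to keep the raw JSON response (for example to inspect errors or re-extract the video later), save it to a file first and decode it in a second step. This sketch assumes the same endpoint and the response shape shown under Response Format below:

```bash
# Save the raw response, then decode the first video in the payload.
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A cinematic view of a futuristic city at sunset", "num_frames": 33, "fps": 16}' \
  -o response.json
jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
```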

## Request Format

### Simple Text Generation

```json
{
"prompt": "A cinematic view of a futuristic city at sunset"
}
```

### Generation with Parameters

```json
{
"prompt": "A cinematic view of a futuristic city at sunset",
"negative_prompt": "low quality, blurry, static",
"width": 832,
"height": 480,
"num_frames": 33,
"fps": 16,
"num_inference_steps": 40,
"guidance_scale": 4.0,
"guidance_scale_2": 4.0,
"boundary_ratio": 0.875,
"flow_shift": 5.0,
"seed": 42
}
```
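
A request body like the one above can also be kept in a file and passed to curl with `-d @`, which avoids shell quoting issues; the file name here is arbitrary:

```bash
# Assumes the JSON body above is saved as request.json
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d @request.json \
  -o response.json
```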

## Generation Parameters

| Parameter | Type | Default | Description |
| --------------------- | ------ | ------- | ------------------------------------------------ |
| `prompt` | str | - | Text description of the desired video |
| `negative_prompt`     | str    | None    | Negative prompt describing content to avoid      |
| `n` | int | 1 | Number of videos to generate |
| `size` | str | None | Video size, e.g. `"832x480"` |
| `width` | int | None | Video width in pixels |
| `height` | int | None | Video height in pixels |
| `num_frames` | int | None | Number of frames to generate |
| `fps` | int | None | Frames per second for output video |
| `num_inference_steps` | int | None | Number of denoising steps |
| `guidance_scale` | float | None | CFG guidance scale (low-noise stage) |
| `guidance_scale_2` | float | None | CFG guidance scale (high-noise stage, Wan2.2) |
| `boundary_ratio` | float | None | Boundary split ratio for low/high DiT (Wan2.2) |
| `flow_shift` | float | None | Scheduler flow shift (Wan2.2) |
| `seed` | int | None | Random seed (reproducible) |
| `lora` | object | None | LoRA configuration |
| `extra_body` | object | None | Model-specific extra parameters |
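
Per the table, `size` is a shorthand for `width`/`height` and `n` requests several videos in one call. A minimal sketch combining the two (worth verifying against your deployment, since not every model honours every field):

```bash
# Request two videos using the size shorthand instead of explicit width/height.
curl -s http://localhost:8091/v1/videos/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A cinematic view of a futuristic city at sunset",
    "size": "832x480",
    "n": 2,
    "seed": 42
  }' -o response.json
```

Each generated video would then appear as its own entry in `data`, so `.data[1].b64_json` selects the second clip.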

## Response Format

```json
{
"created": 1234567890,
"data": [
{ "b64_json": "<base64-mp4>" }
]
}
```

## Extract Video

```bash
# Extract base64 from a saved JSON response and decode it to a video file
jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
```
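
On failure the endpoint returns an error body and a 4xx/5xx status instead of a `data` array, so decoding blindly produces an empty or broken file. A small guard, assuming the response was saved as `response.json`:

```bash
# Only decode if the expected field is present; otherwise print the error body.
if jq -e '.data[0].b64_json' response.json > /dev/null 2>&1; then
    jq -r '.data[0].b64_json' response.json | base64 -d > wan22_output.mp4
else
    echo "Generation failed:" >&2
    cat response.json >&2
fi
```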
23 changes: 23 additions & 0 deletions examples/online_serving/text_to_video/run_curl_text_to_video.sh
@@ -0,0 +1,23 @@
#!/bin/bash
# Wan2.2 text-to-video curl example

OUTPUT_PATH="wan22_output.mp4"

curl -X POST http://localhost:8091/v1/videos/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage.",
"negative_prompt": "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
"height": 480,
"width": 832,
"num_frames": 33,
"fps": 16,
"num_inference_steps": 40,
"guidance_scale": 4.0,
"guidance_scale_2": 4.0,
"boundary_ratio": 0.875,
"flow_shift": 5.0,
"seed": 42
}' | jq -r '.data[0].b64_json' | base64 -d > "${OUTPUT_PATH}"

echo "Saved video to ${OUTPUT_PATH}"
31 changes: 31 additions & 0 deletions examples/online_serving/text_to_video/run_server.sh
@@ -0,0 +1,31 @@
#!/bin/bash
# Wan2.2 online serving startup script

MODEL="${MODEL:-Wan-AI/Wan2.2-T2V-A14B-Diffusers}"
PORT="${PORT:-8091}"
BOUNDARY_RATIO="${BOUNDARY_RATIO:-0.875}"
FLOW_SHIFT="${FLOW_SHIFT:-5.0}"
CACHE_BACKEND="${CACHE_BACKEND:-none}"
ENABLE_CACHE_DIT_SUMMARY="${ENABLE_CACHE_DIT_SUMMARY:-0}"

echo "Starting Wan2.2 server..."
echo "Model: $MODEL"
echo "Port: $PORT"
echo "Boundary ratio: $BOUNDARY_RATIO"
echo "Flow shift: $FLOW_SHIFT"
echo "Cache backend: $CACHE_BACKEND"
if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then
echo "Cache-DiT summary: enabled"
fi

CACHE_BACKEND_FLAG=""
if [ "$CACHE_BACKEND" != "none" ]; then
CACHE_BACKEND_FLAG="--cache-backend $CACHE_BACKEND"
fi

vllm serve "$MODEL" --omni \
--port "$PORT" \
--boundary-ratio "$BOUNDARY_RATIO" \
--flow-shift "$FLOW_SHIFT" \
$CACHE_BACKEND_FLAG \
$(if [ "$ENABLE_CACHE_DIT_SUMMARY" != "0" ]; then echo "--enable-cache-dit-summary"; fi)
13 changes: 9 additions & 4 deletions vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
@@ -358,6 +358,9 @@ def forward(
        self._guidance_scale = guidance_low
        self._guidance_scale_2 = guidance_high

        # Prefer engine-configured boundary_ratio, but allow per-request fallback.
        boundary_ratio = self.boundary_ratio if self.boundary_ratio is not None else req.sampling_params.boundary_ratio

        # validate shapes
        self.check_inputs(
            prompt=prompt,
@@ -366,7 +369,8 @@
            width=width,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            guidance_scale_2=guidance_high if self.boundary_ratio is not None else None,
            guidance_scale_2=guidance_high if boundary_ratio is not None else None,
            boundary_ratio=boundary_ratio,
        )

        if num_frames % self.vae_scale_factor_temporal != 1:
@@ -407,8 +411,8 @@ def forward(
        timesteps = self.scheduler.timesteps
        self._num_timesteps = len(timesteps)
        boundary_timestep = None
        if self.boundary_ratio is not None:
            boundary_timestep = self.boundary_ratio * self.scheduler.config.num_train_timesteps
        if boundary_ratio is not None:
            boundary_timestep = boundary_ratio * self.scheduler.config.num_train_timesteps

        # Handle I2V mode when expand_timesteps=True and image is provided
        multi_modal_data = req.prompts[0].get("multi_modal_data", {}) if not isinstance(req.prompts[0], str) else None
@@ -695,6 +699,7 @@ def check_inputs(
        prompt_embeds=None,
        negative_prompt_embeds=None,
        guidance_scale_2=None,
        boundary_ratio=None,
    ):
        if height % 16 != 0 or width % 16 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 16 but are {height} and {width}.")
@@ -721,5 +726,5 @@
        ):
            raise ValueError(f"`negative_prompt` has to be of type `str` or `list` but is {type(negative_prompt)}")

        if self.boundary_ratio is None and guidance_scale_2 is not None:
        if boundary_ratio is None and guidance_scale_2 is not None:
            raise ValueError("`guidance_scale_2` is only supported when `boundary_ratio` is set.")
51 changes: 51 additions & 0 deletions vllm_omni/entrypoints/openai/api_server.py
@@ -79,8 +79,13 @@
    ImageGenerationRequest,
    ImageGenerationResponse,
)
from vllm_omni.entrypoints.openai.protocol.videos import (
    VideoGenerationRequest,
    VideoGenerationResponse,
)
from vllm_omni.entrypoints.openai.serving_chat import OmniOpenAIServingChat
from vllm_omni.entrypoints.openai.serving_speech import OmniOpenAIServingSpeech
from vllm_omni.entrypoints.openai.serving_video import OmniOpenAIServingVideo
from vllm_omni.inputs.data import OmniDiffusionSamplingParams, OmniSamplingParams, OmniTextPrompt
from vllm_omni.lora.request import LoRARequest
from vllm_omni.lora.utils import stable_lora_int_id
@@ -373,6 +378,12 @@ async def omni_init_app_state(
            diffusion_engine=engine_client,  # type: ignore
            model_name=model_name,
        )
        diffusion_stage_configs = engine_client.stage_configs if hasattr(engine_client, "stage_configs") else None
        state.openai_serving_video = OmniOpenAIServingVideo.for_diffusion(
            diffusion_engine=engine_client,  # type: ignore
            model_name=model_name,
            stage_configs=diffusion_stage_configs,
        )

        state.enable_server_load_tracking = getattr(args, "enable_server_load_tracking", False)
        state.server_load_metrics = 0
@@ -655,6 +666,11 @@ async def omni_init_app_state(
        state.openai_serving_speech = OmniOpenAIServingSpeech(
            engine_client, state.openai_serving_models, request_logger=request_logger
        )
        state.openai_serving_video = OmniOpenAIServingVideo(
            engine_client,
            model_name=served_model_names[0] if served_model_names else None,
            stage_configs=state.stage_configs,
        )

        state.enable_server_load_tracking = args.enable_server_load_tracking
        state.server_load_metrics = 0
@@ -668,6 +684,10 @@ def Omnispeech(request: Request) -> OmniOpenAIServingSpeech | None:
    return request.app.state.openai_serving_speech


def Omnivideo(request: Request) -> OmniOpenAIServingVideo | None:
    return request.app.state.openai_serving_video


@router.post(
    "/v1/chat/completions",
    dependencies=[Depends(validate_json_request)],
@@ -1065,3 +1085,34 @@ async def generate_images(request: ImageGenerationRequest, raw_request: Request)
        raise HTTPException(
            status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value, detail=f"Image generation failed: {str(e)}"
        )


@router.post(
    "/v1/videos/generations",
    dependencies=[Depends(validate_json_request)],
    responses={
        HTTPStatus.OK.value: {"model": VideoGenerationResponse},
        HTTPStatus.BAD_REQUEST.value: {"model": ErrorResponse},
        HTTPStatus.SERVICE_UNAVAILABLE.value: {"model": ErrorResponse},
        HTTPStatus.INTERNAL_SERVER_ERROR.value: {"model": ErrorResponse},
    },
)
async def generate_videos(request: VideoGenerationRequest, raw_request: Request) -> VideoGenerationResponse:
    """Generate videos from text prompts using diffusion models."""
    handler = Omnivideo(raw_request)
    if handler is None:
        raise HTTPException(
            status_code=HTTPStatus.SERVICE_UNAVAILABLE.value,
            detail="Video generation handler not initialized.",
        )
    logger.info("Video generation handler: %s", type(handler).__name__)
    try:
        return await handler.generate_videos(request, raw_request)
    except HTTPException:
        raise
    except Exception as e:
        logger.exception("Video generation failed: %s", e)
        raise HTTPException(
            status_code=HTTPStatus.INTERNAL_SERVER_ERROR.value,
            detail=f"Video generation failed: {str(e)}",
        )
10 changes: 10 additions & 0 deletions vllm_omni/entrypoints/openai/protocol/__init__.py
@@ -8,11 +8,21 @@
    ImageGenerationResponse,
    ResponseFormat,
)
from vllm_omni.entrypoints.openai.protocol.videos import (
    VideoData,
    VideoGenerationRequest,
    VideoGenerationResponse,
    VideoResponseFormat,
)

__all__ = [
    "ImageData",
    "ImageGenerationRequest",
    "ImageGenerationResponse",
    "ResponseFormat",
    "VideoData",
    "VideoGenerationRequest",
    "VideoGenerationResponse",
    "VideoResponseFormat",
    "OmniChatCompletionStreamResponse",
]