
[Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API #1073

Open

SamitHuang wants to merge 15 commits into vllm-project:main from SamitHuang:wan22_online

Conversation

@SamitHuang
Collaborator

@SamitHuang SamitHuang commented Jan 29, 2026


Purpose

  • Support Wan2.2 T2V and I2V Online Serving
  • Add an OpenAI-style T2V and I2V generation API for Wan2.2 that can be reused by other text-to-video models

New APIs

POST /v1/videos

OpenAI-style video generation endpoint.
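
A minimal Python client sketch for this endpoint (assuming a server on port 8091; the multipart field names mirror the -F flags in the test plan below, and the helper name and values here are illustrative, not API defaults):

import base64

import requests

# Illustrative client for POST /v1/videos; field names mirror the curl -F flags
# in the test plan below, and the port/output path are assumptions of this sketch.
def generate_video(prompt: str, server: str = "http://localhost:8091") -> bytes:
    fields = {
        "prompt": prompt,
        "size": "832x480",
        "num_frames": "33",
        "fps": "16",
        "num_inference_steps": "40",
        "guidance_scale": "4.0",
        "seed": "42",
    }
    # (None, value) tuples make requests send each field as multipart/form-data
    resp = requests.post(f"{server}/v1/videos", files={k: (None, v) for k, v in fields.items()})
    resp.raise_for_status()
    return base64.b64decode(resp.json()["data"][0]["b64_json"])

with open("wan22_t2v_output.mp4", "wb") as f:
    f.write(generate_video("Two anthropomorphic cats boxing on a spotlighted stage."))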

Main Logic

The handler maps request fields to OmniDiffusionSamplingParams, routes to
the correct execution backend, extracts the video output, and encodes MP4
to base64.

Client
  |
  | POST /v1/videos (multipart)
  v
APIServer
  |
  v
OmniOpenAIServingVideo
  |
  v
OmniDiffusionSamplingParams
  |
  v
Backend Router
  |----------------------|
  |                      |
  v                      v
AsyncOmniDiffusion    AsyncOmni
  |                      |
  v                      v
DiffusionEngine      DiffusionEngine   (Wan2.2 T2V or I2V)
  |                      |
  v                      v
OmniRequestOutput    OmniRequestOutput
  \______________________/
             |
             v
      encode_video_base64
             |
             v
   VideoGenerationResponse
             |
             v
           Client

When a t2v or i2v request arrives:

  1. For i2v, decode input_reference (reference image) and attach it to multi_modal_data.image.
  2. Parse request and assemble OmniDiffusionSamplingParams.
  3. Determine backend:
    • Pure diffusion: single diffusion stage; use AsyncOmni or
      AsyncOmniDiffusion depending on server configuration.
    • Multi-stage: build sampling_params_list aligned with stage types.
  4. Extract video outputs from OmniRequestOutput.
  5. Encode MP4 with diffusers.utils.export_to_video.
  6. Return VideoGenerationResponse with b64_json.
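
For reference, a condensed Python sketch of steps 1-6. Only OmniDiffusionSamplingParams, export_to_video, VideoGenerationResponse, and b64_json are names from this PR; the constructor arguments, the decode_image helper, and the attribute names on the request/output objects are assumptions made for illustration, not the actual handler code.

import base64
import tempfile

from diffusers.utils import export_to_video

# Sketch of the flow above; attribute names and helpers other than those named
# in the PR description are illustrative assumptions.
async def create_video(self, request):
    prompt = {"prompt": request.prompt}
    if request.input_reference is not None:                  # step 1 (i2v only)
        image = decode_image(await request.input_reference.read())
        prompt["multi_modal_data"] = {"image": image}

    width, height = map(int, request.size.split("x"))         # step 2
    sampling_params = OmniDiffusionSamplingParams(
        width=width,
        height=height,
        num_frames=request.num_frames,
        num_inference_steps=request.num_inference_steps,
        guidance_scale=request.guidance_scale,
        seed=request.seed,
    )

    # step 3: AsyncOmni or AsyncOmniDiffusion; a multi-stage pipeline would
    # build a sampling_params_list aligned with the stage types instead
    engine = self._select_engine()
    output = await engine.generate(prompt, sampling_params)   # step 4: OmniRequestOutput

    frames = output.videos[0]                                  # step 5: MP4, then base64
    with tempfile.TemporaryDirectory() as tmpdir:
        path = f"{tmpdir}/out.mp4"
        export_to_video(frames, path, fps=request.fps)
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()

    return VideoGenerationResponse(data=[{"b64_json": b64}])  # step 6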

Main Changes

  • Protocol schema: vllm_omni/entrypoints/openai/protocol/videos.py
  • API utils: vllm_omni/entrypoints/openai/video_api_utils.py
  • Handler: vllm_omni/entrypoints/openai/serving_video.py
  • Routing: vllm_omni/entrypoints/openai/api_server.py
  • Compatibility: vllm_omni/entrypoints/async_omni.py
  • Examples:
    • examples/online_serving/text_to_video/run_curl_text_to_video.sh
    • examples/online_serving/image_to_video/run_curl_image_to_video.sh
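
The new protocol module is not reproduced in this description; a rough Pydantic sketch of the request/response shapes, with field names and example defaults inferred from the curl commands in the test plan, could look like the following (the actual videos.py may differ):

from typing import Optional

from pydantic import BaseModel

# Rough shape inferred from the -F fields in the test plan; the defaults shown
# here are just the example values from the curl commands, and the real schema
# in vllm_omni/entrypoints/openai/protocol/videos.py may differ.
class VideoGenerationRequest(BaseModel):
    prompt: str
    negative_prompt: Optional[str] = None
    size: str = "832x480"                      # "WIDTHxHEIGHT"
    num_frames: int = 33
    fps: int = 16
    num_inference_steps: int = 40
    guidance_scale: float = 4.0
    guidance_scale_2: Optional[float] = None   # second-stage scale for Wan2.2
    boundary_ratio: Optional[float] = None
    flow_shift: Optional[float] = None
    seed: Optional[int] = None
    # input_reference (the i2v image) arrives as a separate multipart file part

class VideoData(BaseModel):
    b64_json: str

class VideoGenerationResponse(BaseModel):
    data: list[VideoData]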

Test Plan

T2V

Launch the server

vllm serve Wan-AI/Wan2.2-T2V-A14B-Diffusers --omni --port 8091

Send request via curl

curl -X POST http://localhost:8091/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage." \
  -F "size=832x480" \
  -F "num_frames=33" \
  -F "fps=16" \
  -F "negative_prompt=色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=4.0" \
  -F "guidance_scale_2=4.0" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=5.0" \
  -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > "wan22_t2v_output.mp4"

I2V

Launch the server

vllm serve Wan-AI/Wan2.2-I2V-A14B-Diffusers --omni --port 8091 --boundary-ratio 0.875 --flow-shift 12.0

Send request via curl (multipart)

curl -X POST http://localhost:8091/v1/videos \
  -H "Accept: application/json" \
  -F "prompt=A bear playing with yarn, smooth motion" \
  -F "negative_prompt=low quality, blurry, static" \
  -F "input_reference=@examples/offline_inference/image_to_video/qwen-bear.png" \
  -F "size=832x480" \
  -F "num_frames=33" \
  -F "fps=16" \
  -F "num_inference_steps=40" \
  -F "guidance_scale=1.0" \
  -F "guidance_scale_2=1.0" \
  -F "boundary_ratio=0.875" \
  -F "flow_shift=12.0" \
  -F "seed=42" | jq -r '.data[0].b64_json' | base64 -d > "wan22_i2v_output.mp4"

Test Result

T2V

wan22_output.mp4

I2V

wan22_i2v_output.mp4

Future considerations

  • Async processing for video generation, storage, and retrieval
  • Video streaming output

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.


Introduces a video generation API with an extensible request schema and shared diffusion routing so Wan2.2 and future video models can be served consistently.

@SamitHuang SamitHuang marked this pull request as draft January 29, 2026 09:28

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b170ef4fc5


@SamitHuang SamitHuang marked this pull request as ready for review January 29, 2026 10:01

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4a5b024bd5


Comment on lines 28 to 31
if video_tensor.is_floating_point():
    video_tensor = video_tensor.clamp(-1, 1) * 0.5 + 0.5
video_array = video_tensor.float().numpy()
return _normalize_single_video_array(video_array)


P2: Normalize uint8 tensors before float cast

If a model returns video frames as a uint8 torch tensor (0–255), _normalize_video_tensor casts to float before calling _normalize_single_video_array. That skips the integer-scaling path and instead clamps values to [-1, 1], turning most pixels into 1.0 (washed‑out/white frames). Handle integer tensors before the float cast (e.g., scale by 255 or preserve dtype) so post‑processed uint8 outputs encode correctly.
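
A sketch of the handling this comment suggests (not code from this PR): branch on the integer dtype before the float cast.

import torch

# Sketch of the suggested fix, not the PR's code: scale integer frames to [0, 1]
# before the float path so uint8 outputs are not clamped into near-white frames.
def _normalize_video_tensor(video_tensor: torch.Tensor):
    if video_tensor.is_floating_point():
        video_tensor = video_tensor.clamp(-1, 1) * 0.5 + 0.5
    elif video_tensor.dtype == torch.uint8:
        video_tensor = video_tensor.float() / 255.0
    video_array = video_tensor.float().numpy()
    return _normalize_single_video_array(video_array)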


@david6666666
Collaborator

Does this PR support text-to-video and use the same endpoint?

@SamitHuang
Collaborator Author

Does this PR support text-to-video and use the same endpoint?

It supports other T2V models with the same generation endpoint.

@david6666666
Collaborator

Does this PR support text-to-video and use the same endpoint?

It supports other T2V models with the same generation endpoint.

Sorry, I meant image-to-video: does this PR support it?

@SamitHuang
Collaborator Author

Sorry, I meant image-to-video: does this PR support it?

Not currently.

@Bounty-hunter
Contributor

Bounty-hunter commented Feb 3, 2026

Should this also support /v1/videos (https://platform.openai.com/docs/api-reference/videos/create), which is multipart/form-data?

@david6666666
Collaborator

Should this also support /v1/videos (https://platform.openai.com/docs/api-reference/videos/create), which is multipart/form-data?

I think we should follow this OpenAI API endpoint. @SamitHuang WDYT?

@SamitHuang SamitHuang changed the title [Feature] Support Wan2.2 T2V Online Serving [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API Feb 6, 2026
@SamitHuang
Collaborator Author

Should this also support /v1/videos (https://platform.openai.com/docs/api-reference/videos/create), which is multipart/form-data?

I think we should follow this OpenAI API endpoint. @SamitHuang WDYT?

Agreed, I have updated it accordingly.
