Add Initial Qwen-2.5VL Image Processor Support#1008

Merged
sayanshaw24 merged 7 commits into main from
sayanshaw/qwen2-5-vl
Nov 26, 2025

@sayanshaw24 (Collaborator) commented Nov 12, 2025

Description

This PR introduces support for Qwen2.5-VL image preprocessing in ONNX Runtime Extensions. It implements the full resize (including smart resize) → rescale → normalize → patching pipeline required by Qwen2.5-VL-7B-Instruct.
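As a rough illustration of the smart resize stage, the sketch below follows the `smart_resize` logic in HF transformers as I understand it: round each side to a multiple of a factor, then shrink or grow so the pixel count stays within the min/max bounds. The function name `SmartResizeSketch` and its parameters are hypothetical and are not the code added in this PR; the factor and pixel bounds used in practice come from the processor config.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper (not the PR's implementation).
// Returns {new_height, new_width}, both multiples of `factor`.
std::pair<int, int> SmartResizeSketch(int height, int width, int factor,
                                      std::int64_t min_pixels,
                                      std::int64_t max_pixels) {
  assert(height > 0 && width > 0 && factor > 0);

  // Round each side to the nearest multiple of `factor`.
  // std::nearbyint rounds half to even in the default rounding mode,
  // which mirrors Python's round().
  int h_bar = static_cast<int>(std::nearbyint(static_cast<double>(height) / factor)) * factor;
  int w_bar = static_cast<int>(std::nearbyint(static_cast<double>(width) / factor)) * factor;

  const double area = static_cast<double>(height) * width;
  const std::int64_t rounded_area = static_cast<std::int64_t>(h_bar) * w_bar;

  if (rounded_area > max_pixels) {
    // Too large: scale both sides down by beta, rounding down to a multiple.
    const double beta = std::sqrt(area / static_cast<double>(max_pixels));
    h_bar = std::max(factor, static_cast<int>(std::floor(height / beta / factor)) * factor);
    w_bar = std::max(factor, static_cast<int>(std::floor(width / beta / factor)) * factor);
  } else if (rounded_area < min_pixels) {
    // Too small: scale both sides up by beta, rounding up to a multiple.
    const double beta = std::sqrt(static_cast<double>(min_pixels) / area);
    h_bar = static_cast<int>(std::ceil(height * beta / factor)) * factor;
    w_bar = static_cast<int>(std::ceil(width * beta / factor)) * factor;
  }
  return {h_bar, w_bar};
}
```

For example, a 100 x 200 input with a factor of 28 already falls inside typical pixel bounds, so only the rounding step applies and the result is 112 x 196.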

Key updates:

  • New Qwen2.5-VL–compatible PatchImage op

    • Converts RGB HWC to CHW channel ordering
    • Performs accurate temporal padding and 9-D reshape/transpose logic
    • Produces Python-aligned patch embeddings
  • Smart Resize support in the Resize op

    • Implements Qwen2.5-VL’s shortest-edge and longest-edge constraints
    • Supports pixel-count–based resizing (min/max pixels)
    • Matches smart_resize behavior from HF transformers
  • Qwen2.5-VL Normalization handling

    • Fixes channel-order expectations across resize → normalize → patch
    • Ensures exact mean/std behavior equivalent to HF transformers
    • Includes C++ normalization adjustments when Qwen2.5-VL mode is enabled
  • Qwen2.5-VL processor JSON

    • Includes Qwen-specific resize parameters
    • Correct mean/std normalization
    • Patch dimensions (patch_size, merge_size, temporal_patch_size)
  • RGB Correctness Guarantees

    • Fixes the BGR→RGB mismatch observed in PatchImage
    • Ensures alignment with Qwen2.5-VL
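To make the channel-order point in the list above concrete, here is a minimal sketch of rescale and normalize fused with an HWC to CHW conversion for an RGB image. `RescaleNormalizeToChw` is a hypothetical name, and the scale, mean, and std values are passed in as parameters because the real ones live in the processor JSON; this is not the op code from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the PR's implementation).
// Input:  interleaved RGB bytes, layout [height][width][3] (HWC).
// Output: planar floats, layout [3][height][width] (CHW), where each value is
//         (pixel * scale - mean[c]) / stddev[c].
std::vector<float> RescaleNormalizeToChw(const std::vector<std::uint8_t>& hwc_rgb,
                                         int height, int width, float scale,
                                         const std::array<float, 3>& mean,
                                         const std::array<float, 3>& stddev) {
  const std::size_t plane = static_cast<std::size_t>(height) * width;
  assert(hwc_rgb.size() == plane * 3);

  std::vector<float> chw(plane * 3);
  for (std::size_t i = 0; i < plane; ++i) {
    for (std::size_t c = 0; c < 3; ++c) {
      // Source is pixel-major (i * 3 + c); destination is channel-major.
      chw[c * plane + i] = (hwc_rgb[i * 3 + c] * scale - mean[c]) / stddev[c];
    }
  }
  return chw;
}
```

Because mean and std are indexed by channel, applying them to BGR data while assuming RGB would silently swap the red and blue statistics, which is the kind of mismatch the list above refers to.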

This enables image preprocessing for Qwen2.5-VL vision and multimodal models in ORT Extensions, ready to use in ORT GenAI. Note that only single-image input is currently supported; video input is not.
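For the single-image case, the patching step can be sketched with explicit loops instead of the 9-D reshape/transpose. The ordering below is based on my reading of the HF Qwen2-VL image processor: patches are emitted per merge block, and each patch is flattened as (channel, temporal, patch row, patch column), with the lone frame repeated to fill the temporal dimension. `PatchSingleFrameSketch` is a hypothetical name and this is an illustration, not the PatchImage op itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's implementation).
// Input:  one CHW float frame, layout [channels][height][width].
// Output: [grid_h * grid_w][channels * temporal_patch_size * patch_size^2],
//         flattened row-major.
std::vector<float> PatchSingleFrameSketch(const std::vector<float>& chw,
                                          int channels, int height, int width,
                                          int patch_size, int merge_size,
                                          int temporal_patch_size) {
  const int block = patch_size * merge_size;
  assert(height % block == 0 && width % block == 0);
  assert(chw.size() == static_cast<std::size_t>(channels) * height * width);

  const int blocks_h = height / block;
  const int blocks_w = width / block;
  std::vector<float> out;
  out.reserve(chw.size() * temporal_patch_size);

  for (int bh = 0; bh < blocks_h; ++bh)
    for (int bw = 0; bw < blocks_w; ++bw)
      for (int mh = 0; mh < merge_size; ++mh)
        for (int mw = 0; mw < merge_size; ++mw) {
          // Top-left pixel of this patch in the source frame.
          const int y0 = (bh * merge_size + mh) * patch_size;
          const int x0 = (bw * merge_size + mw) * patch_size;
          for (int c = 0; c < channels; ++c)
            // Temporal padding: the single frame is read once per temporal slot.
            for (int t = 0; t < temporal_patch_size; ++t)
              for (int ph = 0; ph < patch_size; ++ph)
                for (int pw = 0; pw < patch_size; ++pw) {
                  const std::size_t src =
                      (static_cast<std::size_t>(c) * height + (y0 + ph)) * width + (x0 + pw);
                  out.push_back(chw[src]);
                }
        }
  return out;
}
```

Writing the loops out this way makes it easy to compare element order against a reference dump, at the cost of being slower than a strided copy.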

Validation

  • C++ unit tests verifying that pixel_values match the reference output (MSE ≤ 1e-3)
  • Step-by-step parity checks across:
    • Resize
    • Rescale
    • Normalize
    • Patch extraction
  • Regression-tested with Phi-3-V, Phi-4, MLlama, Gemma, etc. — no changes in their outputs
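The MSE comparison mentioned above could look roughly like the following. `MeanSquaredError` is a hypothetical helper written for illustration; the PR's actual test code and reference data are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the PR's test code).
// Mean squared error between two equally sized, non-empty float buffers.
double MeanSquaredError(const std::vector<float>& actual,
                        const std::vector<float>& expected) {
  assert(actual.size() == expected.size() && !actual.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    // Accumulate in double to limit rounding error on large tensors.
    const double diff = static_cast<double>(actual[i]) - static_cast<double>(expected[i]);
    sum += diff * diff;
  }
  return sum / static_cast<double>(actual.size());
}
```

A parity test would then check something like `MeanSquaredError(pixel_values, reference) <= 1e-3`, where both buffers are assumed to be flattened in the same order.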

sayanshaw24 marked this pull request as ready for review November 20, 2025 00:05
sayanshaw24 requested a review from a team as a code owner November 20, 2025 00:05
sayanshaw24 enabled auto-merge (squash) November 20, 2025 18:41
sayanshaw24 merged commit ccab2e5 into main Nov 26, 2025
37 checks passed
sayanshaw24 deleted the sayanshaw/qwen2-5-vl branch November 26, 2025 19:48
}

// Add batch dimension (frames = 1 initially) and prepare patches vector
std::vector<float> patches = chw;

We should avoid this vector copy.

I think we can allocate the buffer with the padded size up front, fill it with the padding value, and then apply HWC -> CHW directly into it (the destination offsets need to be computed with the padded height / width).
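One way to read this suggestion is sketched below, under the assumption that the padding is spatial and applied at the bottom and right edges. `HwcToPaddedChw` and its parameters are hypothetical names; the point is only that the output is allocated once at the padded size and the transpose writes straight into it, so no intermediate `chw` vector or copy is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper illustrating the review suggestion (not the PR's code).
// Input:  HWC floats, layout [height][width][channels].
// Output: CHW floats, layout [channels][padded_h][padded_w], with the area
//         outside the source image left at pad_value.
std::vector<float> HwcToPaddedChw(const std::vector<float>& hwc,
                                  int height, int width, int channels,
                                  int padded_h, int padded_w, float pad_value) {
  assert(padded_h >= height && padded_w >= width);
  assert(hwc.size() == static_cast<std::size_t>(height) * width * channels);

  // Single allocation at the final size, pre-filled with the padding value.
  std::vector<float> out(static_cast<std::size_t>(channels) * padded_h * padded_w,
                         pad_value);

  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      for (int c = 0; c < channels; ++c) {
        const std::size_t src = (static_cast<std::size_t>(y) * width + x) * channels + c;
        // Destination offset uses the padded height / width, not the source ones.
        const std::size_t dst = (static_cast<std::size_t>(c) * padded_h + y) * padded_w + x;
        out[dst] = hwc[src];
      }
  return out;
}
```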

4 participants