Extend VisionAddOn Pattern to Qwen2.5VL #167

Open
@mattjcly

Currently, the only multi-modal models that have been migrated to the "unified" architecture are Gemma3 and Pixtral:

VISION_ADD_ON_MAP = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
}

We would like to extend this pattern to Qwen2.5VL/Qwen2VL.
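For illustration, the end state could look roughly like the sketch below. The `BaseVisionAddOn` base class, the `load_vision_add_on` helper, the stub constructors, and the `"qwen2_vl"` / `"qwen2_5_vl"` keys are all assumptions made for this example, not existing mlx-engine code; whether Qwen2VL can share one add-on class with Qwen2.5VL is also an open question.

```python
from typing import Dict, Optional, Type


class BaseVisionAddOn:
    """Hypothetical stand-in for the add-on interface (not the real mlx-engine class)."""

    def __init__(self, model_path: str) -> None:
        self.model_path = model_path


# Stubs so the sketch runs on its own; the real classes live in mlx-engine.
class Gemma3VisionAddOn(BaseVisionAddOn):
    pass


class PixtralVisionAddOn(BaseVisionAddOn):
    pass


class Qwen2_5_VLVisionAddOn(BaseVisionAddOn):
    """The add-on this issue proposes; only a placeholder here."""


VISION_ADD_ON_MAP: Dict[str, Type[BaseVisionAddOn]] = {
    "gemma3": Gemma3VisionAddOn,
    "pixtral": PixtralVisionAddOn,
    # Assumed keys: taken to match the model_type strings in the model configs.
    "qwen2_vl": Qwen2_5_VLVisionAddOn,
    "qwen2_5_vl": Qwen2_5_VLVisionAddOn,
}


def load_vision_add_on(model_type: str, model_path: str) -> Optional[BaseVisionAddOn]:
    """Hypothetical helper: return an add-on for migrated model types, else None."""
    add_on_cls = VISION_ADD_ON_MAP.get(model_type)
    return add_on_cls(model_path) if add_on_cls is not None else None
```

Returning `None` for unmapped model types leaves room for models that have not been migrated yet to keep using their existing code path.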

Relevant mlx-vlm components:

Relevant mlx-lm components:

This will likely look like:

  1. Ensure the Qwen2.5VL text model architecture is implemented correctly in mlx-lm, including MRoPE (see https://arxiv.org/abs/2502.13923 for details, and the mlx-vlm PR "Apply PR #319 fixes to Qwen 2.5VL position id" (#349) for in-progress work)
  2. Implement Qwen2_5_VLVisionAddOn and wire it into ModelKit
  3. Ensure Qwen2.5VL tests in mlx-engine still pass
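To make the MRoPE part of step 1 concrete, here is a minimal pure-Python sketch of how the three position-id streams (temporal, height, width) could be laid out for a mixed text/image prompt. It reflects my understanding of the scheme and is not code from mlx-lm or mlx-vlm: the `mrope_position_ids` function and its segment format are hypothetical, the image grid is assumed to be given after spatial merging, and the temporal scaling that Qwen2.5VL applies to video is left out.

```python
from typing import List, Sequence, Tuple


def mrope_position_ids(
    segments: Sequence[tuple],
) -> Tuple[List[int], List[int], List[int]]:
    """Build (temporal, height, width) position ids for a token sequence.

    Each segment is either ("text", n_tokens) or ("image", (t, h, w)),
    where (t, h, w) is the grid of vision tokens after spatial merging.
    Hypothetical helper for illustration only.
    """
    t_ids: List[int] = []
    h_ids: List[int] = []
    w_ids: List[int] = []
    next_pos = 0  # first free position for the next segment

    for kind, spec in segments:
        if kind == "text":
            # Text tokens use the same id on all three axes, like plain 1D RoPE.
            for i in range(spec):
                pos = next_pos + i
                t_ids.append(pos)
                h_ids.append(pos)
                w_ids.append(pos)
            next_pos += spec
        elif kind == "image":
            t, h, w = spec
            # Vision tokens are offset by the segment start and indexed by
            # their grid coordinates, flattened in (t, h, w) order.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_ids.append(next_pos + ti)
                        h_ids.append(next_pos + hi)
                        w_ids.append(next_pos + wi)
            # Following tokens resume after the largest id used by the image.
            next_pos += max(t, h, w)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")

    return t_ids, h_ids, w_ids


t_ids, h_ids, w_ids = mrope_position_ids(
    [("text", 2), ("image", (1, 2, 2)), ("text", 1)]
)
# t_ids == [0, 1, 2, 2, 2, 2, 4]
# h_ids == [0, 1, 2, 2, 3, 3, 4]
# w_ids == [0, 1, 2, 3, 2, 3, 4]
```

Note that the trailing text token gets position 4 rather than 6: positions advance by the largest grid dimension, not by the number of image tokens. This is one place where a text-only implementation that assumes sequential positions would diverge.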

Labels: enhancement (New feature or request)
