Currently, the only multi-modal models that have been migrated to the "unified" architecture are Gemma3 and Pixtral:
mlx-engine/mlx_engine/model_kit/model_kit.py, lines 35 to 38 in ecc2cf4
Extending this pattern to Qwen2.5VL/Qwen2VL is desired.
Relevant mlx-vlm components:
- https://github.com/Blaizzy/mlx-vlm/tree/2068970094c78878c77fd78677d1316933562ade/mlx_vlm/models/qwen2_5_vl
- https://github.com/Blaizzy/mlx-vlm/tree/main/mlx_vlm/models/qwen2_vl
Relevant mlx-lm components:
This will likely look like:
- Ensure the Qwen2.5VL text model architecture is implemented correctly in mlx-lm (including MRoPE; see https://arxiv.org/abs/2502.13923 for details, and "Apply PR #319 fixes to Qwen 2.5VL position id" (#349) for in-progress mlx-vlm work)
- Implement Qwen2_5_VLVisionAddOn and wire it into ModelKit
- Ensure the Qwen2.5VL tests in mlx-engine still pass
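
For reference on the MRoPE item, here is a minimal pure-Python sketch of how sectioned multimodal rotary angles can be computed. It is not taken from mlx-lm or mlx-vlm: the function name `mrope_angles`, the section sizes `[16, 24, 24]`, and the default `base` are illustrative assumptions, and it assumes the sectioned layout in which consecutive blocks of rotary frequencies are driven by the temporal, height, and width position ids respectively.

```python
import math


def mrope_angles(position_ids, mrope_section, base=10000.0):
    """Return rotary angles[token][freq] for sectioned multimodal RoPE.

    position_ids: three equal-length lists (temporal, height, width).
    mrope_section: how many rotary frequencies each axis drives
                   (hypothetical example: [16, 24, 24] for head_dim 128).
    base: rotary base; the default here is illustrative only.
    """
    half_dim = sum(mrope_section)
    # Standard RoPE inverse frequencies: base^(-2i/dim) with dim = 2 * half_dim
    inv_freq = [base ** (-i / half_dim) for i in range(half_dim)]

    # Map each frequency index to the position axis that drives it
    axis_of = []
    for axis, width in enumerate(mrope_section):
        axis_of.extend([axis] * width)

    n_tokens = len(position_ids[0])
    return [
        [position_ids[axis_of[i]][t] * inv_freq[i] for i in range(half_dim)]
        for t in range(n_tokens)
    ]


# Text-only tokens carry the same index on all three axes,
# so the angles reduce to ordinary 1D RoPE.
text_positions = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
text_angles = mrope_angles(text_positions, [16, 24, 24])

# A single vision token at (t=0, h=1, w=2) gets a different
# position per frequency section.
image_angles = mrope_angles([[0], [1], [2]], [16, 24, 24])
```

One property worth covering in tests is exactly this text-only case: if the three position axes are equal, the result should match the text model's existing RoPE, so text-only prompts behave the same before and after the migration.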