Vision/Image support for custom provider models (e.g. Xiaomi MiMo-V2.5) #3088

@ajangsupardi

Description

MiMo-V2.5 is a natively omnimodal model from Xiaomi (text, image, video, audio), but when it is configured as a model under a custom provider in crush.json, Crush rejects image input with:

This model (MiMo-V2.5) does not support image data.

MiMo-V2.5 is not listed in Crush's internal model registry, so Crush blocks image input even though the model supports it natively via a 729M-param Vision Transformer encoder.
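
The behavior described above can be pictured with a small sketch. This is a Python illustration of a registry-only check, not Crush's actual code: `KNOWN_VISION_MODELS`, `check_image_support`, and the example model ID are hypothetical names, and only the error text is taken from this report.

```python
# Hypothetical stand-in for a built-in registry of vision-capable model IDs.
KNOWN_VISION_MODELS = {"example-vision-model"}


def check_image_support(model_id: str, display_name: str, has_image: bool) -> None:
    """Raise if an image is attached and the model ID is not in the registry."""
    if has_image and model_id not in KNOWN_VISION_MODELS:
        raise ValueError(f"This model ({display_name}) does not support image data.")


# A custom model that the registry has never heard of is rejected,
# regardless of what the model itself can do.
try:
    check_image_support("mimo-v2.5", "MiMo-V2.5", has_image=True)
except ValueError as err:
    print(err)
```

With a check of this shape, a custom model has no way to opt in: the decision depends only on whether its ID happens to be in the built-in list.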

Expected Behavior

Custom provider models should be able to receive image data, either:

  1. By adding a vision: true (or capabilities) field to custom model definitions in crush.json
  2. By adding MiMo-V2.5 to the vision-capable model list in Catwalk

Current Model Config

{
  "providers": {
    "xiaomi": {
      "type": "anthropic",
      "base_url": "https://token-plan-sgp.xiaomimimo.com/anthropic",
      "models": [
        {
          "id": "mimo-v2.5",
          "name": "MiMo-V2.5",
          "context_window": 262144
        }
      ]
    }
  }
}

Suggested Fix

Allow custom provider models to declare capabilities, e.g.:

{
  "id": "mimo-v2.5",
  "name": "MiMo-V2.5",
  "context_window": 262144,
  "vision": true
}
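
A minimal sketch of how such a flag could be honored, written in Python for illustration only. The `vision` field is the one proposed in this issue and is not claimed to exist in Crush today; `BUILTIN_VISION_MODELS` and `supports_images` are hypothetical names. The idea is that an explicit flag in the custom model definition takes precedence, and the built-in list is consulted only when the flag is absent.

```python
import json

# Hypothetical built-in list; stands in for whatever registry the tool ships with.
BUILTIN_VISION_MODELS = {"example-vision-model"}

# Trimmed copy of the config from this issue, with the proposed `vision` field added.
CONFIG = """
{
  "providers": {
    "xiaomi": {
      "type": "anthropic",
      "models": [
        {"id": "mimo-v2.5", "name": "MiMo-V2.5", "context_window": 262144, "vision": true}
      ]
    }
  }
}
"""


def supports_images(model: dict) -> bool:
    """Prefer the explicit `vision` flag; fall back to the built-in list."""
    declared = model.get("vision")
    if isinstance(declared, bool):
        return declared
    return model["id"] in BUILTIN_VISION_MODELS


config = json.loads(CONFIG)
for provider in config["providers"].values():
    for model in provider["models"]:
        print(model["id"], supports_images(model))
```

Checking the explicit flag first also lets a user turn image input off for a model that the built-in list marks as vision-capable, for example when a proxy endpoint does not forward images.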
