kiarina/mlx-qwen3-omni-server

mlx-qwen3-omni-server

OpenAI-compatible HTTP server for Qwen3-Omni on Apple Silicon, wrapping mlx-vlm.

One endpoint — POST /v1/chat/completions — accepts text + image + audio + video in a single message and returns text and/or tool calls. A single in-process worker runs generations one at a time (MLX is memory-heavy and thread-affine); concurrent requests queue. The model is loaded once and kept resident. Sync only (no async jobs).
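The single-worker queueing described above can be pictured with a minimal sketch. This is not the server's actual code: the class name `SingleFlightWorker` and its methods are hypothetical, and a plain lambda stands in for a model generation.

```python
import queue
import threading
from concurrent.futures import Future


class SingleFlightWorker:
    """Runs submitted jobs one at a time on a single dedicated thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn, *args):
        """Queue fn(*args) and return a Future for its result."""
        future = Future()
        self._jobs.put((future, fn, args))
        return future

    def queue_len(self):
        """Approximate number of jobs that have not started yet."""
        return self._jobs.qsize()

    def _run(self):
        # Every job executes on this one thread, so jobs never overlap and
        # thread-affine state is only ever touched from here.
        while True:
            future, fn, args = self._jobs.get()
            try:
                future.set_result(fn(*args))
            except Exception as exc:  # hand failures back to the caller
                future.set_exception(exc)


worker = SingleFlightWorker()
# Three "requests" arrive at once; they are served strictly in order.
futures = [worker.submit(lambda n: n * 2, i) for i in range(3)]
results = [f.result(timeout=5) for f in futures]
```

HTTP handlers would call `submit` and wait on the returned future, which is one way the API can stay responsive while a generation is running.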

Features

  • OpenAI Chat Completions compatible: messages, tools, tool_choice. Use any OpenAI client; point base_url at this server.
  • Multimodal in: text, images, audio, and video, interleaved in content. Output is text only (Qwen3-Omni's speech/Talker output is not exposed).
  • Tool calling: selection from multiple tools, parallel calls, and forced calls via tool_choice (auto / none / required / a specific function).
  • Single-flight queue — generations never overlap; the API stays responsive.

Requirements

  • Apple Silicon (required by MLX); about 32 GB or more of unified memory is recommended for the 4-bit model.
  • ffmpeg on PATH (used to decode audio/video).
  • Weights are downloaded on first use from the configured repo (default mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit, ~22 GB) and cached by Hugging Face.
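A small preflight check can catch the first two requirements before a long model download starts. The sketch below is an illustration only: `preflight_problems` is a hypothetical helper, it assumes a native arm64 Python (a Python running under Rosetta reports `x86_64`), and it does not check available memory.

```python
import platform
import shutil


def preflight_problems(which=shutil.which, machine=platform.machine,
                       system=platform.system):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Assumption: macOS on arm64 is how Apple Silicon shows up in Python.
    if system() != "Darwin" or machine() != "arm64":
        problems.append("MLX needs Apple Silicon (macOS on arm64)")
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (needed to decode audio/video)")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print("warning:", problem)
```

The lookups are passed in as parameters so the check can be exercised on any machine with stand-in functions.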

Run

uv run mlx-qwen3-omni-server
# or
uv run python -m mlx_qwen3_omni_server

On startup the model is loaded and a tiny warmup generation primes the MLX kernels. GET /health reports the result in its warm field.
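Because loading and warmup take time, a client may want to wait for the server before sending work. The following is a minimal polling sketch; it assumes that `warm` in the /health response is a JSON boolean, and the helper names `fetch_health` and `wait_until_warm` are hypothetical.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://127.0.0.1:8000"):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_warm(fetch=fetch_health, timeout_s=600.0, interval_s=2.0,
                    sleep=time.sleep):
    """Poll until /health reports warm; return True, or False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch().get("warm"):
                return True
        except OSError:
            pass  # server is not accepting connections yet
        sleep(interval_s)
    return False
```

Calling `wait_until_warm()` right after launching the server blocks until the first request can be served at full speed, or gives up after the timeout.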

Configuration (environment variables)

| Variable | Default | Meaning |
| --- | --- | --- |
| MLX_QWEN3_OMNI_MODEL_REPO | mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit | Model repo |
| MLX_QWEN3_OMNI_HOST / _PORT | 127.0.0.1 / 8000 | Bind address |
| MLX_QWEN3_OMNI_AUTH_TOKEN | (unset) | If set, /v1/* requires Authorization: Bearer <token> |
| MLX_QWEN3_OMNI_WARMUP | 1 | Warm up at startup |
| MLX_QWEN3_OMNI_MAX_TOKENS | 512 | Default max_tokens |
| MLX_QWEN3_OMNI_MAX_TOKENS_CAP | 4096 | Hard cap on max_tokens |
| MLX_QWEN3_OMNI_TEMPERATURE | 0.7 | Default temperature |
| MLX_QWEN3_OMNI_TOP_P | 1.0 | Default top_p |
| MLX_QWEN3_OMNI_FPS | 1.0 | Frames/sec sampled from input videos |
| MLX_QWEN3_OMNI_USE_AUDIO_IN_VIDEO | 1 | If a video has sound, demux its audio and feed it as audio too |
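To make the relationship between the default and the hard cap concrete, here is a sketch that reads a subset of these variables. It is not the server's implementation: the `Settings` class and both functions are hypothetical, treating "1" as true is an assumption, and so is clamping an over-cap request rather than rejecting it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    max_tokens: int
    max_tokens_cap: int
    temperature: float
    fps: float
    warmup: bool


def load_settings(env=os.environ):
    """Read a few documented variables, falling back to the documented defaults."""
    return Settings(
        max_tokens=int(env.get("MLX_QWEN3_OMNI_MAX_TOKENS", "512")),
        max_tokens_cap=int(env.get("MLX_QWEN3_OMNI_MAX_TOKENS_CAP", "4096")),
        temperature=float(env.get("MLX_QWEN3_OMNI_TEMPERATURE", "0.7")),
        fps=float(env.get("MLX_QWEN3_OMNI_FPS", "1.0")),
        # Assumption: the flag is enabled only by the literal string "1".
        warmup=env.get("MLX_QWEN3_OMNI_WARMUP", "1") == "1",
    )


def effective_max_tokens(requested, settings):
    """Use the default when the request omits max_tokens, then apply the cap."""
    value = settings.max_tokens if requested is None else requested
    # Assumption: values above the cap are clamped, not rejected.
    return min(value, settings.max_tokens_cap)
```

With no variables set, a request without max_tokens gets 512 tokens, and a request asking for 10000 is limited to 4096 under these assumptions.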

API

POST /v1/chat/completions

OpenAI Chat Completions request. content may be a string or a list of parts:

| Part | Shape | Aliases and notes |
| --- | --- | --- |
| text | {"type":"text","text":"..."} | |
| image | {"type":"image_url","image_url":{"url":"data:image/png;base64,..."}} | image, input_image; url may be http(s) or a local path |
| audio | {"type":"input_audio","input_audio":{"data":"<base64>","format":"wav"}} | audio, audio_url |
| video | {"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}} | video, input_video; url may be http(s) or a local path |
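The part shapes in the table can be produced from raw bytes with a few small helpers. The function names below are hypothetical conveniences, not part of this server or of the OpenAI SDK; only the resulting dictionaries follow the shapes listed above.

```python
import base64


def _b64(raw: bytes) -> str:
    return base64.b64encode(raw).decode("ascii")


def data_url(raw: bytes, mime: str) -> str:
    """Encode bytes as a data: URL with the given MIME type."""
    return f"data:{mime};base64,{_b64(raw)}"


def text_part(text: str) -> dict:
    return {"type": "text", "text": text}


def image_part(raw: bytes, mime: str = "image/png") -> dict:
    return {"type": "image_url", "image_url": {"url": data_url(raw, mime)}}


def audio_part(raw: bytes, fmt: str = "wav") -> dict:
    # Audio carries bare base64 plus a format name, not a data: URL.
    return {"type": "input_audio", "input_audio": {"data": _b64(raw), "format": fmt}}


def video_part(raw: bytes, mime: str = "video/mp4") -> dict:
    return {"type": "video_url", "video_url": {"url": data_url(raw, mime)}}
```

A message is then built as `{"role": "user", "content": [text_part("..."), image_part(open(path, "rb").read())]}`, with parts in the order the model should read them.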

tool_choice: "auto" (default) · "none" · "required" (alias "any") · {"type":"function","function":{"name":"..."}} (or a bare tool name).
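The accepted spellings of tool_choice reduce to four canonical forms. The sketch below shows one way to normalize them; `normalize_tool_choice` is a hypothetical helper, and rejecting a name that is not among the declared tools is an assumption about sensible behavior rather than documented server behavior.

```python
def normalize_tool_choice(choice, tool_names):
    """Map accepted spellings onto "auto", "none", "required", or a function dict."""
    if choice is None:
        return "auto"  # documented default
    if isinstance(choice, dict):
        name = choice.get("function", {}).get("name")
    elif choice in ("auto", "none", "required"):
        return choice
    elif choice == "any":
        return "required"  # documented alias
    else:
        name = choice  # a bare tool name
    if name not in tool_names:
        # Assumption: an unknown tool name is treated as an error.
        raise ValueError(f"unknown tool in tool_choice: {name!r}")
    return {"type": "function", "function": {"name": name}}
```

For example, both `"get_weather"` and `{"type": "function", "function": {"name": "get_weather"}}` normalize to the same function dict when `get_weather` is declared in tools.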

# Text
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Introduce yourself"}]}'

# Image (data URL) + question
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":[
        {"type":"text","text":"Describe this image"},
        {"type":"image_url","image_url":{"url":"data:image/png;base64,iVBOR..."}}]}]}'

# Force a specific tool
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is the weather in Osaka?"}],
       "tools":[{"type":"function","function":{"name":"get_weather",
         "parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
       "tool_choice":{"type":"function","function":{"name":"get_weather"}}}'

With the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="qwen3-omni",
    messages=[{"role": "user", "content": "Hello"}])
print(r.choices[0].message.content)
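When MLX_QWEN3_OMNI_AUTH_TOKEN is set, requests to /v1/* need a bearer token. The OpenAI SDK sends its api_key as a bearer token, so passing the token as api_key should be enough there. For clients without the SDK, the sketch below builds the same request with the standard library; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://127.0.0.1:8000",
                       token=None, **params):
    """Build a POST /v1/chat/completions request, with optional bearer auth."""
    body = json.dumps({"messages": messages, **params}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body, headers=headers, method="POST",
    )


def chat(messages, timeout_s=600, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_chat_request(messages, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)
```

The generous timeout reflects that requests queue behind whatever generation is already running.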

Heaviest path: video + frames + audio → forced tool call

A worked example that combines everything at once: a 45 s music video (video_url), its audio track (input_audio), 15 timestamp-labelled frames (image_url), a record_highlight tool, and a tool_choice that forces that function. The model analyses all modalities and returns a single, well-formed tool call. The run used about 10k prompt tokens, took about 16 s, and peaked at about 28–29 GB of memory.

content = [
    {"type": "text", "text": "First, here is the full video and the song from that music video."},
    {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,<...>"}},
    {"type": "input_audio", "input_audio": {"data": "<base64 wav>", "format": "wav"}},
    {"type": "text", "text": "Next, here are frames at 3-second intervals, each labelled with its timestamp."},
    # repeated per frame, in order:
    {"type": "text", "text": "[t=0s]"},  {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<...>"}},
    # ... t=3, 6, 9, ... 42 ...
    {"type": "text", "text": "Call record_highlight with the timestamp in seconds of the most striking moment and the reason as arguments."},
]
r = client.chat.completions.create(
    model="qwen3-omni",
    messages=[{"role": "user", "content": content}],
    tools=[{"type": "function", "function": {
        "name": "record_highlight",
        "parameters": {"type": "object", "properties": {
            "timestamp_seconds": {"type": "integer"},
            "reason": {"type": "string"}}, "required": ["timestamp_seconds", "reason"]}}}],
    tool_choice={"type": "function", "function": {"name": "record_highlight"}},
    extra_body={"fps": 1},
)

Response (choices[0].message.tool_calls[0]):

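{
  "id": "call_8d0449a04e164491a150e069",
  "type": "function",
  "function": {
    "name": "record_highlight",
    "arguments": "{\"timestamp_seconds\": 33, \"reason\": \"The character's waving gesture symbolizes the warmth and friendliness of the whole video and resonates with the warm-toned lighting, the wooden interior, and the gentle rhythm of the song, making it the most striking moment\"}"
  }
}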
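The repeated label-plus-image pairs in the request's content list can be generated rather than written by hand. This is a minimal sketch: `labelled_frame_parts` is a hypothetical helper, and the exact label text is an illustrative choice.

```python
def labelled_frame_parts(frame_data_urls, interval_s=3):
    """Interleave a timestamp label before each frame, in playback order."""
    parts = []
    for index, url in enumerate(frame_data_urls):
        # The text label tells the model which second the next image shows.
        parts.append({"type": "text", "text": f"[t={index * interval_s}s]"})
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return parts
```

For the 45 s clip above, 15 frames at 3 s intervals give labels from t=0 to t=42 and 30 content parts, which can be spliced into the content list between the introductory text and the final instruction.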

Tip: under heavy multimodal load, the model may ignore a plain "please call the tool" instruction or emit a malformed call. Use tool_choice="required" or a specific function to force a clean, schema-correct call. Parallel calls (multiple <tool_call> blocks in one reply) are also supported: for example, asking for the weather in three cities yields three get_weather tool calls.
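On the client side, parallel calls arrive as several entries in tool_calls, each with its arguments encoded as a JSON string. The sketch below runs them in order and builds role="tool" result messages in the usual OpenAI convention; whether this server accepts such follow-up messages is an assumption. The `get_weather` stand-in, the call ids, and the city names are all illustrative.

```python
import json


def get_weather(location):
    """Stand-in tool; a real implementation would query a weather service."""
    return f"(weather for {location})"


TOOLS = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOLS):
    """Execute each call and return role="tool" messages in the same order."""
    results = []
    for call in tool_calls:
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        output = registry[fn["name"]](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": str(output),
        })
    return results


# Hypothetical reply containing three parallel calls.
calls = [
    {"id": f"call_{i}", "type": "function",
     "function": {"name": "get_weather",
                  "arguments": json.dumps({"location": city})}}
    for i, city in enumerate(["Osaka", "Tokyo", "Kyoto"])
]
messages = run_tool_calls(calls)
```

With the OpenAI SDK, the dict form used here can be obtained from the response objects with `[c.model_dump() for c in r.choices[0].message.tool_calls]`.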

GET /help returns a machine-readable usage guide. GET /health returns {status, warm, queue_len, model}.

Notes & limitations

  • Text output only — Qwen3-Omni can generate speech, but that path is not exposed here.
  • Video with sound is handled automatically: if an input video has an audio track, the server demuxes it with ffmpeg and feeds the model both the audio and the video, so it sees and hears the clip (toggle with use_audio_in_video).
  • Only one audio input per request is used (an mlx-vlm limitation; extras are ignored). A video's demuxed audio counts toward this limit, so don't pass a separate audio part alongside a video that has sound.
  • Long/large videos cost many tokens and a lot of memory; keep them short and/or lower fps.
  • No streaming (stream:true → 400).
  • Bundles a small compatibility shim (compat.py) so image + video simultaneous input works on mlx-vlm 0.6.1 + mlx 0.31.x.
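Some of these limits can be checked on the client before a request is sent. The sketch below is a hypothetical lint pass, not something the server provides. It cannot tell whether a video actually has sound, so it only warns when audio and video parts appear together.

```python
AUDIO_TYPES = {"input_audio", "audio", "audio_url"}
VIDEO_TYPES = {"video_url", "video", "input_video"}


def request_warnings(payload):
    """Return warnings for a chat payload that may hit a documented limit."""
    warnings = []
    if payload.get("stream"):
        warnings.append("stream=true is rejected with HTTP 400")
    audio = video = 0
    for message in payload.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain string content has no media parts
        for part in content:
            kind = part.get("type")
            if kind in AUDIO_TYPES:
                audio += 1
            elif kind in VIDEO_TYPES:
                video += 1
    if audio > 1:
        warnings.append("only one audio input is used; extras are ignored")
    if audio and video:
        warnings.append("if the video has sound, its audio already fills the "
                        "single audio slot")
    return warnings
```

Running this before sending a large multimodal request is cheap compared with waiting in the queue for a generation that silently drops an input.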

License

MIT
