mlx-qwen3-omni-server

OpenAI-compatible HTTP server for Qwen3-Omni on Apple Silicon, wrapping mlx-vlm.

One endpoint, POST /v1/chat/completions, accepts text, image, audio, and video parts in a single message and returns text and/or tool calls. A single in-process worker runs generations one at a time (MLX is memory-heavy and thread-affine), so concurrent requests queue. The model is loaded once and kept resident. Requests are synchronous only; there are no async jobs.

Features

OpenAI Chat Completions compatible — messages, tools, tool_choice. Use any OpenAI client; point base_url at this server.
Multimodal in: text, images, audio, and video, interleaved in content. Output is text only (Qwen3-Omni's speech/Talker output is not exposed).
Tool calling: selection from multiple tools, parallel calls, and forced calls via tool_choice (auto / none / required / a specific function).
Single-flight queue — generations never overlap; the API stays responsive.
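
The single-flight behaviour can be pictured as one worker thread draining a FIFO queue. What follows is a minimal sketch of that pattern, not the server's actual code: `SingleFlightWorker` and `fake_generate` are hypothetical names, and a plain function stands in for the MLX generation call.

```python
import queue
import threading
from concurrent.futures import Future


class SingleFlightWorker:
    """Run submitted jobs one at a time on a single dedicated thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, fn, *args):
        """Enqueue a job and return a Future for its result."""
        fut = Future()
        self._jobs.put((fut, fn, args))
        return fut

    def queue_len(self):
        """Approximate number of jobs still waiting."""
        return self._jobs.qsize()

    def _run(self):
        # Only this thread ever calls fn, so jobs cannot overlap.
        while True:
            fut, fn, args = self._jobs.get()
            try:
                fut.set_result(fn(*args))
            except Exception as exc:
                fut.set_exception(exc)


def fake_generate(prompt):
    # Hypothetical stand-in for a model generation call.
    return f"echo: {prompt}"


worker = SingleFlightWorker()
futures = [worker.submit(fake_generate, p) for p in ("a", "b", "c")]
results = [f.result(timeout=5) for f in futures]
```

Because every job runs on the same thread, this layout also suits a thread-affine runtime, while HTTP handlers only wait on futures and stay free to answer other requests.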

Requirements

Apple Silicon (MLX); 32 GB or more of unified memory is recommended for the 4-bit model.
ffmpeg on PATH (used to decode audio/video).
Weights are downloaded on first use from the configured repo (default mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit, ~22 GB) and cached by Hugging Face.
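
A quick way to confirm the ffmpeg requirement before starting the server is to look it up on PATH. This is a small client-side sketch; `ffmpeg_available` is a hypothetical helper, not part of this package.

```python
import shutil


def ffmpeg_available(which=shutil.which):
    """Return True if an `ffmpeg` executable is found on PATH."""
    return which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH; audio and video inputs need it")
```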

Run

uv run mlx-qwen3-omni-server
# or
uv run python -m mlx_qwen3_omni_server

On startup the model is loaded and a tiny warmup generation primes the MLX kernels. GET /health reports the warmup state in its warm field.
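
A client or launch script may want to block until the warmup has finished. The sketch below polls GET /health for that. It assumes `warm` is a boolean and makes no assumption about whether the endpoint answers before loading completes (connection errors are simply retried). `wait_until_warm` and `fetch_health` are hypothetical helper names.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_warm(base_url, fetch=fetch_health, timeout_s=600.0, interval_s=2.0):
    """Poll /health until `warm` is truthy; return that payload or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch(base_url)
            if health.get("warm"):
                return health
        except OSError:
            pass  # server not accepting connections yet; retry
        if time.monotonic() >= deadline:
            raise TimeoutError("server did not report warm in time")
        time.sleep(interval_s)
```

Usage: `wait_until_warm("http://127.0.0.1:8000")` before sending the first request.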

Configuration (environment variables)

| Variable | Default | Meaning |
| --- | --- | --- |
| `MLX_QWEN3_OMNI_MODEL_REPO` | `mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit` | Model repo |
| `MLX_QWEN3_OMNI_HOST` / `_PORT` | `127.0.0.1` / `8000` | Bind address |
| `MLX_QWEN3_OMNI_AUTH_TOKEN` | (unset) | If set, `/v1/*` requires `Authorization: Bearer <token>` |
| `MLX_QWEN3_OMNI_WARMUP` | `1` | Warm up at startup |
| `MLX_QWEN3_OMNI_MAX_TOKENS` | `512` | Default `max_tokens` |
| `MLX_QWEN3_OMNI_MAX_TOKENS_CAP` | `4096` | Hard cap on `max_tokens` |
| `MLX_QWEN3_OMNI_TEMPERATURE` | `0.7` | Default temperature |
| `MLX_QWEN3_OMNI_TOP_P` | `1.0` | Default top_p |
| `MLX_QWEN3_OMNI_FPS` | `1.0` | Frames/sec sampled from input videos |
| `MLX_QWEN3_OMNI_USE_AUDIO_IN_VIDEO` | `1` | If a video has sound, demux its audio and feed it as audio too |
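
To illustrate how a default and a hard cap can interact, here is a sketch that resolves an effective `max_tokens` from the two variables above. Clamping to the cap is an assumption made for illustration; the server might instead reject over-cap requests. `env_int` and `resolve_max_tokens` are hypothetical names.

```python
import os


def env_int(name, default, env=os.environ):
    """Read an integer setting, falling back to the documented default."""
    return int(env.get(name, default))


def resolve_max_tokens(requested, env=os.environ):
    """Use the request value if given, else the default, then clamp to the cap."""
    default = env_int("MLX_QWEN3_OMNI_MAX_TOKENS", 512, env)
    cap = env_int("MLX_QWEN3_OMNI_MAX_TOKENS_CAP", 4096, env)
    value = default if requested is None else requested
    return min(value, cap)
```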

API

`POST /v1/chat/completions`

The request body follows the OpenAI Chat Completions format. A message's content may be a string or a list of parts:

| Part | Shape | Aliases |
| --- | --- | --- |
| text | `{"type":"text","text":"..."}` | |
| image | `{"type":"image_url","image_url":{"url":"data:image/png;base64,..."}}` | `image`, `input_image`; `url` may be http(s) or a local path |
| audio | `{"type":"input_audio","input_audio":{"data":"<base64>","format":"wav"}}` | `audio`, `audio_url` |
| video | `{"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}}` | `video`, `input_video`; `url` may be http(s) or a local path |
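
The part shapes above are easy to build from raw bytes. The helpers below are a client-side sketch (the function names are hypothetical); note that image and video parts carry a data URL while `input_audio` carries bare base64 plus a format.

```python
import base64


def data_url(raw, mime):
    """Encode raw bytes as a base64 data URL."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"


def text_part(text):
    return {"type": "text", "text": text}


def image_part(raw, mime="image/png"):
    return {"type": "image_url", "image_url": {"url": data_url(raw, mime)}}


def video_part(raw, mime="video/mp4"):
    return {"type": "video_url", "video_url": {"url": data_url(raw, mime)}}


def audio_part(raw, fmt="wav"):
    # input_audio takes bare base64 data, not a data URL.
    data = base64.b64encode(raw).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}
```

For example, `[text_part("Describe this image"), image_part(open("cat.png", "rb").read())]` gives a content list for one user message (`cat.png` is a placeholder path).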

tool_choice: "auto" (default) · "none" · "required" (alias "any") · {"type":"function","function":{"name":"..."}} (or a bare tool name).
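
The accepted tool_choice spellings collapse to four cases. The function below is one way to normalise them on the client side, shown only to make the aliases concrete; it is not taken from the server, and `normalize_tool_choice` is a hypothetical name.

```python
def normalize_tool_choice(choice, tool_names):
    """Map the accepted tool_choice spellings onto one canonical form.

    Returns "auto", "none", "required", or
    {"type": "function", "function": {"name": ...}}.
    """
    if choice is None:
        return "auto"  # documented default
    if isinstance(choice, dict):
        name = choice.get("function", {}).get("name")
        if name not in tool_names:
            raise ValueError(f"unknown tool: {name!r}")
        return {"type": "function", "function": {"name": name}}
    if choice == "any":
        return "required"  # documented alias
    if choice in ("auto", "none", "required"):
        return choice
    if choice in tool_names:  # bare tool name
        return {"type": "function", "function": {"name": choice}}
    raise ValueError(f"unsupported tool_choice: {choice!r}")
```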

# Text
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Introduce yourself"}]}'

# Image (data URL) + question
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":[
        {"type":"text","text":"Describe this image"},
        {"type":"image_url","image_url":{"url":"data:image/png;base64,iVBOR..."}}]}]}'

# Force a specific tool
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"What is the weather in Osaka?"}],
       "tools":[{"type":"function","function":{"name":"get_weather",
         "parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
       "tool_choice":{"type":"function","function":{"name":"get_weather"}}}'

With the OpenAI Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="qwen3-omni",
    messages=[{"role": "user", "content": "Hello"}])
print(r.choices[0].message.content)
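
When `MLX_QWEN3_OMNI_AUTH_TOKEN` is set, requests to `/v1/*` need a bearer token. The standard-library sketch below builds such a request without sending it; `build_chat_request` is a hypothetical helper and `"secret"` is a placeholder token.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, token=None, **params):
    """Build (but do not send) a POST /v1/chat/completions request."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"messages": messages, **params}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=body, headers=headers, method="POST"
    )


req = build_chat_request(
    "http://127.0.0.1:8000",
    [{"role": "user", "content": "Hello"}],
    token="secret",
    max_tokens=64,
)
# Send with: urllib.request.urlopen(req)
```

With the OpenAI SDK the same effect comes from passing the token as `api_key`, since the SDK sends that value as `Authorization: Bearer <api_key>`.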

Heaviest path: video + frames + audio → forced tool call

A worked example combining everything at once: a 45 s music video (video_url), its audio track (input_audio), 15 timestamp-labelled frames (image_url), a record_highlight tool, and tool_choice forcing that function. The model analyses all modalities and returns a single, well-formed tool call. This run used about 10k prompt tokens, took about 16 s, and peaked at about 28 to 29 GB of memory.

content = [
    {"type": "text", "text": "First, here is the full video and the song from that music video."},
    {"type": "video_url", "video_url": {"url": "data:video/mp4;base64,<...>"}},
    {"type": "input_audio", "input_audio": {"data": "<base64 wav>", "format": "wav"}},
    {"type": "text", "text": "Next, here are frames taken every 3 seconds, each labelled with its timestamp."},
    # repeated per frame, in order:
    {"type": "text", "text": "[t=0s]"},  {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<...>"}},
    # ... t=3, 6, 9, ... 42 ...
    {"type": "text", "text": "Call record_highlight with the timestamp (in seconds) of the most striking moment and the reason."},
]
r = client.chat.completions.create(
    model="qwen3-omni",
    messages=[{"role": "user", "content": content}],
    tools=[{"type": "function", "function": {
        "name": "record_highlight",
        "parameters": {"type": "object", "properties": {
            "timestamp_seconds": {"type": "integer"},
            "reason": {"type": "string"}}, "required": ["timestamp_seconds", "reason"]}}}],
    tool_choice={"type": "function", "function": {"name": "record_highlight"}},
    extra_body={"fps": 1},
)
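
The per-frame parts in the content list above can be generated rather than written out by hand. This is a small sketch assuming the frames are already encoded as data URLs; `labelled_frame_parts` is a hypothetical helper and the label text is free-form.

```python
def labelled_frame_parts(frame_urls, interval_s=3):
    """Interleave a timestamp label with each frame, in order."""
    parts = []
    for i, url in enumerate(frame_urls):
        parts.append({"type": "text", "text": f"[t={i * interval_s}s]"})
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return parts


# 15 frames at 3 s spacing cover t=0 through t=42 of a 45 s clip.
frame_parts = labelled_frame_parts(["data:image/jpeg;base64,<...>"] * 15)
```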

Response (choices[0].message.tool_calls[0]):

{
  "id": "call_8d0449a04e164491a150e069",
  "type": "function",
  "function": {
    "name": "record_highlight",
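    "arguments": "{\"timestamp_seconds\": 33, \"reason\": \"The character's waving gesture symbolises the warmth and friendliness of the whole video; it is the most striking moment, resonating with the warm-toned lighting, the wooden interior, and the gentle rhythm of the song\"}"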
  }
}
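
As in the OpenAI format, `arguments` is a JSON-encoded string, so it needs one more decoding step before use. The sketch below works on the dict form shown above (with the SDK the same fields are attributes, e.g. `call.function.arguments`). `parse_tool_call` is a hypothetical helper, and the `reason` text is shortened here.

```python
import json


def parse_tool_call(call, required=()):
    """Return (name, args) after decoding the JSON-encoded arguments string."""
    fn = call["function"]
    args = json.loads(fn["arguments"])
    missing = [k for k in required if k not in args]
    if missing:
        raise ValueError(f"{fn['name']}: missing arguments {missing}")
    return fn["name"], args


tool_call = {
    "id": "call_8d0449a04e164491a150e069",
    "type": "function",
    "function": {
        "name": "record_highlight",
        "arguments": '{"timestamp_seconds": 33, "reason": "..."}',
    },
}
name, args = parse_tool_call(tool_call, required=("timestamp_seconds", "reason"))
```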

Tip: under heavy multimodal load, the model may ignore a plain "please call the tool" instruction or emit a malformed call. Use tool_choice="required" or a specific function to force a clean, schema-correct call. Parallel calls (multiple <tool_call> blocks in one reply) are also supported: for example, asking for the weather in three cities yields three get_weather tool calls.
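
When a reply carries several tool calls, each one is typically executed and answered with its own `tool` message. The sketch below shows that loop on dict-shaped calls. It assumes the server accepts `tool`-role follow-up messages in the usual OpenAI way, which this README does not state. `dispatch_tool_calls` and the `get_weather` stub are hypothetical.

```python
import json


def dispatch_tool_calls(tool_calls, registry):
    """Run each requested tool and build the `tool` messages to send back."""
    messages = []
    for call in tool_calls:
        fn = registry[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return messages


def get_weather(location):
    # Hypothetical stub; a real tool would query a weather service.
    return {"location": location, "forecast": "unknown"}


calls = [
    {"id": f"call_{i}", "type": "function",
     "function": {"name": "get_weather", "arguments": json.dumps({"location": city})}}
    for i, city in enumerate(["Osaka", "Tokyo", "Sapporo"])
]
tool_messages = dispatch_tool_calls(calls, {"get_weather": get_weather})
```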

GET /help returns a machine-readable usage guide. GET /health returns {status, warm, queue_len, model}.

Notes & limitations

Text output only — Qwen3-Omni can generate speech, but that path is not exposed here.
Video with sound is handled automatically: if an input video has an audio track, the server demuxes it with ffmpeg and feeds the audio alongside the video frames, so the model both sees and hears the clip (toggle with use_audio_in_video).
Only one audio input per request is used (an mlx-vlm limitation; extras are ignored). A video's demuxed audio counts toward this limit, so don't pass a separate audio part alongside a video that has sound.
Long or large videos cost many tokens and a lot of memory; keep them short, lower fps, or both.
No streaming: a request with stream:true returns HTTP 400.
Bundles a small compatibility shim (compat.py) so image + video simultaneous input works on mlx-vlm 0.6.1 + mlx 0.31.x.
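
Since extra audio inputs are silently ignored, a client-side check before sending can catch accidental duplicates. The sketch below counts the parts that may contribute audio. It gives an upper bound, because a silent video adds nothing and the client cannot tell without probing the file. `count_audio_sources` is a hypothetical helper.

```python
AUDIO_TYPES = {"input_audio", "audio", "audio_url"}
VIDEO_TYPES = {"video_url", "video", "input_video"}


def count_audio_sources(content, use_audio_in_video=True):
    """Upper bound on the audio streams one message's content could supply."""
    if isinstance(content, str):
        return 0
    n = sum(1 for p in content if p.get("type") in AUDIO_TYPES)
    if use_audio_in_video:
        # Each video may carry a sound track that gets demuxed as audio.
        n += sum(1 for p in content if p.get("type") in VIDEO_TYPES)
    return n
```

A result above 1 means some audio may be dropped; either remove the separate audio part or turn off use_audio_in_video.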

License

MIT
