OpenAI-compatible HTTP server for Qwen3-Omni on Apple Silicon, wrapping mlx-vlm.
One endpoint — POST /v1/chat/completions — accepts text + image + audio +
video in a single message and returns text and/or tool calls. A single
in-process worker runs generations one at a time (MLX is memory-heavy and
thread-affine); concurrent requests queue. The model is loaded once and kept
resident. Sync only (no async jobs).
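
The single-worker model described above can be pictured with a short sketch. This is not the server's actual code: `SingleFlightWorker` and `fake_generate` are hypothetical names, and a plain Python thread stands in for the MLX worker. The point is only that every request goes through one FIFO queue and one thread, so generations cannot overlap.

```python
import queue
import threading
from concurrent.futures import Future


class SingleFlightWorker:
    """Run submitted jobs one at a time on a single dedicated thread."""

    def __init__(self, generate):
        self._generate = generate      # blocking generation function (hypothetical)
        self._jobs = queue.Queue()     # FIFO of (prompt, future) pairs
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, prompt):
        """Enqueue a job and return a Future the caller can wait on."""
        fut = Future()
        self._jobs.put((prompt, fut))
        return fut

    def queue_len(self):
        """Approximate number of jobs still waiting."""
        return self._jobs.qsize()

    def _run(self):
        while True:
            prompt, fut = self._jobs.get()
            try:
                fut.set_result(self._generate(prompt))
            except Exception as exc:   # hand errors back to the waiting caller
                fut.set_exception(exc)


def fake_generate(prompt):
    # Stand-in for a model call; also reports which thread executed it.
    return f"{prompt}!", threading.current_thread().name


worker = SingleFlightWorker(fake_generate)
futures = [worker.submit(p) for p in ("a", "b", "c")]
results = [f.result(timeout=5) for f in futures]
```

Because the HTTP layer only enqueues work and waits on a future, it can keep answering lightweight requests such as `GET /health` while a generation is running.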
- OpenAI Chat Completions compatible: `messages`, `tools`, `tool_choice`. Use any OpenAI client; point `base_url` at this server.
- Multimodal in: text, images, audio, and video, interleaved in `content`. Output is text only (Qwen3-Omni's speech/Talker output is not exposed).
- Tool calling: selection from multiple tools, parallel calls, and forced calls via `tool_choice` (`auto` / `none` / `required` / a specific function).
- Single-flight queue: generations never overlap; the API stays responsive.
- Apple Silicon (MLX), ~32 GB+ unified memory recommended for the 4-bit model.
- `ffmpeg` on `PATH` (used to decode audio/video).
- Weights are downloaded on first use from the configured repo (default `mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit`, ~22 GB) and cached by Hugging Face.
uv run mlx-qwen3-omni-server
# or
uv run python -m mlx_qwen3_omni_server

On startup the model is loaded and a tiny warmup generation primes the MLX kernels. `GET /health` reports `warm`.
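
A client script may want to wait for the warmup to finish before sending its first request. The following is a minimal sketch that polls `GET /health` until the `warm` field is truthy. The helper names, the 600 s timeout, and the 2 s polling interval are arbitrary choices for illustration, not part of the server.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://127.0.0.1:8000"):
    """GET /health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_warm(fetch=fetch_health, timeout=600.0, interval=2.0, sleep=time.sleep):
    """Poll until the health payload reports warm, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch().get("warm"):
                return True
        except OSError:
            pass  # server is not accepting connections yet (URLError is an OSError)
        if time.monotonic() >= deadline:
            raise TimeoutError("server did not report warm in time")
        sleep(interval)
```

On a first run the weights still have to be downloaded, so a generous timeout is sensible.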
| Variable | Default | Meaning |
|---|---|---|
| `MLX_QWEN3_OMNI_MODEL_REPO` | `mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit` | Model repo |
| `MLX_QWEN3_OMNI_HOST` / `_PORT` | `127.0.0.1` / `8000` | Bind address |
| `MLX_QWEN3_OMNI_AUTH_TOKEN` | (unset) | If set, `/v1/*` requires `Authorization: Bearer <token>` |
| `MLX_QWEN3_OMNI_WARMUP` | `1` | Warm up at startup |
| `MLX_QWEN3_OMNI_MAX_TOKENS` | `512` | Default `max_tokens` |
| `MLX_QWEN3_OMNI_MAX_TOKENS_CAP` | `4096` | Hard cap on `max_tokens` |
| `MLX_QWEN3_OMNI_TEMPERATURE` | `0.7` | Default `temperature` |
| `MLX_QWEN3_OMNI_TOP_P` | `1.0` | Default `top_p` |
| `MLX_QWEN3_OMNI_FPS` | `1.0` | Frames/sec sampled from input videos |
| `MLX_QWEN3_OMNI_USE_AUDIO_IN_VIDEO` | `1` | If a video has sound, demux its audio and feed it as audio too |
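
To make the defaults and the cap concrete, here is a sketch of how such variables could be read and applied. It is not taken from the server: `load_settings` and `effective_max_tokens` are hypothetical helpers, treating `"1"` as "on" for the flags is an assumption, and so is the choice to clamp an over-large `max_tokens` rather than reject it.

```python
import os

PREFIX = "MLX_QWEN3_OMNI_"

# (suffix, default, converter), mirroring the table above.
_FIELDS = [
    ("MODEL_REPO", "mlx-community/Qwen3-Omni-30B-A3B-Instruct-4bit", str),
    ("HOST", "127.0.0.1", str),
    ("PORT", "8000", int),
    ("WARMUP", "1", lambda v: v == "1"),              # assumption: "1" means enabled
    ("MAX_TOKENS", "512", int),
    ("MAX_TOKENS_CAP", "4096", int),
    ("TEMPERATURE", "0.7", float),
    ("TOP_P", "1.0", float),
    ("FPS", "1.0", float),
    ("USE_AUDIO_IN_VIDEO", "1", lambda v: v == "1"),  # assumption: "1" means enabled
]


def load_settings(env=None):
    """Read the MLX_QWEN3_OMNI_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    settings = {s.lower(): conv(env.get(PREFIX + s, d)) for s, d, conv in _FIELDS}
    settings["auth_token"] = env.get(PREFIX + "AUTH_TOKEN") or None
    return settings


def effective_max_tokens(requested, settings):
    """Use the default when the request omits max_tokens, then clamp to the cap."""
    wanted = settings["max_tokens"] if requested is None else requested
    return min(wanted, settings["max_tokens_cap"])
```

With the defaults, a request that omits `max_tokens` would generate at most 512 tokens, and one asking for 10000 would be limited to 4096 under this clamping assumption.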
OpenAI Chat Completions request. content may be a string or a list of parts:
| Part | Shape | Aliases |
|---|---|---|
| text | `{"type":"text","text":"..."}` | |
| image | `{"type":"image_url","image_url":{"url":"data:image/png;base64,..."}}` | `image`, `input_image`; `url` may be http(s) or a local path |
| audio | `{"type":"input_audio","input_audio":{"data":"<base64>","format":"wav"}}` | `audio`, `audio_url` |
| video | `{"type":"video_url","video_url":{"url":"data:video/mp4;base64,..."}}` | `video`, `input_video`; `url` may be http(s) or a local path |
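
Building these parts by hand is mostly base64 bookkeeping. The helpers below are a client-side sketch (the function names are made up for this example) that turn local files into the canonical shapes from the table. MIME types are guessed from the file extension, which is an assumption that holds for common formats such as `.png`, `.jpg`, and `.mp4`.

```python
import base64
import mimetypes
from pathlib import Path


def _b64(path):
    """Return the file's bytes as an ASCII base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def _data_url(path):
    """Encode a file as a data: URL, guessing the MIME type from its name."""
    mime = mimetypes.guess_type(Path(path).name)[0] or "application/octet-stream"
    return f"data:{mime};base64,{_b64(path)}"


def text_part(text):
    return {"type": "text", "text": text}


def image_part(path):
    return {"type": "image_url", "image_url": {"url": _data_url(path)}}


def video_part(path):
    return {"type": "video_url", "video_url": {"url": _data_url(path)}}


def audio_part(path, fmt="wav"):
    # Audio carries raw base64 plus a format tag rather than a data: URL.
    return {"type": "input_audio", "input_audio": {"data": _b64(path), "format": fmt}}
```

A message body is then just a list such as `[text_part("Describe this"), image_part("photo.png")]`.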
tool_choice: "auto" (default) · "none" · "required" (alias "any") ·
{"type":"function","function":{"name":"..."}} (or a bare tool name).
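
The accepted `tool_choice` forms collapse to four cases. The function below is an illustrative sketch of that equivalence, not the server's implementation; `normalize_tool_choice` is a hypothetical name, and raising `ValueError` for an unknown tool is an assumption about reasonable behaviour.

```python
def normalize_tool_choice(choice, tool_names):
    """Map each accepted form onto "auto", "none", "required", or a function dict."""
    if choice is None:
        return "auto"                       # documented default
    if isinstance(choice, dict):
        name = choice.get("function", {}).get("name")
    elif choice in ("auto", "none", "required"):
        return choice
    elif choice == "any":
        return "required"                   # documented alias
    else:
        name = choice                       # bare tool name
    if name not in tool_names:
        raise ValueError(f"unknown tool: {name!r}")
    return {"type": "function", "function": {"name": name}}
```

One consequence of accepting bare names is that a tool literally called `auto`, `none`, `required`, or `any` could only be forced through the explicit dict form.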
# Text
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"自己紹介して"}]}'
# Image (data URL) + question
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":[
{"type":"text","text":"この画像を説明して"},
{"type":"image_url","image_url":{"url":"data:image/png;base64,iVBOR..."}}]}]}'
# Force a specific tool
curl -s http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"大阪の天気は?"}],
"tools":[{"type":"function","function":{"name":"get_weather",
"parameters":{"type":"object","properties":{"location":{"type":"string"}},"required":["location"]}}}],
"tool_choice":{"type":"function","function":{"name":"get_weather"}}}'

With the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="qwen3-omni",
messages=[{"role": "user", "content": "こんにちは"}])
print(r.choices[0].message.content)

A worked example combining everything at once: a 45 s music video (`video_url`), its audio track (`input_audio`), 15 timestamp-labelled frames (`image_url`), a `record_highlight` tool, and `tool_choice` forcing that function. The model analyses all modalities and returns a single, well-formed tool call. The run used ~10 k prompt tokens and took ~16 s, with memory peaking at ~28–29 GB.
content = [
{"type": "text", "text": "まず動画全体と、そのPVの楽曲です。"},
{"type": "video_url", "video_url": {"url": "data:video/mp4;base64,<...>"}},
{"type": "input_audio", "input_audio": {"data": "<base64 wav>", "format": "wav"}},
{"type": "text", "text": "次に、3秒間隔の秒数ラベル付きフレームです。"},
# repeated per frame, in order:
{"type": "text", "text": "【t=0秒】"}, {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<...>"}},
# ... t=3, 6, 9, ... 42 ...
{"type": "text", "text": "最も印象的な瞬間の秒数と理由を引数に record_highlight を呼んで。"},
]
r = client.chat.completions.create(
model="qwen3-omni",
messages=[{"role": "user", "content": content}],
tools=[{"type": "function", "function": {
"name": "record_highlight",
"parameters": {"type": "object", "properties": {
"timestamp_seconds": {"type": "integer"},
"reason": {"type": "string"}}, "required": ["timestamp_seconds", "reason"]}}}],
tool_choice={"type": "function", "function": {"name": "record_highlight"}},
extra_body={"fps": 1},
)

Response (`choices[0].message.tool_calls[0]`):
{
"id": "call_8d0449a04e164491a150e069",
"type": "function",
"function": {
"name": "record_highlight",
"arguments": "{\"timestamp_seconds\": 33, \"reason\": \"キャラクターが手を振る仕草が映像全体の温かみと親しみやすさを象徴し、暖色系の照明と木造の内装、穏やかな楽曲のリズムと響き合う最も印象的な瞬間\"}"
}
}

Tip: under heavy multimodal load, a plain "please call the tool" instruction may be ignored or may yield a malformed call. Use `tool_choice="required"` or a specific function to force a clean, schema-correct call. Parallel calls (multiple `<tool_call>` blocks in one reply) are supported too; for example, asking for the weather in three cities yields three `get_weather` tool calls.
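
On the client side, parallel calls arrive as several entries in `message.tool_calls`, each with its `arguments` encoded as a JSON string. A minimal sketch for executing them follows. `run_tool_calls` and the `registry` mapping are hypothetical names, and whether this server accepts the resulting `tool` messages in a follow-up turn is not covered above, so treat that part as an assumption borrowed from the OpenAI message format.

```python
import json


def run_tool_calls(tool_calls, registry):
    """Execute each tool call in order and return OpenAI-style `tool` messages."""
    messages = []
    for call in tool_calls or []:                    # tool_calls may be None
        fn = registry[call.function.name]            # look up the local implementation
        args = json.loads(call.function.arguments)   # arguments arrive as a JSON string
        result = fn(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages
```

Usage would look like `run_tool_calls(r.choices[0].message.tool_calls, {"get_weather": get_weather})`, where `get_weather` is your own function.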
GET /help returns a machine-readable usage guide. GET /health returns
{status, warm, queue_len, model}.
- Text output only — Qwen3-Omni can generate speech, but that path is not exposed here.
- Sounded video is auto-handled: if an input video has an audio track, the server demuxes it with `ffmpeg` and feeds it as audio and video, so the model both sees and hears the clip (toggle with `use_audio_in_video`).
- One audio input per request is used (mlx-vlm limitation; extras ignored). A video's demuxed audio counts toward this, so don't also pass a separate audio input alongside a sounded video.
- Long/large videos cost many tokens and a lot of memory; keep them short and/or lower `fps`.
- No streaming (`stream: true` → 400).
- Bundles a small compatibility shim (`compat.py`) so image + video simultaneous input works on mlx-vlm 0.6.1 + mlx 0.31.x.
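
Two of the notes above (keep videos short; avoid a second audio source) can be handled before upload by preprocessing the clip with `ffmpeg`. The sketch below only builds and runs a standard `ffmpeg` command line. The helper names are hypothetical, and stream copy (`-c copy`) is chosen to avoid re-encoding, at the cost of less precise cut points.

```python
import subprocess


def ffmpeg_trim_cmd(src, dst, seconds, keep_audio=True):
    """Build an ffmpeg command that keeps only the first `seconds` of `src`."""
    cmd = ["ffmpeg", "-y", "-i", str(src), "-t", str(seconds)]
    if not keep_audio:
        cmd.append("-an")              # drop the audio track from the output
    cmd += ["-c", "copy", str(dst)]    # stream copy: no re-encode
    return cmd


def trim(src, dst, seconds, keep_audio=True):
    """Run the trim; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_trim_cmd(src, dst, seconds, keep_audio), check=True)
```

For instance, `trim("pv.mp4", "pv_20s_silent.mp4", 20, keep_audio=False)` would produce a silent 20 s clip, which leaves the single audio slot free for a separately supplied track.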
MIT