
# CLI Reference

## Commands Overview

| Command | Description |
|---------|-------------|
| `vllm-mlx serve` | Start OpenAI-compatible server |
| `vllm-mlx-bench` | Run performance benchmarks |
| `vllm-mlx-chat` | Start Gradio chat interface |

## `vllm-mlx serve`

Start the OpenAI-compatible API server.

### Usage

```bash
vllm-mlx serve <model> [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--port` | Server port | `8000` |
| `--host` | Server host | `0.0.0.0` |
| `--api-key` | API key for authentication | None |
| `--rate-limit` | Requests per minute per client (`0` = disabled) | `0` |
| `--timeout` | Request timeout in seconds | `300` |
| `--continuous-batching` | Enable batching for multi-user serving | `False` |
| `--cache-memory-mb` | Cache memory limit in MB | auto |
| `--cache-memory-percent` | Fraction of RAM for cache | `0.20` |
| `--no-memory-aware-cache` | Use legacy entry-count cache | `False` |
| `--use-paged-cache` | Enable paged KV cache | `False` |
| `--max-tokens` | Default max tokens | `32768` |
| `--stream-interval` | Tokens per stream chunk | `1` |
| `--mcp-config` | Path to MCP config file | None |
| `--paged-cache-block-size` | Tokens per cache block | `64` |
| `--max-cache-blocks` | Maximum cache blocks | `1000` |
| `--max-num-seqs` | Max concurrent sequences | `256` |
| `--default-temperature` | Default temperature when not specified in the request | None |
| `--default-top-p` | Default `top_p` when not specified in the request | None |
| `--reasoning-parser` | Parser for reasoning models (`qwen3`, `deepseek_r1`) | None |
| `--embedding-model` | Pre-load an embedding model at startup | None |
| `--enable-auto-tool-choice` | Enable automatic tool calling | `False` |
| `--tool-call-parser` | Tool call parser (`auto`, `mistral`, `qwen`, `llama`, `hermes`, `deepseek`, `kimi`, `granite`, `nemotron`, `xlam`, `functionary`, `glm47`) | None |
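
A model pre-loaded with `--embedding-model` should be reachable through the standard OpenAI embeddings endpoint, since the server is OpenAI-compatible. A minimal sketch, assuming `/v1/embeddings` is exposed and no `--api-key` is set; the model name below is a placeholder, not taken from these docs:

```python
from openai import OpenAI

# The SDK requires a non-empty key even when the server enforces none.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Placeholder model name: use whatever you passed to --embedding-model.
resp = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["vllm-mlx can serve embeddings alongside chat"],
)
print(len(resp.data[0].embedding))  # dimensionality of the returned vector
```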

### Examples

```bash
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --continuous-batching

# With memory limit for large models
vllm-mlx serve mlx-community/GLM-4.7-Flash-4bit \
  --continuous-batching \
  --cache-memory-mb 2048

# Production with paged cache
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
  --continuous-batching \
  --use-paged-cache \
  --port 8000

# With MCP tools
vllm-mlx serve mlx-community/Qwen3-4B-4bit --mcp-config mcp.json

# Multimodal model
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit

# Reasoning model (separates thinking from answer)
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# DeepSeek reasoning model
vllm-mlx serve mlx-community/DeepSeek-R1-Distill-Qwen-7B-4bit --reasoning-parser deepseek_r1

# Tool calling with Mistral/Devstral
vllm-mlx serve mlx-community/Devstral-Small-2507-4bit \
  --enable-auto-tool-choice --tool-call-parser mistral

# Tool calling with Granite
vllm-mlx serve mlx-community/granite-4.0-tiny-preview-4bit \
  --enable-auto-tool-choice --tool-call-parser granite

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --api-key your-secret-key

# Production setup with security options
vllm-mlx serve mlx-community/Qwen3-4B-4bit \
  --api-key your-secret-key \
  --rate-limit 60 \
  --timeout 120 \
  --continuous-batching
```
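
Once a server is running, any OpenAI SDK client can talk to it. A minimal sketch against the first example above, assuming no `--api-key` is set (the key is a dummy value the SDK requires):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vllm-mlx server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)

# Streaming works the same way; chunk granularity is governed by --stream-interval.
stream = client.chat.completions.create(
    model="mlx-community/Llama-3.2-3B-Instruct-4bit",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```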

### Security

When `--api-key` is set, all API requests require the `Authorization: Bearer <api-key>` header:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",  # must match --api-key
)
```

Or with curl:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-key"
```

## `vllm-mlx-bench`

Run performance benchmarks.

### Usage

```bash
vllm-mlx-bench --model <model> [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model name | required |
| `--prompts` | Number of prompts | `5` |
| `--max-tokens` | Max tokens per prompt | `256` |
| `--quick` | Quick benchmark mode | `False` |
| `--video` | Run video benchmark | `False` |
| `--video-url` | Custom video URL | None |
| `--video-path` | Custom video path | None |

### Examples

```bash
# LLM benchmark
vllm-mlx-bench --model mlx-community/Llama-3.2-1B-Instruct-4bit

# Quick benchmark
vllm-mlx-bench --model mlx-community/Llama-3.2-1B-Instruct-4bit --quick

# Image benchmark (auto-detected for VLM models)
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit

# Video benchmark
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit --video

# Custom video
vllm-mlx-bench --model mlx-community/Qwen3-VL-8B-Instruct-4bit \
  --video --video-url https://example.com/video.mp4
```
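
The benchmark reports its own numbers; for a rough, independent throughput check against a running `vllm-mlx serve` instance, a hand-timed sketch like this works. It assumes the server populates the standard `usage` field of the OpenAI response schema:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="mlx-community/Llama-3.2-1B-Instruct-4bit",
    messages=[{"role": "user", "content": "Write a short paragraph about the ocean."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# usage.completion_tokens is part of the OpenAI schema; assumed filled in here.
generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tok/s")
```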

## `vllm-mlx-chat`

Start the Gradio chat interface.

### Usage

```bash
vllm-mlx-chat --model <model> [options]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--model` | Model name | required |
| `--port` | Gradio port | `7860` |
| `--text-only` | Disable multimodal input | `False` |

### Examples

```bash
# Multimodal chat (text + images + video)
vllm-mlx-chat --model mlx-community/Qwen3-VL-4B-Instruct-3bit

# Text-only chat
vllm-mlx-chat --model mlx-community/Llama-3.2-3B-Instruct-4bit --text-only
```

## Environment Variables

| Variable | Description |
|----------|-------------|
| `VLLM_MLX_TEST_MODEL` | Model used for tests |
| `HF_TOKEN` | Hugging Face access token |