- Mac with Apple Silicon (M1, M2, M3, M4, or M5 — any variant)
- macOS 14+ (Sonoma or newer)
- Python 3.10+
- At least 16GB RAM (more = better performance)
# Clone the repo
git clone https://github.com/szibis/MLX-Flash.git
cd MLX-Flash
# Create virtual environment
uv venv && source .venv/bin/activate
# Install dependencies
uv pip install lz4 zstandard numpy psutil tabulate pytest mlx mlx-lm
# Build C acceleration library (optional but recommended)
make -C csrc install
python -m mlx_flash_compress.hardware
This shows your Mac's specs and what models you can run:
Detected: Apple M3 Max, 36GB RAM, 1TB SSD
Model                 Fits?   Hit%   tok/s
Qwen MoE (5GB)        YES     100%   115
Mixtral-8x7B (26GB)   YES     100%   16
DeepSeek-V3 (170GB)   NO      68%    3.7
# Small model (downloads ~5GB, fits in RAM)
python -m mlx_flash_compress.run \
--model mlx-community/Qwen3-30B-A3B-4bit \
--tokens 100
# With task-specific optimization
python -m mlx_flash_compress.run \
--model mlx-community/Qwen3-30B-A3B-4bit \
--task coding \
--tokens 100
# With adaptive profiling (learns what you need)
python -m mlx_flash_compress.run \
--model mlx-community/Qwen3-30B-A3B-4bit \
--adaptive \
--tokens 200
# For a specific model size on your hardware
python -m mlx_flash_compress.tier_optimizer \
--total-ram 36 --model-gb 209 --layers 60 --experts 512
# Output: optimal RAM/SSD split, expected tok/s, cache hit rate
flowchart LR
A[Install] --> B[Check Hardware]
B --> C{Model fits?}
C -->|Yes, easily| D[python -m mlx_flash_compress.chat]
C -->|Barely fits| E[Enable mixed precision]
C -->|Too large| F[Enable SSD streaming]
E --> D
F --> G[python -m mlx_flash_compress.serve]
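The decision in the flowchart can be sketched as a small helper. This is only an illustration: the function name `pick_launch_path` and the 0.8 threshold separating "fits easily" from "barely fits" are assumptions, not values taken from MLX-Flash. The only hint in this guide is that a model at 0.9x RAM counts as a tight fit.

```python
def pick_launch_path(model_gb: float, ram_gb: float, tight_ratio: float = 0.8) -> dict:
    """Map a model size and RAM size onto the three flowchart branches.

    tight_ratio is a hypothetical cut-off: below it the model "fits easily",
    between it and 1.0 it "barely fits", above 1.0 it is "too large".
    """
    if model_gb <= 0 or ram_gb <= 0:
        raise ValueError("model_gb and ram_gb must be positive")

    ratio = model_gb / ram_gb
    if ratio < tight_ratio:
        # Fits easily: go straight to the chat entry point.
        return {"fit": "easily", "enable": None,
                "command": "python -m mlx_flash_compress.chat"}
    if ratio <= 1.0:
        # Barely fits: enable mixed precision first, then chat.
        return {"fit": "barely", "enable": "mixed precision",
                "command": "python -m mlx_flash_compress.chat"}
    # Too large: enable SSD streaming and use the API server.
    return {"fit": "too large", "enable": "SSD streaming",
            "command": "python -m mlx_flash_compress.serve"}


if __name__ == "__main__":
    for name, size_gb in [("Qwen MoE", 5), ("Mixtral-8x7B", 26), ("DeepSeek-V3", 170)]:
        print(name, pick_launch_path(size_gb, ram_gb=36))
```

The hardware check (python -m mlx_flash_compress.hardware) remains the authoritative answer for your machine; the sketch only mirrors the branching logic.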
The Rust binary is a single entry point that manages everything:
# Simplest — auto-selects model, launches Python worker, serves on :8080
mlx-flash-server --port 8080
# Specify model + number of workers
mlx-flash-server --port 8080 --model mlx-community/Qwen3-30B-A3B-4bit --workers 2
# With model preloading (loads into GPU before accepting requests)
mlx-flash-server --port 8080 --model mlx-community/Qwen3-30B-A3B-4bit --preload
# JSON structured logs + file output
mlx-flash-server --port 8080 --log-format json --log-file /var/log/mlx-flash.log
# Connect to existing Python worker (don't launch one)
mlx-flash-server --port 8080 --no-launch-worker --python-port 8081
What happens on startup:
- Auto-detects Python venv (.venv*/ in project, VIRTUAL_ENV env, or system python3)
- Verifies mlx_flash_compress is importable (clear error if not installed)
- Launches N Python workers on ports 8081-808N
- Health-checks each worker until ready (up to 15s, or 120s with --preload)
- Starts Rust proxy on :8080, which routes requests to workers
- Background health checker runs every 10s and auto-restarts dead workers
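The port assignment and readiness wait described above can be sketched in Python. This is not the Rust launcher's code: the helper names, the 0.5s polling interval, and the use of the worker's /health endpoint as the readiness probe are assumptions made for illustration. Only the port range and the 15s/120s timeouts come from the list above.

```python
import time
import urllib.request
from typing import Callable, List


def worker_ports(workers: int, first_port: int = 8081) -> List[int]:
    """Ports for N workers, starting at 8081 as in the startup list."""
    return [first_port + i for i in range(workers)]


def http_health_probe(port: int) -> bool:
    """Assumed probe: GET /health on the worker and expect HTTP 200."""
    try:
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/health", timeout=2) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error, ...
        return False


def wait_until_ready(
    port: int,
    probe: Callable[[int], bool] = http_health_probe,
    preload: bool = False,
    interval_s: float = 0.5,  # hypothetical polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a worker until it is ready or the startup timeout expires."""
    timeout_s = 120.0 if preload else 15.0  # timeouts from the startup list
    deadline = clock() + timeout_s
    while True:
        if probe(port):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    for port in worker_ports(2):
        print(port, "ready" if wait_until_ready(port) else "not ready")
```

The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.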
Monitoring:
- Dashboard: http://localhost:8080/admin (live charts, worker management, logs)
- Chat: http://localhost:8080/chat
- Metrics: http://localhost:8080/metrics (Prometheus format)
- Grafana: docker compose --profile monitoring up -d → http://localhost:3000
Worker management (no restart needed):
curl -X POST http://localhost:8080/v1/models/switch -d '{"model":"mlx-community/Qwen3-8B-4bit"}'
curl -X POST http://localhost:8080/workers/restart -d '{"port":8081}'
curl -X POST http://localhost:8080/reload
curl -X POST http://localhost:8080/shutdown
# Set cache size (MB)
export FLASH_CACHE_RAM_MB=8192
# Enable/disable features
export FLASH_ENABLE_PREFETCH=1
export FLASH_MIXED_PRECISION=1
export FLASH_SKIP_FALLBACK=0
python -m mlx_flash_compress.run --model <path>
Create ~/.config/mlx-flash/config.json:
{
"cache": {
"enable": true,
"ram_mb": 0,
"eviction": "lcp",
"hot_algo": "lz4"
},
"prefetch": {
"enable": true,
"workers": 2
},
"mixed_precision": {
"enable": true,
"cold_bits": 2,
"hot_bits": 4
},
"skip_fallback": {
"enable": false
},
"ssd_protection": {
"enable": true,
"thermal_limit_c": 70
},
"engine": {
"backend": "auto"
}
}
Set ram_mb to 0 for auto-detection (uses 80% of available memory with safety margin).
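To make the auto-detection rule concrete, here is a minimal sketch of how a cache size could be resolved from the environment variable, the config file, and available memory. Several details are assumptions rather than documented behavior: that FLASH_CACHE_RAM_MB takes precedence over the file, that the safety margin is 2GB, and that the margin is subtracted before the 80% factor is applied. The helper names are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

CONFIG_PATH = Path.home() / ".config" / "mlx-flash" / "config.json"


def load_config(path: Path = CONFIG_PATH) -> dict:
    """Return the parsed config file, or an empty dict if it does not exist."""
    return json.loads(path.read_text()) if path.exists() else {}


def resolve_cache_mb(
    config: Mapping,
    available_mb: int,
    env: Optional[Mapping[str, str]] = None,
    fraction: float = 0.8,   # "80% of available memory"
    margin_mb: int = 2048,   # assumed 2GB safety margin
) -> int:
    """Pick a cache size in MB: env var, then explicit ram_mb, then auto."""
    env = os.environ if env is None else env

    # Assumption: the environment variable overrides the config file.
    if "FLASH_CACHE_RAM_MB" in env:
        return int(env["FLASH_CACHE_RAM_MB"])

    ram_mb = int(config.get("cache", {}).get("ram_mb", 0))
    if ram_mb > 0:
        return ram_mb

    # ram_mb == 0 means auto-detect: a fraction of what is left after the margin.
    return max(0, int((available_mb - margin_mb) * fraction))


if __name__ == "__main__":
    # 20GB available, ram_mb left at 0: (20480 - 2048) * 0.8, truncated to whole MB
    print(resolve_cache_mb(load_config(), available_mb=20480, env={}))
```

On a real machine, `psutil.virtual_memory().available // 2**20` gives a value for `available_mb`; psutil is already in the dependency list above.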
python -m pytest tests/ -v
# Expected: 89+ passed
The Rust sidecar provides faster memory monitoring, SSE streaming, and expert caching.
cargo build --release -p mlx-flash-server
./mlx-flash-server/target/release/mlx-flash-server --launch-worker --preload --port 8080
./mlx-flash-server/target/release/mlx-flash-server \
--launch-worker --preload \
--expert-dir /path/to/experts \
--cache-mb 512 \
--socket-path /tmp/mlx-flash-cache.sock
docker build -t mlx-flash .
docker run mlx-flash
# Runs synthetic benchmarks (MLX inference requires native macOS)
"MLX not available": You need an Apple Silicon Mac. Intel Macs don't support MLX.
"Model download fails": Set the HF_TOKEN environment variable for Hugging Face authentication:
export HF_TOKEN=hf_your_token_here
"libfastcache.dylib not found": Build it:
make -C csrc install
"Out of memory": Reduce cache size:
python -m mlx_flash_compress.run --model <path> --cache-mb 2048
The simplest way to use MLX-Flash:
python -m mlx_flash_compress.chat
Shows real-time memory status and tok/s per response, and warns when RAM is tight. Type /status to see memory info or /clear to reset the conversation.
Start the OpenAI-compatible API server:
python -m mlx_flash_compress.serve --model mlx-community/Qwen3-30B-A3B-4bit --port 8080
- Open LM Studio
- Go to Settings -> Server
- Set custom endpoint: http://localhost:8080/v1
- Chat normally: our server handles inference and memory management
Add to your ~/.continue/config.json:
{
"models": [{
"title": "Local MoE",
"provider": "openai",
"model": "local",
"apiBase": "http://localhost:8080/v1",
"apiKey": "not-needed"
}]
}
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat API (OpenAI-compatible) |
| /v1/models | GET | List available models |
| /status | GET | Memory, pressure, cache stats |
| /health | GET | Health check |
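If you would rather not install the openai package, the chat endpoint in the table can also be called with the standard library alone. The sketch below assumes the server accepts the usual OpenAI-style request body (model plus messages) and returns the usual response shape. The helper names `build_chat_request` and `send` are illustrative and not part of MLX-Flash.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080",
    model: str = "local",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Requires a running server on :8080.
    reply = send(build_chat_request("Hello!"))
    print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it makes it easy to inspect the URL and payload before anything touches the network.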
Ollama uses llama.cpp as its backend, not MLX. Two options:
- Run our server alongside: Our API server at :8080, Ollama at :11434. Use our server for MoE models that benefit from expert caching.
- Ollama with MLX backend: If Ollama adds MLX support in the future, our memory management layer can integrate.
The system automatically monitors your Mac's RAM:
# Check memory status anytime during chat
/status
# Or via the API
curl http://localhost:8080/status
What it does:
- Monitors macOS memory pressure in real-time
- Auto-sizes expert cache based on available RAM (2GB safety margin)
- Warns when pressure is critical ("close apps to prevent slowdown")
- Suggests actions: which apps to close, whether to use a smaller model
For models that barely fit in RAM (the sweet spot):
Mixed precision automatically reduces the model's memory footprint by ~20%:
- Hot experts stay at 4-bit (full quality)
- Cold experts compressed to 2-bit (minimal quality impact)
- Result: a model at 0.9x RAM goes from 43 tok/s -> 104 tok/s (measured)
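The ~20% figure can be sanity-checked with simple arithmetic. The sketch below is a simplification that treats every weight as an expert weight stored at hot_bits and ignores quantization metadata. The function name and the 40% cold fraction are hypothetical, chosen only because they reproduce a 20% reduction under these assumptions.

```python
def mixed_precision_gb(
    model_gb: float,
    cold_fraction: float,
    hot_bits: int = 4,
    cold_bits: int = 2,
) -> float:
    """Estimate the footprint after re-quantizing a fraction of experts.

    model_gb is the size with everything at hot_bits; cold_fraction is the
    share of weights moved down to cold_bits.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    # Hot weights keep their size; cold weights shrink by cold_bits / hot_bits.
    scale = (1.0 - cold_fraction) + cold_fraction * (cold_bits / hot_bits)
    return model_gb * scale


if __name__ == "__main__":
    before = 26.0  # e.g. a 26GB 4-bit model
    after = mixed_precision_gb(before, cold_fraction=0.4)
    print(f"{before:.1f} GB -> {after:.1f} GB ({1 - after / before:.0%} smaller)")
```

With 4-bit hot and 2-bit cold experts, each cold weight saves half its size, so the overall reduction is half the cold fraction.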
# Memory pressure analysis (the key measurement)
python -m mlx_flash_compress.bench_memory_pressure --tokens 50
# ISP-like warm-up demo (watch cache fill in real-time)
python -m mlx_flash_compress.demo_warmup --topics coding writing coding math
# Real model routing with cache simulation
python -m mlx_flash_compress.cached_inference --tokens 80 --multi-topic
- Try different models to see scaling behavior
- Use --task coding or --task writing for task-specific optimization
- Run python -m mlx_flash_compress.tier_optimizer to find optimal settings
- Check docs/integrations.md for Claude Code, LM Studio, Cursor, and Aider integration
- Check docs/technical-reference.md for deep implementation details