Getting Started

Prerequisites

Mac with Apple Silicon (M1, M2, M3, M4, or M5 — any variant)
macOS 14+ (Sonoma or newer)
Python 3.10+
At least 16GB RAM (more RAM improves performance)

Installation (2 minutes)

# Clone the repo
git clone https://github.com/szibis/MLX-Flash.git
cd MLX-Flash

# Create and activate a virtual environment (requires uv)
uv venv && source .venv/bin/activate

# Install dependencies
uv pip install lz4 zstandard numpy psutil tabulate pytest mlx mlx-lm

# Build C acceleration library (optional but recommended)
make -C csrc install

Your First Run

1. Check your hardware

python -m mlx_flash_compress.hardware

This shows your Mac's specs and which models you can run:

  Detected: Apple M3 Max, 36GB RAM, 1TB SSD

  Model                           Fits?  Hit%   tok/s
  Qwen MoE (5GB)                  YES    100%   115
  Mixtral-8x7B (26GB)             YES    100%    16
  DeepSeek-V3 (170GB)              NO     68%   3.7

2. Run with a model

# Small model (downloads ~5GB, fits in RAM)
python -m mlx_flash_compress.run \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --tokens 100

# With task-specific optimization
python -m mlx_flash_compress.run \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --task coding \
  --tokens 100

# With adaptive profiling (learns what you need)
python -m mlx_flash_compress.run \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --adaptive \
  --tokens 200

3. Find your optimal configuration

# For a specific model size on your hardware
python -m mlx_flash_compress.tier_optimizer \
  --total-ram 36 --model-gb 209 --layers 60 --experts 512

# Output: optimal RAM/SSD split, expected tok/s, cache hit rate

flowchart LR
    A[Install] --> B[Check Hardware]
    B --> C{Model fits?}
    C -->|Yes, easily| D[python -m mlx_flash_compress.chat]
    C -->|Barely fits| E[Enable mixed precision]
    C -->|Too large| F[Enable SSD streaming]
    E --> D
    F --> G[python -m mlx_flash_compress.serve]

Running the Server

The Rust binary is a single entry point that launches the Python workers, health-checks them, and proxies requests:

# Simplest — auto-selects model, launches Python worker, serves on :8080
mlx-flash-server --port 8080

# Specify model + number of workers
mlx-flash-server --port 8080 --model mlx-community/Qwen3-30B-A3B-4bit --workers 2

# With model preloading (loads into GPU before accepting requests)
mlx-flash-server --port 8080 --model mlx-community/Qwen3-30B-A3B-4bit --preload

# JSON structured logs + file output
mlx-flash-server --port 8080 --log-format json --log-file /var/log/mlx-flash.log

# Connect to existing Python worker (don't launch one)
mlx-flash-server --port 8080 --no-launch-worker --python-port 8081

What happens on startup:

Auto-detects the Python venv (.venv*/ in the project, the VIRTUAL_ENV environment variable, or system python3)
Verifies that mlx_flash_compress is importable (prints a clear error if it is not installed)
Launches N Python workers on ports 8081 through 8080+N
Health-checks each worker until it is ready (up to 15s, or 120s with --preload)
Starts the Rust proxy on :8080, which routes requests to the workers
Runs a background health check every 10s and auto-restarts dead workers
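
The port assignment and readiness wait described above can be sketched in Python. This is an illustration only, not the Rust binary's actual code: `worker_ports`, `ready_timeout_s`, `wait_until_ready`, and the injected `is_healthy` probe are hypothetical names, and the 0.5s poll interval is an assumption. Only the port range and the 15s/120s limits come from the text.

```python
import time
from typing import Callable, List

BASE_WORKER_PORT = 8081  # first worker port, per the startup list


def worker_ports(n_workers: int) -> List[int]:
    """Ports for N workers: 8081, 8082, ..., 8080+N."""
    return [BASE_WORKER_PORT + i for i in range(n_workers)]


def ready_timeout_s(preload: bool) -> float:
    """Health-check budget: 15s normally, 120s with --preload."""
    return 120.0 if preload else 15.0


def wait_until_ready(
    port: int,
    is_healthy: Callable[[int], bool],  # hypothetical probe, e.g. GET /health
    timeout_s: float,
    interval_s: float = 0.5,            # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the probe until it succeeds or the time budget runs out."""
    deadline = clock() + timeout_s
    while True:
        if is_healthy(port):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; a real probe would issue an HTTP request to the worker's health endpoint.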

Monitoring:

Dashboard: http://localhost:8080/admin (live charts, worker management, logs)
Chat: http://localhost:8080/chat
Metrics: http://localhost:8080/metrics (Prometheus format)
Grafana: docker compose --profile monitoring up -d → http://localhost:3000
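
If you want to read the /metrics output without Grafana, a small parser for the Prometheus text format is enough for quick checks. The sketch below is a simplified reading of that format (it skips comment lines and ignores timestamps); `parse_prometheus_text` and `fetch_metrics` are hypothetical helper names, and this guide does not list the actual metric names the server exports.

```python
import urllib.request
from typing import Dict


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Parse Prometheus text-format samples into {series: value}.

    Blank lines and '#' lines (HELP/TYPE) are skipped; an optional
    trailing timestamp is ignored. Simplified, not a full parser.
    """
    samples: Dict[str, float] = {}
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled series: name{label="value"} <value> [timestamp]
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            # Unlabelled series: name <value> [timestamp]
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])
    return samples


def fetch_metrics(url: str = "http://localhost:8080/metrics",
                  timeout: float = 5.0) -> Dict[str, float]:
    """Fetch and parse the metrics endpoint (requires a running server)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```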

Worker management (no restart needed):

curl -X POST http://localhost:8080/v1/models/switch -d '{"model":"mlx-community/Qwen3-8B-4bit"}'
curl -X POST http://localhost:8080/workers/restart -d '{"port":8081}'
curl -X POST http://localhost:8080/reload
curl -X POST http://localhost:8080/shutdown
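
The same calls can be made from Python with the standard library. This is a minimal sketch: `build_post` and `send` are hypothetical helpers, and setting `Content-Type: application/json` is an assumption, since the curl commands above send the body without an explicit content type.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"


def build_post(path: str, payload: Optional[dict] = None,
               base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request, JSON-encoding the payload if one is given."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else b""
    req = urllib.request.Request(base_url + path, data=data, method="POST")
    if payload is not None:
        # Assumption: the server accepts an explicit JSON content type.
        req.add_header("Content-Type", "application/json")
    return req


def send(req: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (requires a running server):
# send(build_post("/v1/models/switch", {"model": "mlx-community/Qwen3-8B-4bit"}))
# send(build_post("/workers/restart", {"port": 8081}))
# send(build_post("/reload"))
```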

Configuration

Quick: Environment variables

# Set cache size (MB)
export FLASH_CACHE_RAM_MB=8192

# Enable/disable features
export FLASH_ENABLE_PREFETCH=1
export FLASH_MIXED_PRECISION=1
export FLASH_SKIP_FALLBACK=0

python -m mlx_flash_compress.run --model <path>
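
For scripts that wrap the CLI, it can help to read these variables in one place. The sketch below only illustrates the variables shown above: `read_flash_env` is a hypothetical helper, treating "1" as enabled and any other value as disabled is an assumption, and unset variables are reported as `None` because this guide does not state the defaults.

```python
import os
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str) -> Optional[bool]:
    """Return True for "1", False for any other value, None if unset."""
    raw = env.get(name)
    if raw is None:
        return None
    return raw.strip() == "1"  # assumption: "1" means enabled


def read_flash_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the FLASH_* settings from the environment."""
    env = os.environ if env is None else env
    ram = env.get("FLASH_CACHE_RAM_MB")
    return {
        "cache_ram_mb": int(ram) if ram is not None else None,
        "prefetch": _flag(env, "FLASH_ENABLE_PREFETCH"),
        "mixed_precision": _flag(env, "FLASH_MIXED_PRECISION"),
        "skip_fallback": _flag(env, "FLASH_SKIP_FALLBACK"),
    }
```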

Full: Config file

Create ~/.config/mlx-flash/config.json:

{
  "cache": {
    "enable": true,
    "ram_mb": 0,
    "eviction": "lcp",
    "hot_algo": "lz4"
  },
  "prefetch": {
    "enable": true,
    "workers": 2
  },
  "mixed_precision": {
    "enable": true,
    "cold_bits": 2,
    "hot_bits": 4
  },
  "skip_fallback": {
    "enable": false
  },
  "ssd_protection": {
    "enable": true,
    "thermal_limit_c": 70
  },
  "engine": {
    "backend": "auto"
  }
}

Set ram_mb to 0 for auto-detection (uses 80% of available memory with a safety margin).
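
One way to read the auto-detection rule is sketched below. This is a guess at the arithmetic, not the project's implementation: `load_config` and `resolve_cache_mb` are hypothetical names, the 2GB margin is borrowed from the Memory Management section, and subtracting the margin before applying the 80% factor is an assumption.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".config" / "mlx-flash" / "config.json"


def load_config(path=CONFIG_PATH) -> dict:
    """Read the JSON config file and return it as a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def resolve_cache_mb(ram_mb: int, available_mb: int,
                     fraction: float = 0.8,    # 80% rule from the text
                     margin_mb: int = 2048) -> int:  # assumed 2GB margin
    """Return the cache size in MB; ram_mb == 0 means auto-detect."""
    if ram_mb > 0:
        return ram_mb  # explicit setting wins
    usable = max(0, available_mb - margin_mb)  # assumed order: margin first
    return int(usable * fraction)
```

For example, with 34816 MB available and `ram_mb` set to 0, this reading gives `int((34816 - 2048) * 0.8)`, or 26214 MB.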

Running Tests

python -m pytest tests/ -v
# Expected: 89+ passed

Rust Sidecar (optional, for production)

The Rust sidecar provides faster memory monitoring, SSE streaming, and expert caching.

Build

cargo build --release -p mlx-flash-server

Run

./mlx-flash-server/target/release/mlx-flash-server --launch-worker --preload --port 8080

With expert caching

./mlx-flash-server/target/release/mlx-flash-server \
  --launch-worker --preload \
  --expert-dir /path/to/experts \
  --cache-mb 512 \
  --socket-path /tmp/mlx-flash-cache.sock

Docker (for CI/testing only)

docker build -t mlx-flash .
docker run mlx-flash
# Runs synthetic benchmarks (MLX inference requires native macOS)

Troubleshooting

"MLX not available": You need an Apple Silicon Mac. Intel Macs don't support MLX.

"Model download fails": Set HF_TOKEN environment variable for Hugging Face authentication:

export HF_TOKEN=hf_your_token_here

"libfastcache.dylib not found": Build it:

make -C csrc install

"Out of memory": Reduce cache size:

python -m mlx_flash_compress.run --model <path> --cache-mb 2048

Interactive Chat

The simplest way to use MLX-Flash:

python -m mlx_flash_compress.chat

The chat shows real-time memory status and tok/s per response, and warns when RAM is tight. Type /status to see memory info or /clear to reset the conversation.

API Server (LM Studio, continue.dev, OpenAI SDK)

Start the OpenAI-compatible API server:

python -m mlx_flash_compress.serve --model mlx-community/Qwen3-30B-A3B-4bit --port 8080

Connect from LM Studio

Open LM Studio
Go to Settings -> Server
Set custom endpoint: http://localhost:8080/v1
Chat normally — our server handles inference + memory management

Connect from continue.dev (VS Code)

Add to your ~/.continue/config.json:

{
  "models": [{
    "title": "Local MoE",
    "provider": "openai",
    "model": "local",
    "apiBase": "http://localhost:8080/v1",
    "apiKey": "not-needed"
  }]
}

Connect from any OpenAI SDK client

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Server endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat API (OpenAI-compatible) |
| `/v1/models` | GET | List available models |
| `/status` | GET | Memory, pressure, cache stats |
| `/health` | GET | Health check |

Using with Ollama

Ollama uses llama.cpp as its backend, not MLX. Two options:

Run our server alongside: Our API server at :8080, Ollama at :11434. Use our server for MoE models that benefit from expert caching.
Ollama with MLX backend: If Ollama adds MLX support in the future, our memory management layer can integrate.

Memory Management

The system automatically monitors your Mac's RAM:

# Check memory status anytime during chat
/status

# Or via the API
curl http://localhost:8080/status

What it does:

Monitors macOS memory pressure in real-time
Auto-sizes expert cache based on available RAM (2GB safety margin)
Warns when pressure is critical ("close apps to prevent slowdown")
Suggests actions: which apps to close, whether to use a smaller model

For models that barely fit in RAM (the sweet spot):

Mixed precision automatically reduces the model's memory footprint by ~20%:

Hot experts stay at 4-bit (full quality)
Cold experts compressed to 2-bit (minimal quality impact)
Result: a model sized at 0.9x of RAM goes from 43 tok/s to 104 tok/s (measured)
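
As a back-of-the-envelope check on the ~20% figure, the sketch below computes the average bits per weight for a given share of cold experts. It is a simplification under stated assumptions: it counts only expert weights and ignores quantization scales, non-expert layers, and caches. The function names are hypothetical.

```python
def average_bits(cold_fraction: float, hot_bits: int = 4,
                 cold_bits: int = 2) -> float:
    """Average bits per expert weight when a fraction is held at cold_bits."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return hot_bits * (1.0 - cold_fraction) + cold_bits * cold_fraction


def footprint_reduction(cold_fraction: float, hot_bits: int = 4,
                        cold_bits: int = 2) -> float:
    """Fractional size reduction relative to keeping everything at hot_bits."""
    return 1.0 - average_bits(cold_fraction, hot_bits, cold_bits) / hot_bits
```

Under these assumptions, a 20% reduction corresponds to about 40% of expert weights being held at 2-bit: the average is 4 x 0.6 + 2 x 0.4 = 3.2 bits, and 3.2 / 4 = 0.8.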

Benchmarks

# Memory pressure analysis (the key measurement)
python -m mlx_flash_compress.bench_memory_pressure --tokens 50

# ISP-like warm-up demo (watch cache fill in real-time)
python -m mlx_flash_compress.demo_warmup --topics coding writing coding math

# Real model routing with cache simulation
python -m mlx_flash_compress.cached_inference --tokens 80 --multi-topic

What's Next

Try different models to see scaling behavior
Use --task coding or --task writing for task-specific optimization
Run python -m mlx_flash_compress.tier_optimizer to find optimal settings
Check docs/integrations.md for Claude Code, LM Studio, Cursor, Aider integration
Check docs/technical-reference.md for deep implementation details
