Local LLM inference at sub-second load times.
Drop-in for OpenAI, Ollama, and any LLM client.
Web chat UI · Tool calling · Batch scheduler · CLI
No API key. No cloud. No data leaving your machine.
Free.
⚠️ macOS + Apple Silicon (M1–M5) only. Linux/CUDA support is on the roadmap. Windows is not planned.
```sh
# Homebrew (recommended)
brew install wesleyscholl/squish/squish

# One-liner installer
curl -fsSL https://raw.githubusercontent.com/wesleyscholl/squish/main/install.sh | bash

# pip
pip install squish
```

```sh
squish catalog          # browse 29 available models
squish pull qwen3:8b    # download + compress once (~5 min)
squish run qwen3:8b     # start server on :11435
```

Then open http://localhost:11435/chat in any browser.
Or chat in the terminal:
```sh
squish chat qwen3:8b
```

Drop-in for any OpenAI or Ollama client:
```sh
export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
# or
export OLLAMA_HOST=http://localhost:11435
```

All flags above are Stable (Waves 1–12). For advanced optimization flags
([Beta] Waves 13–18 and [Experimental] Waves 19+), see MODULES.md.
Ollama and LM Studio are great tools. Squish solves a different problem.
| | Ollama | LM Studio | Squish |
|---|---|---|---|
| Cold-start load time | 8–25 s | 10–30 s | 0.33–0.53 s |
| RAM during load | ~2–8 GB | ~2–8 GB | 160 MB ⚡ |
| OpenAI-compatible API | ✅ | ✅ | ✅ |
| Ollama-compatible API | ✅ | ❌ | ✅ |
| Web chat UI | ❌ | ❌ | ✅ |
| Tool calling | ✅ | ✅ | ✅ |
| Batch/concurrent requests | limited | ❌ | ✅ |
| Works offline after pull | ✅ | ✅ | ✅ |
| Download pre-squished weights | N/A | N/A | ✅ (HuggingFace) |
| Apple Silicon–optimised | ✅ | ✅ | ✅ |
| INT8 npy-dir format (mmap) | ❌ | ❌ | ✅ |
| Source available | ✅ | ❌ | ✅ |
The key distinction: Ollama and LM Studio use standard GGUF/MLX weights that require a full dtype conversion on every boot. Squish stores weights in a Metal-native format that maps directly into unified memory: no conversion, sub-second every time.
⚡ 160 MB = Apple Metal virtual-address delta during the load phase (mmap, no CPU heap allocation). Peak RSS during full initialization is ~402 MB. Both figures measured on Apple Silicon M-series.
Model: Qwen2.5-1.5B-Instruct · Hardware: Apple Silicon M-series, MLX framework
| | Cold mlx_lm load† | Reference (mlx_lm) | Squish (cached) |
|---|---|---|---|
| Load time | 28.81s | 1.96s | 0.53s |
| RAM during load | ~2400 MB | ~2400 MB | 160 MB |
| Peak load RAM | ~2600 MB | ~2600 MB | 402 MB |
| Token cost | $0 (local) | $0 (local) | $0 |
| Original .safetensors needed? | ✅ mandatory | ✅ mandatory | ❌ not needed |
† Cold = OS page cache cold, first process start.
Squish cached = after the one-time 19s conversion; all subsequent runs.
54× faster cold load. 15× less RAM. Statistically identical outputs.
Figure 1 – Cold-start load time comparison across three configurations
Figure 2 – Peak RAM during model load
Every model you download ships in .safetensors, a format designed to move
weights between training clusters. It was never designed as a local runtime format.
When mlx_lm.load() runs, it:
- Allocates ~2.4 GB on the CPU heap even though Apple Silicon has unified memory
- Converts every tensor from storage dtype to runtime dtype on every single boot
- Makes you wait 28 seconds before the first token, for data that never changes
Squish fixes all three by decoupling storage from runtime. The original files are not needed after the first run. Delete them.
```
FIRST RUN (~5–10 min, one-time per machine, done automatically by `squish pull`)

HuggingFace MLX weights ──► Squish INT8 compress ──► npy-dir on disk
                                      │
                                      └──► squish_weights.safetensors (bf16, MLX-native)

ALL SUBSEQUENT RUNS (0.53s cold / 0.33s warm)

squish_weights.safetensors ──► mx.load() ──► Metal GPU map ──► model ready
```

No CPU heap allocation. No dtype conversion. Direct Metal virtual-address mapping.
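As a rough illustration, the convert-once / map-thereafter pattern can be sketched with numpy's `.npy` format and `mmap_mode` standing in for `mx.load()`'s Metal mapping. This is an analogy, not Squish's actual code:

```python
import numpy as np
import os
import tempfile

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "weights.npy")

# FIRST RUN: pay the dtype conversion once, persist in the runtime-native dtype
raw = np.random.rand(1024, 1024).astype(np.float64)   # "storage" dtype
np.save(path, raw.astype(np.float16))                 # convert + write to disk

# ALL SUBSEQUENT RUNS: memory-map the file. Pages fault in on demand,
# so there is no conversion and no upfront heap copy of the full tensor.
w = np.load(path, mmap_mode="r")
assert isinstance(w, np.memmap) and w.dtype == np.float16
```

The same principle applies at the Metal layer: once weights are stored in the dtype the runtime actually uses, loading reduces to establishing a mapping rather than copying and converting bytes.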
| Tier | File | Load time |
|---|---|---|
| 0 | INT8 `.npy` tensors (Vectro compressed) | ~19s |
| 1 | `finalized/*.npy` (float16, per-tensor) | ~4.5s |
| 2 | `squish_weights.safetensors` (bf16 MLX) | 0.33–0.53s |
Figure 4 – Squish three-tier weight cache architecture
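Tier resolution amounts to a "fastest available file wins" lookup. The helper below is a hypothetical sketch of that policy, not Squish's actual loader:

```python
from pathlib import Path

def resolve_tier(model_dir: str) -> int:
    """Pick the cheapest-to-load weight tier present on disk (sketch)."""
    d = Path(model_dir)
    if (d / "squish_weights.safetensors").exists():
        return 2  # Tier 2: bf16 MLX-native, 0.33-0.53s mmap load
    finalized = d / "finalized"
    if finalized.is_dir() and any(finalized.glob("*.npy")):
        return 1  # Tier 1: per-tensor float16, ~4.5s
    return 0      # Tier 0: INT8 compressed tensors, ~19s decompress
```

A directory that already contains `squish_weights.safetensors` short-circuits to tier 2, which is why every run after the first conversion loads in under a second.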
Evaluated with EleutherAI lm-evaluation-harness, the framework behind the Open LLM Leaderboard.
| Task | Reference | Squish | Δ | Pass |
|---|---|---|---|---|
| ARC-Easy (acc_norm) | 74.5% | 73.5% | -1.0% | ✅ |
| HellaSwag (acc_norm) | 63.5% | 62.0% | -1.5% | ✅ |
| Winogrande (acc) | 65.5% | 67.0% | +1.5% | ✅ |
| PIQA (acc_norm) | 77.5% | 76.5% | -1.0% | ✅ |
Pass criterion: ≤2% delta (well within measurement noise at 200 samples).
Winogrande improved by 1.5%: INT8 quantisation noise is uncorrelated with task variance.
Full reproducibility commands and multi-seed results are in docs/RESULTS.md.
Figure 3 – Accuracy delta vs fp16 baseline across benchmarks and models
Replace every cloud API call today. Start the server once; use it forever.
```sh
# Recommended: use the CLI
squish run 7b   # port 11435 by default

# Advanced: direct invocation
python3 -m squish.server \
  --model-dir ~/models/Qwen2.5-7B-Instruct-bf16 \
  --compressed-dir ~/models/Qwen2.5-7B-Instruct-bf16-compressed \
  --port 11435
```

Key server flags (`squish run --help` for the full list):
| Flag | Values | Default | Purpose |
|---|---|---|---|
| `--kv-cache-mode` | fp16 · int8 · snap | fp16 | KV cache compression; int8 saves RAM on long contexts via KIVI INT8 + FP16 recent window; snap adds SnapKV importance-based eviction |
| `--kv-cache-window` | integer | 64 | FP16 recent-token window size for int8/snap modes |
| `--kv-cache-budget` | integer | 4096 | Max K/V positions retained in snap mode |
| `--log-level` | warning · info · debug | warning | Uvicorn log verbosity |
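For intuition on the int8 mode, here is a minimal numpy sketch of the KIVI-style idea: older K/V positions are stored as per-position INT8 plus a scale, while the most recent `window` positions stay FP16. This illustrates the scheme only; it is not Squish's cache implementation:

```python
import numpy as np

def compress_kv(kv: np.ndarray, window: int = 64):
    """Quantise old KV positions to INT8; keep a recent FP16 window (sketch)."""
    old, recent = kv[:-window], kv[-window:]
    scale = np.abs(old).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    q = np.round(old / scale).astype(np.int8)     # 4x smaller than fp32
    return q, scale, recent.astype(np.float16)

def decompress_kv(q, scale, recent):
    old = q.astype(np.float32) * scale            # dequantise older positions
    return np.concatenate([old, recent.astype(np.float32)], axis=0)

kv = np.random.randn(512, 128).astype(np.float32)  # (positions, head_dim)
q, s, r = compress_kv(kv)
rt = decompress_kv(q, s, r)
assert np.abs(rt - kv).max() < 0.05               # small quantisation error
```

The FP16 window matters because the most recent tokens dominate attention scores during decoding; quantising only the older positions keeps the accuracy impact negligible while cutting cache memory on long contexts.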
Key compress flags (`squish compress --help`):

| Flag | Default | Purpose |
|---|---|---|
| `--awq` | off | Run AWQ activation calibration before INT8/INT4 compression |
| `--awq-samples N` | 20 | Calibration samples for AWQ (more → better accuracy, slower) |
| `--int4` | off | INT4 nibble-packed output (~44% disk savings vs INT8). ⚠️ Not recommended for models < 3B; use INT8 for best quality on small models. |
| `--zstd-level N` | 0 | Optional zstd entropy pass after quantisation (level 3 recommended) |
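Conceptually, the compression pass is per-tensor symmetric INT8 quantisation followed by an optional entropy pass. The sketch below uses stdlib `zlib` as a stand-in for zstd so it runs anywhere; the helpers are hypothetical, not Squish's actual pipeline:

```python
import zlib
import numpy as np

def quantise_int8(w: np.ndarray, level: int = 0):
    """Symmetric per-tensor INT8 quantisation + optional entropy pass (sketch)."""
    scale = float(np.abs(w).max()) / 127.0 + 1e-12
    q = np.round(w / scale).astype(np.int8)       # 4x smaller than float32
    blob = q.tobytes()
    if level > 0:                                 # entropy coding on int8 bytes
        blob = zlib.compress(blob, level)
    return blob, scale, q.shape

def dequantise_int8(blob, scale, shape, level=0):
    if level > 0:
        blob = zlib.decompress(blob)
    q = np.frombuffer(blob, dtype=np.int8).astype(np.float32)
    return q.reshape(shape) * scale

w = np.random.randn(256, 256).astype(np.float32)
blob, scale, shape = quantise_int8(w, level=3)
w2 = dequantise_int8(blob, scale, shape, level=3)
assert np.abs(w - w2).max() <= scale / 2 + 1e-6   # error bounded by half a step
```

Symmetric quantisation bounds the reconstruction error at half a quantisation step per element, which is why the benchmark deltas above stay within measurement noise.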
Point any OpenAI client at it; no code changes:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="squish",  # value ignored; no auth locally
)

# Streaming works
for chunk in client.chat.completions.create(
    model="Qwen2.5-1.5B-Instruct-bf16",
    messages=[{"role": "user", "content": "Explain attention mechanisms."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Works with: Python openai ≥1.0, LangChain, LlamaIndex, Continue.dev, Cursor,
any client that speaks the OpenAI wire protocol.
| Endpoint | Status |
|---|---|
| `POST /v1/chat/completions` | ✅ streaming + non-streaming + tool calls |
| `POST /v1/completions` | ✅ legacy text completion |
| `GET /v1/models` | ✅ model listing |
| `GET /health` | ✅ liveness probe |
| `GET /v1/metrics` | ✅ throughput · queue depth · memory |
| `POST /v1/embeddings` | ✅ mean-pool, L2-normalised |
| `GET /chat` | ✅ Web chat UI (browser) |
| `POST /api/chat` | ✅ Ollama-compatible ndjson |
| `POST /api/generate` | ✅ Ollama-compatible ndjson |
| `GET /api/tags` | ✅ Ollama model listing |
| `GET /api/version` | ✅ Ollama version handshake |
| `POST /api/embeddings` | ✅ Ollama-compatible embeddings |
Open http://localhost:11435/chat in any browser after starting the server.
- Dark-themed, single-page app: no external services, works fully offline
- Streaming responses with live token rendering (marked.js + highlight.js)
- Conversation history persisted in `localStorage` (multi-session sidebar)
- Model selector auto-populated from `/v1/models`
- System prompt editor, settings panel (temp / top_p / max_tokens / seed)
- Copy buttons on all code blocks
Squish mounts the full Ollama HTTP API at /api/*. Any tool that speaks Ollama
will work against Squish with a single env-var change and zero code changes.
```sh
# Point any Ollama client at Squish
export OLLAMA_HOST=http://localhost:11435

# Works with the official Ollama CLI
ollama list
ollama run squish   # uses /api/generate under the hood

# Works with Continue.dev, Open WebUI, Enchanted, Msty, etc.
```

```python
# Works with the official ollama Python library
import ollama

client = ollama.Client(host="http://localhost:11435")
response = client.chat(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What is entropy coding?"}],
)
print(response["message"]["content"])
```

`/v1/chat/completions` accepts OpenAI-format tools and returns tool_calls
in the response. Squish injects the JSON schema into the system prompt (Qwen2.5
style) and parses the structured output automatically.
```python
import openai, json

client = openai.OpenAI(base_url="http://localhost:11435/v1", api_key="squish")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.choices[0].finish_reason == "tool_calls":
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(f"Tool: {call.function.name}, Args: {args}")
    # → Tool: get_weather, Args: {'city': 'Tokyo', 'unit': 'celsius'}
```

Ready-made config templates live in configs/. Start Squish once, then point
any of these tools at it: no cloud API key needed for any of them.
```sh
# Copy config to Continue.dev's config directory
cp configs/continue.json ~/.continue/config.json
squish run 7b
# Re-open VS Code → Continue sidebar → Squish model appears automatically
```

```sh
pip install aider-chat
squish run 7b

# Use the bundled config
aider --config configs/aider.yml

# Or install globally
cp configs/aider.yml ~/.aider.conf.yml
aider   # picks up config automatically
```

```sh
pip install litellm
squish run 7b
litellm --config configs/litellm.yaml --port 4000
# → all OpenAI clients pointing at localhost:4000 now use Squish
```

Set the Ollama host to http://localhost:11435 and all Ollama-compatible UIs work
out of the box with zero additional configuration.
Beyond the core stable feature set, Squish includes a large library of inference optimisations.
| Tier | Description | Label |
|---|---|---|
| Stable | Core inference (Waves 1–12). All flags validated on Apple Silicon M-series. | (no label) |
| Beta | Advanced KV compression, speculative decoding variants (Waves 13–18). Functionally complete; hardware validation in progress. | [Beta] |
| Experimental | Cutting-edge research features (Waves 19+). Proof-of-concept implementations; may change. | [Experimental] |
Stable (validated on hardware): INT8/INT4 compression, KV cache compression (KIVI + SnapKV), speculative decoding, AWQ calibration, prefix/radix cache, batch scheduler, streaming, paged attention, Flash Attention, Ollama drop-in, tool calling.
Beta: Advanced KV compression (ShadowKV, PQCache, YOCO, DiffKV), additional speculative decode variants (EAGLE3, MEDUSA, KnapSpec), attention architectures (SageAttention2, GQA, ChunkedPrefill).
Experimental: Cutting-edge attention (FlashMLA, NativeSparseAttn), extended quantisation (VPTQ, FP8, MXQuant, TernaryQuant), long-context optimisations (DualChunkAttn, MInference).
See MODULES.md for the full flag reference with one-line descriptions of every supported optimisation, categorised by stability tier.
- Discord – get help, share benchmarks, discuss models
- GitHub Discussions – Q&A, ideas, show & tell
- HuggingFace – pre-squished model weights (no local compression needed)
- Contributing – good first issues, dev setup, PR guidelines
- macOS · Apple Silicon (M1–M5)
- Python 3.10+ (3.12 recommended)
- Dependencies install automatically via `pip install squish`
- Core: `mlx-lm`, `numpy`, `transformers`, `fastapi`, `uvicorn[standard]`, `safetensors`, `zstandard`, `aiofiles`, `huggingface-hub`
- Eval extras: `pip install squish[eval]` adds `lm-eval`, `datasets`, `accelerate`
- Optional: Rust quantizer (`squish_quant_rs/`) for 4–6× faster compression throughput
| Metric | Value |
|---|---|
| Mean cosine similarity | 0.99999 |
| Min cosine similarity | 0.99995 |
| First-token agreement | 5/5 test prompts |
| Tensors quantised (INT8) | 249 / 338 |
| Tensors passthrough (fp16) | 89 / 338 |
Embeddings, layer norms, and lm_head are stored as passthrough float16.
Zero quantisation error on the prediction path.
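A toy version of that fidelity check (hypothetical dimensions, not the real harness): cosine similarity between reference logits and logits computed through an INT8-quantised weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)         # hidden state
W = rng.standard_normal((64, 1000)).astype(np.float32) # "lm_head"-like matrix

# Quantise + immediately dequantise the weights (symmetric INT8)
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8) * scale

ref, test = x @ W, x @ W_q
cos = np.dot(ref, test) / (np.linalg.norm(ref) * np.linalg.norm(test))
assert cos > 0.999   # near-perfect logit agreement, as in the table above
```

Because the quantisation noise is small and uncorrelated across weights, it largely cancels in the matrix product, which is why per-tensor cosine similarities land at 0.9999+.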
The prior work: BitStack (ICLR 2025), Huff-LLM (Feb 2025), DFloat11, NeuZip.
None of them work on Apple Silicon. None serve an OpenAI-compatible API.
None achieve sub-second loads from a compressed format.
MLX GitHub issue #3043 (January 2026), an open feature request to add entropy coding to MLX, is the clearest signal that this gap exists and is unsolved.
Search "compressed weight" "MLX" inference "no decompression" "Apple Silicon": zero results.
Squish INT8 compression achieves accuracy statistically equivalent to the fp16 baseline
across four standard reasoning benchmarks (ARC-Easy, HellaSwag, Winogrande, PIQA),
while reducing cold-start load time by 54× and peak load RAM by 6×.
The compressed format requires zero access to the original model files
after a one-time per-device conversion.
The numbers are real. Run it yourself.

