Squish - Squeeze the Most Out of Your AI Models


Local LLM inference at sub-second load times.
Drop-in for OpenAI, Ollama, and any LLM client.
Web chat UI · Tool calling · Batch scheduler · CLI
No API key. No cloud. No data leaving your machine.
Free.

⚠️ macOS + Apple Silicon (M1–M5) only. Linux/CUDA support is on the roadmap. Windows is not planned.


Demo

Install

# Homebrew (recommended)
brew install wesleyscholl/squish/squish
# One-liner installer
curl -fsSL https://raw.githubusercontent.com/wesleyscholl/squish/main/install.sh | bash
# pip
pip install squish

Quick Start

squish catalog              # browse 29 available models
squish pull qwen3:8b        # download + compress once (~5 min)
squish run qwen3:8b         # start server on :11435

Then open http://localhost:11435/chat in any browser.

Or chat in the terminal:

squish chat qwen3:8b

Drop-in for any OpenAI or Ollama client:

export OPENAI_BASE_URL=http://localhost:11435/v1
export OPENAI_API_KEY=squish
# or
export OLLAMA_HOST=http://localhost:11435

All flags above are Stable (Waves 1–12). For advanced optimization flags ([Beta] Waves 13–18 and [Experimental] Waves 19+), see MODULES.md.


Why Not Ollama or LM Studio?

Ollama and LM Studio are great tools. Squish solves a different problem.

|                                | Ollama  | LM Studio | Squish           |
| ------------------------------ | ------- | --------- | ---------------- |
| Cold-start load time           | 8–25 s  | 10–30 s   | 0.33–0.53 s      |
| RAM during load                | ~2–8 GB | ~2–8 GB   | 160 MB ‡         |
| OpenAI-compatible API          | ✅      | ✅        | ✅               |
| Ollama-compatible API          | ✅      | ✅        | ✅               |
| Web chat UI                    | ❌      | ✅        | ✅               |
| Tool calling                   | ✅      | ✅        | ✅               |
| Batch/concurrent requests      | limited | ❌        | ✅               |
| Works offline after pull       | ✅      | ✅        | ✅               |
| Download pre-squished weights  | N/A     | N/A       | ✅ (HuggingFace) |
| Apple Silicon–optimised        | ✅      | ✅        | ✅               |
| INT8 npy-dir format (mmap)     | ❌      | ❌        | ✅               |
| Source available               | ✅      | ❌        | ✅               |

The key distinction: Ollama and LM Studio use standard GGUF/MLX weights that require full dtype conversion on every boot. Squish stores weights in a Metal-native format that maps directly into unified memory — no conversion, sub-second every time.

‡ 160 MB = Apple Metal virtual-address delta during the load phase (mmap, no CPU heap allocation). Peak RSS during full initialization is ~402 MB. Both figures measured on Apple Silicon M-series.


The Numbers That Matter

Model: Qwen2.5-1.5B-Instruct · Hardware: Apple Silicon M-series, MLX framework

|                                | Cold mlx_lm load† | Reference (mlx_lm) | Squish (cached) |
| ------------------------------ | ----------------- | ------------------ | --------------- |
| Load time                      | 28.81 s           | 1.96 s             | 0.53 s          |
| RAM during load                | ~2400 MB          | ~2400 MB           | 160 MB          |
| Peak load RAM                  | ~2600 MB          | ~2600 MB           | 402 MB          |
| Token cost                     | $0 (local)        | $0 (local)         | $0              |
| Original .safetensors needed?  | ✅ mandatory      | ✅ mandatory       | ❌ not needed   |

†Cold = OS page cache cold, first process start.
Squish cached = after one-time 19s conversion; all subsequent runs.

54× faster cold load. 15× less RAM. Statistically identical outputs.
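The headline multipliers follow directly from the table above:

```python
# Pure-arithmetic check of the speedup claims from the benchmark table.
cold_load_s, squish_load_s = 28.81, 0.53
cold_ram_mb, squish_ram_mb = 2400, 160

print(f"{cold_load_s / squish_load_s:.0f}x faster cold load")  # 54x
print(f"{cold_ram_mb / squish_ram_mb:.0f}x less RAM")          # 15x
```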

Figure 1 — Cold-start load time comparison across three configurations

Figure 2 — Peak RAM during model load


The Problem

Every model you download ships in .safetensors — a format designed to move weights between training clusters. It was never intended as a local runtime format.

When mlx_lm.load() runs, it:

  1. Allocates ~2.4 GB into CPU heap even though Apple Silicon has unified memory
  2. Converts every tensor from storage dtype to runtime dtype — every single boot
  3. Makes you wait 28 seconds before the first token — for data that never changes

Squish fixes all three by decoupling storage from runtime. The original files are not needed after the first run. Delete them.
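The cost of step 2 can be seen in miniature with NumPy (an analogy for the dtype conversion, not mlx_lm internals): `astype` materialises a second full copy of the weights on every run.

```python
import numpy as np

# Converting storage dtype to runtime dtype allocates a fresh buffer
# holding the entire tensor -- there is no in-place or zero-copy path.
storage = np.zeros(1_000_000, dtype=np.float32)   # the "on-disk" dtype
runtime = storage.astype(np.float16)              # new allocation, every boot

assert runtime.nbytes == storage.nbytes // 2
assert runtime.base is None   # a new buffer, not a view of `storage`
```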


How It Works

FIRST RUN (~5-10 min — one-time per machine, done automatically by `squish pull`)
HuggingFace MLX weights ──► Squish INT8 compress ──► npy-dir on disk
                                      │
                                      └──► squish_weights.safetensors  (bf16, MLX-native)

ALL SUBSEQUENT RUNS (0.53s cold / 0.33s warm)
squish_weights.safetensors ──► mx.load() ──► Metal GPU map ──► model ready

No CPU heap allocation. No dtype conversion. Direct Metal virtual-address mapping.
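The subsequent-run path behaves like any memory-mapped file: pages are faulted in on demand instead of being bulk-copied into the process heap. A NumPy `memmap` sketch of the same idea (here `np.load` stands in for `mx.load`'s Metal mapping; this is an analogy, not Squish internals):

```python
import numpy as np
import os
import tempfile

# Write a small .npy file to stand in for cached weights.
path = os.path.join(tempfile.mkdtemp(), "weights.npy")
np.save(path, np.arange(1024, dtype=np.float16))

# Mapping is near-instant: no bulk read, no heap copy -- the OS pages
# data in lazily as it is touched.
mapped = np.load(path, mmap_mode="r")

assert isinstance(mapped, np.memmap)
assert float(mapped[1023]) == 1023.0   # data is readable on demand
```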

Three-Tier Cache

| Tier | File                                   | Load time   |
| ---- | -------------------------------------- | ----------- |
| 0    | INT8 .npy tensors (Vectro compressed)  | ~19 s       |
| 1    | finalized/*.npy (float16, per-tensor)  | ~4.5 s      |
| 2    | squish_weights.safetensors (bf16 MLX)  | 0.33–0.53 s |

Figure 4 — Squish three-tier weight cache architecture


Benchmark Accuracy

Evaluated with EleutherAI lm-evaluation-harness — the framework behind the Open LLM Leaderboard.

| Task                  | Reference | Squish | Δ     | Pass |
| --------------------- | --------- | ------ | ----- | ---- |
| ARC-Easy (acc_norm)   | 74.5%     | 73.5%  | -1.0% | ✅   |
| HellaSwag (acc_norm)  | 63.5%     | 62.0%  | -1.5% | ✅   |
| Winogrande (acc)      | 65.5%     | 67.0%  | +1.5% | ✅   |
| PIQA (acc_norm)       | 77.5%     | 76.5%  | -1.0% | ✅   |

Pass criterion: ≤2% delta (well within measurement noise at 200 samples).
Winogrande improved by 1.5% — INT8 quantisation noise is uncorrelated with task variance.

Full reproducibility commands and multi-seed results are in docs/RESULTS.md.

Figure 3 — Accuracy delta vs fp16 baseline across benchmarks and models


Drop-In API Server

Replace every cloud API call today. Start the server once; use it forever.

# Recommended: use the CLI
squish run 7b           # port 11435 by default

# Advanced: direct invocation
python3 -m squish.server \
    --model-dir      ~/models/Qwen2.5-7B-Instruct-bf16 \
    --compressed-dir ~/models/Qwen2.5-7B-Instruct-bf16-compressed \
    --port 11435

Key server flags (squish run --help for the full list):

| Flag              | Values               | Default | Purpose |
| ----------------- | -------------------- | ------- | ------- |
| --kv-cache-mode   | fp16 · int8 · snap   | fp16    | KV cache compression; int8 saves RAM on long contexts via KIVI INT8 + FP16 recent window; snap adds SnapKV importance-based eviction |
| --kv-cache-window | integer              | 64      | FP16 recent-token window size for int8/snap modes |
| --kv-cache-budget | integer              | 4096    | Max K/V positions retained in snap mode |
| --log-level       | warning · info · debug | warning | Uvicorn log verbosity |
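The int8 mode described above can be pictured with a minimal NumPy sketch: all but the most recent `window` positions are quantised to INT8 with per-position scales, while the recent window stays FP16. This is an illustrative analogy under stated assumptions, not Squish's actual KIVI/SnapKV implementation.

```python
import numpy as np

def compress_kv(kv: np.ndarray, window: int = 64):
    """Quantise older KV positions to INT8; keep the recent window FP16."""
    old, recent = kv[:-window], kv[-window:]
    scale = np.abs(old).max(axis=-1, keepdims=True) / 127.0  # per-position scale
    scale[scale == 0] = 1.0
    q = np.round(old / scale).astype(np.int8)
    return q, scale, recent.astype(np.float16)

def decompress_kv(q, scale, recent):
    return np.concatenate([q.astype(np.float32) * scale,
                           recent.astype(np.float32)])

kv = np.random.randn(512, 64).astype(np.float32)   # (positions, head_dim)
q, scale, recent = compress_kv(kv)
restored = decompress_kv(q, scale, recent)

assert restored.shape == kv.shape
assert np.abs(restored - kv).max() < 0.05   # small quantisation error
```

The RAM saving comes from the older positions dropping from 4 bytes to ~1 byte per value, while the FP16 window preserves full precision where attention is most sensitive.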

Key compress flags (squish compress --help):

| Flag            | Default | Purpose |
| --------------- | ------- | ------- |
| --awq           | off     | Run AWQ activation calibration before INT8/INT4 compression |
| --awq-samples N | 20      | Calibration samples for AWQ (more → better accuracy, slower) |
| --int4          | off     | INT4 nibble-packed output (~44% disk savings vs INT8). ⚠ Not recommended for models < 3B — use INT8 for best quality on small models. |
| --zstd-level N  | 0       | Optional zstd entropy pass after quantisation (level 3 recommended) |
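The --int4 saving comes from packing two signed 4-bit values into each byte. Below is a hedged NumPy sketch of generic nibble packing; Squish's actual on-disk layout is not documented here and may differ.

```python
import numpy as np

def pack_int4(vals: np.ndarray) -> np.ndarray:
    """Pack pairs of values in [-8, 7] into single bytes (two's-complement nibbles)."""
    assert vals.size % 2 == 0 and vals.min() >= -8 and vals.max() <= 7
    u = (vals.astype(np.int16) & 0xF).astype(np.uint8)
    return ((u[0::2] << 4) | u[1::2]).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    u = np.empty(packed.size * 2, dtype=np.uint8)
    u[0::2], u[1::2] = packed >> 4, packed & 0xF
    signed = u.astype(np.int8)
    signed[signed > 7] -= 16   # restore the sign of negative nibbles
    return signed

vals = np.array([-8, -1, 0, 3, 7, -4], dtype=np.int8)
packed = pack_int4(vals)

assert packed.nbytes == vals.nbytes // 2        # 2x smaller than INT8
assert (unpack_int4(packed) == vals).all()      # lossless round trip
```

Packing alone halves the INT8 footprint; the quoted ~44% figure is after the scales and passthrough tensors that remain at higher precision are counted in.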

Point any OpenAI client at it — no code changes:

import openai

client = openai.OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="squish",   # value ignored; no auth locally
)

# Streaming works
for chunk in client.chat.completions.create(
    model="Qwen2.5-1.5B-Instruct-bf16",
    messages=[{"role": "user", "content": "Explain attention mechanisms."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Works with: Python openai ≥1.0, LangChain, LlamaIndex, Continue.dev, Cursor, any client that speaks the OpenAI wire protocol.

Server Endpoints

| Endpoint                   | Status |
| -------------------------- | ------ |
| POST /v1/chat/completions  | ✅ streaming + non-streaming + tool calls |
| POST /v1/completions       | ✅ legacy text completion |
| GET /v1/models             | ✅ model listing |
| GET /health                | ✅ liveness probe |
| GET /v1/metrics            | ✅ throughput · queue depth · memory |
| POST /v1/embeddings        | ✅ mean-pool L2-normalised |
| GET /chat                  | ✅ Web chat UI (browser) |
| POST /api/chat             | ✅ Ollama-compatible ndjson |
| POST /api/generate         | ✅ Ollama-compatible ndjson |
| GET /api/tags              | ✅ Ollama model listing |
| GET /api/version           | ✅ Ollama version handshake |
| POST /api/embeddings       | ✅ Ollama-compatible embeddings |
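"Mean-pool L2-normalised" for /v1/embeddings refers to the standard convention: token-level hidden states are averaged, then scaled to unit length. A NumPy sketch of that convention (not Squish's code):

```python
import numpy as np

# Stand-in for the model's token-level hidden states: (tokens, dim).
hidden = np.random.randn(12, 768).astype(np.float32)

pooled = hidden.mean(axis=0)                 # mean-pool over tokens
embedding = pooled / np.linalg.norm(pooled)  # L2-normalise to unit length

assert embedding.shape == (768,)
assert abs(float(np.linalg.norm(embedding)) - 1.0) < 1e-4
```

Unit-length embeddings make cosine similarity a plain dot product, which is what most vector stores expect.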

Web Chat UI

Open http://localhost:11435/chat in any browser after starting the server.

  • Dark-themed, single-page app — no external services, works fully offline
  • Streaming responses with live token rendering (marked.js + highlight.js)
  • Conversation history persisted in localStorage (multi-session sidebar)
  • Model selector auto-populated from /v1/models
  • System prompt editor, settings panel (temp / top_p / max_tokens / seed)
  • Copy buttons on all code blocks

Ollama Drop-In

Squish mounts the full Ollama HTTP API at /api/*. Any tool that speaks Ollama will work against Squish with a single env-var change and zero code changes.

# Point any Ollama client at Squish
export OLLAMA_HOST=http://localhost:11435

# Works with the official Ollama CLI
ollama list
ollama run squish   # uses /api/generate under the hood

# Works with Continue.dev, Open WebUI, Enchanted, Msty, etc.
# Works with the official ollama Python library
import ollama

client = ollama.Client(host="http://localhost:11435")
response = client.chat(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What is entropy coding?"}],
)
print(response["message"]["content"])

Tool / Function Calling

/v1/chat/completions accepts OpenAI-format tools and returns tool_calls in the response. Squish injects the JSON schema into the system prompt (Qwen2.5 style) and parses the structured output automatically.

import openai, json

client = openai.OpenAI(base_url="http://localhost:11435/v1", api_key="squish")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct-bf16",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)

if response.choices[0].finish_reason == "tool_calls":
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    print(f"Tool: {call.function.name}, Args: {args}")
    # β†’ Tool: get_weather, Args: {'city': 'Tokyo', 'unit': 'celsius'}
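To complete the loop, OpenAI-protocol clients send the tool's output back as a `tool` role message that echoes the call id, then make a second create() call for the final answer. The sketch below only builds that follow-up payload offline; the call id and weather values are invented for illustration.

```python
import json

call_id = "call_abc123"   # in practice: response.choices[0].message.tool_calls[0].id
tool_result = {"city": "Tokyo", "temp_c": 18, "condition": "cloudy"}

followup = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    # Echo the assistant turn that requested the tool call.
    {"role": "assistant", "tool_calls": [{
        "id": call_id, "type": "function",
        "function": {"name": "get_weather",
                     "arguments": json.dumps({"city": "Tokyo", "unit": "celsius"})},
    }]},
    # The tool's output, keyed back to the call it answers.
    {"role": "tool", "tool_call_id": call_id, "content": json.dumps(tool_result)},
]

assert followup[-1]["tool_call_id"] == followup[1]["tool_calls"][0]["id"]
# client.chat.completions.create(model=..., messages=followup, tools=tools)
```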

Integrations

Ready-made config templates live in configs/. Start Squish once, then point any of these tools at it — no cloud API key needed for any of them.

Continue.dev (VS Code / JetBrains AI assistant)

# Copy config to Continue.dev's config directory
cp configs/continue.json ~/.continue/config.json
squish run 7b
# Re-open VS Code → Continue sidebar → Squish model appears automatically

aider (AI pair programming in the terminal)

pip install aider-chat
squish run 7b

# Use the bundled config
aider --config configs/aider.yml

# Or install globally
cp configs/aider.yml ~/.aider.conf.yml
aider   # picks up config automatically

LiteLLM (unified proxy — route multiple providers through one endpoint)

pip install litellm
squish run 7b

litellm --config configs/litellm.yaml --port 4000
# → all OpenAI clients pointing at localhost:4000 now use Squish

Open WebUI / Enchanted / Msty (Ollama-compatible frontends)

Set the Ollama host to http://localhost:11435 — all Ollama-compatible UIs work out of the box with zero additional configuration.


Advanced Features

Beyond the core stable feature set, Squish includes a large library of inference optimisations.

| Tier         | Description | Label |
| ------------ | ----------- | ----- |
| Stable       | Core inference (Waves 1–12). All flags validated on Apple Silicon M-series. | (no label) |
| Beta         | Advanced KV compression, speculative decoding variants (Waves 13–18). Functionally complete; hardware validation in progress. | [Beta] |
| Experimental | Cutting-edge research features (Waves 19+). Proof-of-concept implementations; may change. | [Experimental] |

Stable (validated on hardware): INT8/INT4 compression, KV cache compression (KIVI + SnapKV), speculative decoding, AWQ calibration, prefix/radix cache, batch scheduler, streaming, paged attention, Flash Attention, Ollama drop-in, tool calling.

Beta: Advanced KV compression (ShadowKV, PQCache, YOCO, DiffKV), additional speculative decode variants (EAGLE3, MEDUSA, KnapSpec), attention architectures (SageAttention2, GQA, ChunkedPrefill).

Experimental: Cutting-edge attention (FlashMLA, NativeSparseAttn), extended quantisation (VPTQ, FP8, MXQuant, TernaryQuant), long-context optimisations (DualChunkAttn, MInference).

See MODULES.md for the full flag reference with one-line descriptions of every supported optimisation, categorised by stability tier.


Community

  • Discord — get help, share benchmarks, discuss models
  • GitHub Discussions — Q&A, ideas, show & tell
  • HuggingFace — pre-squished model weights (no local compression needed)
  • Contributing — good first issues, dev setup, PR guidelines

Requirements

  • macOS · Apple Silicon (M1–M5)
  • Python 3.10+ (3.12 recommended)
  • Dependencies install automatically via pip install squish
  • Core: mlx-lm, numpy, transformers, fastapi, uvicorn[standard], safetensors, zstandard, aiofiles, huggingface-hub
  • Eval extras: pip install squish[eval] adds lm-eval, datasets, accelerate
  • Optional: Rust quantizer (squish_quant_rs/) for 4–6× faster compression throughput

Weight Fidelity

| Metric                      | Value            |
| --------------------------- | ---------------- |
| Mean cosine similarity      | 0.99999          |
| Min cosine similarity       | 0.99995          |
| First-token agreement       | 5/5 test prompts |
| Tensors quantised (INT8)    | 249 / 338        |
| Tensors passthrough (fp16)  | 89 / 338         |

Embeddings, layer norms, and lm_head are stored as passthrough float16.
Zero quantisation error on the prediction path.
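The cosine-similarity metric above can be reproduced in miniature with a symmetric per-tensor INT8 round-trip (a sketch of the metric, not Squish's actual quantiser):

```python
import numpy as np

w = np.random.randn(4096).astype(np.float32)   # stand-in weight tensor

# Symmetric per-tensor INT8 quantise + dequantise.
scale = np.abs(w).max() / 127.0
restored = np.round(w / scale).astype(np.int8).astype(np.float32) * scale

# Cosine similarity between original and round-tripped weights.
cos = np.dot(w, restored) / (np.linalg.norm(w) * np.linalg.norm(restored))
assert cos > 0.999
```

Per-channel scales and AWQ calibration (as in the real pipeline) push this figure further toward the 0.99999 reported in the table.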


Novelty

The prior work: BitStack (ICLR 2025), Huff-LLM (Feb 2025), DFloat11, NeuZip.
None of them work on Apple Silicon. None serve an OpenAI-compatible API.
None achieve sub-second loads from a compressed format.

MLX GitHub issue #3043 (January 2026) — an open feature request to add entropy coding to MLX — is the clearest signal this gap exists and is unsolved.

Search "compressed weight" "MLX" inference "no decompression" "Apple Silicon" — zero results.


The Summary Worth Citing

Squish INT8 compression achieves accuracy statistically equivalent to fp16 baseline
across four standard reasoning benchmarks (ARC-Easy, HellaSwag, Winogrande, PIQA),
while reducing cold-start load time by 54× and peak load RAM by 6×.
The compressed format requires zero access to the original model files
after a one-time per-device conversion.

The numbers are real. Run it yourself.