# Hypura v0.1.0
Run models too big for your Mac's memory. Hypura places model tensors across GPU (Metal), RAM, and NVMe based on access patterns and hardware bandwidth — enabling models that exceed physical memory to run where vanilla llama.cpp would OOM.
## Benchmark Results (M1 Max, 32 GB)
| Model | Size | Mode | Hypura | llama.cpp |
|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | full-resident | 21 tok/s | ~21 tok/s |
| Mixtral 8x7B Q5_K_M | 30.9 GB | expert-streaming | 2.2 tok/s | OOM |
| Llama 3.3 70B Q4_K_M | 39.6 GB | dense-FFN-streaming | 0.3 tok/s | OOM |
## Key Features
- Automatic hardware profiling — detects GPU, RAM, and NVMe bandwidth; caches profile for 30 days
- LP + greedy tensor placement — combines a linear-programming solve with greedy assignment to place tensors across GPU/RAM/NVMe tiers, minimizing expected access cost
- Expert-streaming (MoE) — streams only selected experts from NVMe; 99.5% neuron cache hit rate on Mixtral
- Dense FFN-streaming — streams FFN weights from NVMe while attention weights stay resident on GPU; compute for all layers runs on Metal
- Dynamic pool sizing — pool buffer and prefetch depth scale automatically with available memory
- Co-activation tracking — learns expert firing patterns across tokens, persists to disk for speculative prefetch
- Ollama-compatible server — drop-in replacement API (`/api/generate`, `/api/chat`)
- Unified memory budget — single source of truth for committed memory estimation (GPU, KV cache, Metal, OS)
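The greedy half of the placement step can be pictured as a bin-packing pass: sort tensors by how "hot" they are (accesses per byte), sort tiers by bandwidth, and fill the fastest tiers first. A minimal sketch in Rust — the types and the `place_greedy` function are illustrative assumptions, not Hypura's actual API:

```rust
// Hypothetical greedy tensor placement across memory tiers.
// Assumes each tier has a fixed capacity and a read bandwidth,
// and each tensor has an estimated access frequency per token.

#[derive(Debug, Clone)]
struct Tier {
    name: &'static str,
    capacity_bytes: u64,
    bandwidth_gbps: f64, // read bandwidth; higher = preferred
    used_bytes: u64,
}

#[derive(Debug, Clone)]
struct Tensor {
    name: &'static str,
    size_bytes: u64,
    access_freq: f64, // estimated accesses per generated token
}

/// Place the hottest tensors (accesses per byte) on the fastest tiers
/// that still have room; everything else spills toward NVMe.
fn place_greedy(tensors: &mut [Tensor], tiers: &mut [Tier]) -> Vec<(String, String)> {
    // Hottest tensors first.
    tensors.sort_by(|a, b| {
        let ha = a.access_freq / a.size_bytes as f64;
        let hb = b.access_freq / b.size_bytes as f64;
        hb.partial_cmp(&ha).unwrap()
    });
    // Fastest tiers first.
    tiers.sort_by(|a, b| b.bandwidth_gbps.partial_cmp(&a.bandwidth_gbps).unwrap());

    let mut placement = Vec::new();
    for t in tensors.iter() {
        for tier in tiers.iter_mut() {
            if tier.used_bytes + t.size_bytes <= tier.capacity_bytes {
                tier.used_bytes += t.size_bytes;
                placement.push((t.name.to_string(), tier.name.to_string()));
                break;
            }
        }
    }
    placement
}

fn main() {
    let mut tiers = vec![
        Tier { name: "gpu",  capacity_bytes: 16 << 30,  bandwidth_gbps: 400.0, used_bytes: 0 },
        Tier { name: "ram",  capacity_bytes: 8 << 30,   bandwidth_gbps: 60.0,  used_bytes: 0 },
        Tier { name: "nvme", capacity_bytes: 512 << 30, bandwidth_gbps: 6.0,   used_bytes: 0 },
    ];
    let mut tensors = vec![
        Tensor { name: "attn.0", size_bytes: 2 << 30,  access_freq: 1.0 }, // hot, small
        Tensor { name: "ffn.0",  size_bytes: 20 << 30, access_freq: 0.1 }, // cold, large
    ];
    // Hot attention weights land on GPU; the large FFN spills to NVMe.
    println!("{:?}", place_greedy(&mut tensors, &mut tiers));
}
```

The LP solve would refine this by trading off tensors globally rather than one at a time; the greedy pass is what makes placement cheap enough to rerun when the memory budget changes.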
Install
```sh
git clone --recurse-submodules https://github.com/t8/hypura.git
cd hypura
cargo build --release
```
Requires Rust 1.75+ and CMake. Apple Silicon only (M1/M2/M3/M4).
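Since the server speaks the Ollama wire format, standard Ollama clients and plain `curl` should work against it. A hypothetical session — the binary path, subcommand, port, and model name below are assumptions (check the server's `--help` and startup output), while the request body follows the standard Ollama `/api/generate` shape:

```sh
# Start the server (subcommand and port are assumed, not confirmed by this README).
./target/release/hypura serve &

# Ollama-style generation request against the compatible endpoint.
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral:8x7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

Because the API is a drop-in replacement, existing tools pointed at an Ollama host should only need their base URL changed.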