
v0.1.0 — Storage-Tier-Aware LLM Inference

@t8 released this 17 Mar 23:26
· 21 commits to main since this release

Hypura v0.1.0

Run models too big for your Mac's memory. Hypura places model tensors across GPU (Metal), RAM, and NVMe based on access patterns and hardware bandwidth — enabling models that exceed physical memory to run where vanilla llama.cpp would OOM.

Benchmark Results (M1 Max, 32 GB)

| Model | Size | Mode | Hypura | llama.cpp |
|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | full-resident | 21 tok/s | ~21 tok/s |
| Mixtral 8x7B Q5_K_M | 30.9 GB | expert-streaming | 2.2 tok/s | OOM |
| Llama 3.3 70B Q4_K_M | 39.6 GB | dense-FFN-streaming | 0.3 tok/s | OOM |
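To illustrate why the two larger models stream while the 14B runs fully resident: a minimal sketch of a residency check against a committed-memory budget. The function name and the 75% usable-RAM headroom figure are assumptions for this example, not Hypura's actual API or policy.

```rust
// Illustrative sketch: decide whether a model can run fully resident
// or must stream some tensors from NVMe. The 75% headroom figure and
// all names here are assumptions, not Hypura's real implementation.

/// Fraction of physical RAM treated as usable after OS / Metal /
/// KV-cache overhead is set aside.
const USABLE_FRACTION: f64 = 0.75;

fn plan(model_gb: f64, ram_gb: f64) -> &'static str {
    if model_gb <= ram_gb * USABLE_FRACTION {
        "full-resident"
    } else {
        "streaming"
    }
}

fn main() {
    // The benchmark models above, on a 32 GB machine.
    for (name, size_gb) in [
        ("Qwen 2.5 14B Q4_K_M", 8.4),
        ("Mixtral 8x7B Q5_K_M", 30.9),
        ("Llama 3.3 70B Q4_K_M", 39.6),
    ] {
        println!("{name}: {}", plan(size_gb, 32.0));
    }
}
```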

Key Features

  • Automatic hardware profiling — detects GPU, RAM, and NVMe bandwidth; caches profile for 30 days
  • LP + greedy tensor placement — optimally assigns tensors across GPU/RAM/NVMe tiers
  • Expert-streaming (MoE) — streams only selected experts from NVMe; 99.5% neuron cache hit rate on Mixtral
  • Dense FFN-streaming — streams FFN tensors from NVMe while attention stays on GPU; all layers on Metal
  • Dynamic pool sizing — pool buffer and prefetch depth scale automatically with available memory
  • Co-activation tracking — learns expert firing patterns across tokens, persists to disk for speculative prefetch
  • Ollama-compatible server — drop-in replacement API (`/api/generate`, `/api/chat`)
  • Unified memory budget — single source of truth for committed memory estimation (GPU, KV cache, Metal, OS)
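The greedy half of the placement step can be sketched like this: sort tensors by access frequency and fill the fastest tier that still has capacity. This is a simplified illustration only; Hypura's actual placer also solves an LP, and every struct, field, and number below is invented for the example.

```rust
// Simplified greedy tier placement: the hottest tensors go to the
// fastest tier with remaining capacity, and whatever fits nowhere
// else is streamed from NVMe. All names here are illustrative.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Gpu,  // Metal-resident
    Ram,
    Nvme, // streamed on demand
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: u32, // proxy for access frequency
}

/// Assign each tensor to the fastest tier that still has room,
/// visiting tensors in order of decreasing access frequency.
fn place(
    mut tensors: Vec<Tensor>,
    mut gpu_free: u64,
    mut ram_free: u64,
) -> Vec<(&'static str, Tier)> {
    tensors.sort_by(|a, b| b.accesses_per_token.cmp(&a.accesses_per_token));
    tensors
        .into_iter()
        .map(|t| {
            let tier = if t.bytes <= gpu_free {
                gpu_free -= t.bytes;
                Tier::Gpu
            } else if t.bytes <= ram_free {
                ram_free -= t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme // overflow tier: stream from disk
            };
            (t.name, tier)
        })
        .collect()
}

fn main() {
    let tensors = vec![
        Tensor { name: "attn.0", bytes: 4, accesses_per_token: 10 },
        Tensor { name: "ffn.0", bytes: 8, accesses_per_token: 10 },
        Tensor { name: "expert.3", bytes: 8, accesses_per_token: 1 },
    ];
    // Tiny capacities (same made-up units) to force a spill to NVMe.
    println!("{:?}", place(tensors, 8, 8));
}
```

The real placer has to weigh measured tier bandwidth against tensor size as well, which is where the LP comes in; the greedy pass above only captures the "hot tensors first" intuition.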

Install

```sh
git clone --recurse-submodules https://github.com/t8/hypura.git
cd hypura
cargo build --release
```

Requires Rust 1.75+ and CMake. Apple Silicon only (M1/M2/M3/M4).