
v0.1.0 — Storage-Tier-Aware LLM Inference

@t8 released this 17 Mar 23:26
· 21 commits to main since this release

Hypura v0.1.0

Run models too big for your Mac's memory. Hypura places model tensors across GPU (Metal), RAM, and NVMe based on access patterns and hardware bandwidth — enabling models that exceed physical memory to run where vanilla llama.cpp would OOM.

Benchmark Results (M1 Max, 32 GB)

| Model | Size | Mode | Hypura | llama.cpp |
|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | full-resident | 21 tok/s | ~21 tok/s |
| Mixtral 8x7B Q5_K_M | 30.9 GB | expert-streaming | 2.2 tok/s | OOM |
| Llama 3.3 70B Q4_K_M | 39.6 GB | dense-FFN-streaming | 0.3 tok/s | OOM |
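To illustrate why the two larger models stream while the 14B runs fully resident: a minimal sketch of a residency check against a committed-memory budget. The function name and the 75% usable-RAM headroom figure are assumptions for this example, not Hypura's actual API or policy.

```rust
// Illustrative sketch: decide whether a model can run fully resident
// or must stream some tensors from NVMe. The 75% headroom figure and
// all names here are assumptions, not Hypura's real implementation.

/// Fraction of physical RAM treated as usable after OS / Metal /
/// KV-cache overhead is set aside.
const USABLE_FRACTION: f64 = 0.75;

fn plan(model_gb: f64, ram_gb: f64) -> &'static str {
    if model_gb <= ram_gb * USABLE_FRACTION {
        "full-resident"
    } else {
        "streaming"
    }
}

fn main() {
    // The benchmark models above, on a 32 GB machine.
    for (name, size_gb) in [
        ("Qwen 2.5 14B Q4_K_M", 8.4),
        ("Mixtral 8x7B Q5_K_M", 30.9),
        ("Llama 3.3 70B Q4_K_M", 39.6),
    ] {
        println!("{name}: {}", plan(size_gb, 32.0));
    }
}
```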

Key Features

  • Automatic hardware profiling — detects GPU, RAM, and NVMe bandwidth; caches profile for 30 days
  • LP + greedy tensor placement — optimally assigns tensors across GPU/RAM/NVMe tiers
  • Expert-streaming (MoE) — streams only selected experts from NVMe; 99.5% neuron cache hit rate on Mixtral
  • Dense FFN-streaming — streams FFN tensors from NVMe while attention stays on GPU; all layers on Metal
  • Dynamic pool sizing — pool buffer and prefetch depth scale automatically with available memory
  • Co-activation tracking — learns expert firing patterns across tokens, persists to disk for speculative prefetch
  • Ollama-compatible server — drop-in replacement API (`/api/generate`, `/api/chat`)
  • Unified memory budget — single source of truth for committed memory estimation (GPU, KV cache, Metal, OS)
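The greedy half of the placement step can be sketched like this: sort tensors by access frequency and fill the fastest tier that still has capacity. This is a simplified illustration only; Hypura's actual placer also solves an LP, and every struct, field, and number below is invented for the example.

```rust
// Simplified greedy tier placement: the hottest tensors go to the
// fastest tier with remaining capacity, and whatever fits nowhere
// else is streamed from NVMe. All names here are illustrative.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier {
    Gpu,  // Metal-resident
    Ram,
    Nvme, // streamed on demand
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: u32, // proxy for access frequency
}

/// Assign each tensor to the fastest tier that still has room,
/// visiting tensors in order of decreasing access frequency.
fn place(
    mut tensors: Vec<Tensor>,
    mut gpu_free: u64,
    mut ram_free: u64,
) -> Vec<(&'static str, Tier)> {
    tensors.sort_by(|a, b| b.accesses_per_token.cmp(&a.accesses_per_token));
    tensors
        .into_iter()
        .map(|t| {
            let tier = if t.bytes <= gpu_free {
                gpu_free -= t.bytes;
                Tier::Gpu
            } else if t.bytes <= ram_free {
                ram_free -= t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme // overflow tier: stream from disk
            };
            (t.name, tier)
        })
        .collect()
}

fn main() {
    let tensors = vec![
        Tensor { name: "attn.0", bytes: 4, accesses_per_token: 10 },
        Tensor { name: "ffn.0", bytes: 8, accesses_per_token: 10 },
        Tensor { name: "expert.3", bytes: 8, accesses_per_token: 1 },
    ];
    // Tiny capacities (same made-up units) to force a spill to NVMe.
    println!("{:?}", place(tensors, 8, 8));
}
```

The real placer has to weigh measured tier bandwidth against tensor size as well, which is where the LP comes in; the greedy pass above only captures the "hot tensors first" intuition.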

Install

```sh
git clone --recurse-submodules https://github.com/t8/hypura.git
cd hypura
cargo build --release
```

Requires Rust 1.75+ and CMake. Apple Silicon only (M1/M2/M3/M4).