Commit 25e3dad

docs: update README with current architecture and benchmark language

Add ASCII art header, "Why does this matter?" section, and updated "How it works" with current streaming modes (expert-streaming, dense FFN-streaming), dynamic pool sizing, co-activation tracking, and 99.5% neuron cache hit rate.

1 parent 2e4073d

1 file changed: README.md (+47 -12 lines)
@@ -1,26 +1,57 @@
-# Hypura
-
-**Run models too big for your Mac's memory.**
+```
+ _   _
+| | | |_   _ _ __  _   _ _ __ __ _
+| |_| | | | | '_ \| | | | '__/ _` |
+|  _  | |_| | |_) | |_| | |  | (_| |
+|_| |_|\__, | .__/ \__,_|_|  \__,_|
+       |___/|_|
+
+Run models too big for your Mac's memory
+```


-Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU (Metal), RAM, and NVMe based on access patterns and hardware bandwidth — enabling you to run models that exceed your available memory where vanilla llama.cpp would OOM or thrash to a halt.
+Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon.
+It places model tensors across GPU, RAM, and NVMe tiers based on access
+patterns, bandwidth costs, and hardware capabilities — enabling models
+that exceed physical memory to run without crashing the system.


 Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.


-## How it works
+## Why does this matter?
+
+Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory
+and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load
+a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.
+
+Hypura solves this by understanding the model architecture:


-On Apple Silicon, GPU memory is unified with system RAM. When a model doesn't fit, llama.cpp either OOMs with GPU offloading or falls back to CPU-only with swap thrashing (~10+ min/token). Hypura adds a third option: **NVMe-tiered scheduling**.
+- **Norms and embeddings** are tiny but accessed every token — pinned to GPU.
+- **MoE expert routing** exploits sparsity — only 2 of 8 experts fire per token.
+  Router interception identifies the selected experts in the eval callback, then loads
+  only the needed expert strides from NVMe (a 75% I/O reduction). A neuron cache tracks
+  loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality.
+  Co-activation tracking predicts which experts will fire next for speculative prefetch.
+- **Dense FFN weights** (gate, up, down — ~60% of model size) stream from NVMe through
+  a dynamically-sized pool buffer while attention + norms stay GPU-resident. Prefetch
+  lookahead depth scales automatically with available memory.

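The co-activation tracking described above can be sketched as a pairwise counter over router decisions. A minimal Rust sketch; all names and the ranking policy are invented for illustration, not Hypura's actual implementation:

```rust
use std::collections::HashMap;

/// Pairwise co-activation counts between experts (illustrative sketch;
/// this is not Hypura's real data structure).
struct CoActivationTracker {
    counts: HashMap<(usize, usize), u64>,
}

impl CoActivationTracker {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    /// Record the experts the router selected for one token (e.g. 2 of 8).
    fn record(&mut self, selected: &[usize]) {
        for (i, &a) in selected.iter().enumerate() {
            for &b in &selected[i + 1..] {
                let key = if a < b { (a, b) } else { (b, a) };
                *self.counts.entry(key).or_insert(0) += 1;
            }
        }
    }

    /// Rank the remaining experts by how often they historically co-fired
    /// with the currently selected ones: candidates for speculative prefetch.
    fn prefetch_candidates(&self, selected: &[usize], n_experts: usize, top_k: usize) -> Vec<usize> {
        let mut scored: Vec<(u64, usize)> = (0..n_experts)
            .filter(|e| !selected.contains(e))
            .map(|e| {
                let score = selected
                    .iter()
                    .map(|&s| {
                        let key = if s < e { (s, e) } else { (e, s) };
                        *self.counts.get(&key).unwrap_or(&0)
                    })
                    .sum();
                (score, e)
            })
            .collect();
        scored.sort_by(|a, b| b.cmp(a)); // highest score first
        scored.into_iter().take(top_k).map(|(_, e)| e).collect()
    }
}

fn main() {
    let mut tracker = CoActivationTracker::new();
    tracker.record(&[1, 5]);
    tracker.record(&[3, 5]);
    tracker.record(&[1, 5]);
    // Experts 1 and 3 were just selected; 5 co-fired with both before.
    println!("{:?}", tracker.prefetch_candidates(&[1, 3], 8, 1)); // [5]
}
```

A real tracker would decay stale counts and bound the map, but the core signal is the same: experts that co-fired before are cheap bets to prefetch.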

-Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:
+The result: models that would crash your machine under naive mmap become runnable.
+Models that fit in memory run at full Metal GPU speed with zero overhead.
+
+## How it works
+
+Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth),
+and solves a placement optimization that assigns every tensor to a tier:


 - **GPU (Metal)** — Attention layers, norms, embeddings. Fastest access, limited by `recommendedMaxWorkingSetSize`.
 - **RAM** — Overflow layers that don't fit in the GPU working set. Accessed via mmap.
 - **NVMe** — Remaining layers loaded on-demand via direct I/O (`F_NOCACHE` + `pread`), prefetched ahead of the forward pass.

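The direct I/O path for the NVMe tier can be sketched in Rust. `read_exact_at` maps to `pread(2)` on Unix, and the `F_NOCACHE` constant (48) comes from macOS's `<sys/fcntl.h>`; the helper name and error handling are illustrative, not Hypura's API:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::os::unix::io::AsRawFd;

// F_NOCACHE value from macOS <sys/fcntl.h>; on other platforms the
// fcntl call below simply fails and is ignored.
const F_NOCACHE: i32 = 48;

extern "C" {
    fn fcntl(fd: i32, cmd: i32, arg: i32) -> i32;
}

/// Read `len` bytes at `offset` with the page cache bypassed, so streamed
/// tensors don't evict the GPU-resident working set. (Illustrative helper,
/// not Hypura's actual API.)
fn read_tensor_uncached(path: &str, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    // Best-effort: disable caching; ignore failure on non-macOS hosts.
    unsafe { fcntl(file.as_raw_fd(), F_NOCACHE, 1) };
    let mut buf = vec![0u8; len];
    file.read_exact_at(&mut buf, offset)?; // pread(2) under the hood
    Ok(buf)
}

fn main() -> io::Result<()> {
    std::fs::write("/tmp/tensor.bin", b"0123456789")?;
    let bytes = read_tensor_uncached("/tmp/tensor.bin", 2, 4)?;
    assert_eq!(bytes, b"2345");
    Ok(())
}
```

Uncached reads matter here because a naive `mmap` of a 40 GB file would compete with the GPU working set for the same unified memory.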

-Hypura selects the best inference mode automatically:
+Hypura selects the best inference mode automatically based on model size, architecture, and available memory:

-- **Keep-resident** — For small NVMe spill. Loads NVMe layers once, keeps them in memory. Zero disk I/O after the first pass.
+- **Full-resident** — Model fits in GPU+RAM. No NVMe I/O. Full Metal speed.
 - **Expert-streaming** — For MoE models (Mixtral). Only non-expert tensors (~1 GB) stay on GPU. Expert tensors stream from NVMe through a pool buffer on demand, with a neuron cache (99.5% hit rate) that eliminates most I/O after warmup.
-- **Dense FFN-streaming** — For dense models too large for GPU (Llama 70B). Attention + norms stay on GPU (~8 GB). FFN tensors (~32 GB) stream from NVMe through a pool buffer, with layer-ahead prefetch.
+- **Dense FFN-streaming** — For dense models too large for GPU (Llama 70B). Attention + norms stay on GPU (~8 GB). FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer, with scaled prefetch lookahead.
+
+Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
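As a sketch of what "computed automatically" can look like, here is a hypothetical sizing heuristic in Rust; the ratios and names are invented for illustration and are not Hypura's actual policy:

```rust
/// Derive streaming-pool slots and prefetch depth from free memory.
/// (Illustrative formula only; Hypura derives these from its own
/// hardware profile, not from this function.)
fn pool_config(free_bytes: u64, slot_bytes: u64) -> (u64, u64) {
    // Keep 25% headroom for the KV cache and allocator slack.
    let budget = free_bytes - free_bytes / 4;
    // At least two slots so a prefetch can overlap the active read.
    let slots = (budget / slot_bytes).max(2);
    // Spend at most half the slots reading ahead of the forward pass.
    let prefetch_depth = (slots / 2).max(1);
    (slots, prefetch_depth)
}

fn main() {
    // Example: ~24 GiB free, ~750 MiB per FFN slot.
    let (slots, depth) = pool_config(24 << 30, 750 << 20);
    println!("slots={slots}, prefetch_depth={depth}"); // slots=24, prefetch_depth=12
}
```

The point of deriving both values from one profile is that a 64 GB Mac Studio and a 32 GB MacBook Pro get appropriately different pools without any flags.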

 ## Performance

@@ -30,9 +61,9 @@ All benchmarks on **M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read
 |---|---|---|---|---|---|---|---|
 | Qwen 2.5 14B Q4_K_M | 8.4 GB | 8.4 GB || full-resident | **21 tok/s** | ~21 tok/s | Fits in GPU; no overhead |
 | Mixtral 8x7B Q5_K_M | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | **2.2 tok/s** | **OOM** | All layers on Metal; 99.5% cache hit rate |
-| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | **0.3 tok/s** | **OOM** | All layers on Metal; 3-layer prefetch lookahead |
+| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | **0.3 tok/s** | **OOM** | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |

-**Key takeaway:** For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B.
+**Key takeaway:** For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.

 ## Install

@@ -155,3 +186,7 @@ Hypura is a Cargo workspace with two crates:
 ## License

 MIT
+
+## Ethics
+
+I feel morally obligated to say I did *not* write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks based on my direction. The majority of prompts I used to get here were derived using the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized despite NVMe being a (slow but) perfectly valid form of memory.
