docs: update README with current architecture and benchmark language
Add ASCII art header, "Why does this matter?" section, and updated
"How it works" with current streaming modes (expert-streaming, dense
FFN-streaming), dynamic pool sizing, co-activation tracking, and
99.5% neuron cache hit rate.
README.md (47 additions, 12 deletions)
@@ -1,26 +1,57 @@

```
 _   _
| | | |_   _ _ __  _   _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|

 Run models too big for your Mac's memory
```

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

## Why does this matter?

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

- **Norms and embeddings** are tiny but accessed every token — pinned to GPU.
- **MoE expert routing** exploits sparsity — only 2 of 8 experts fire per token. Router interception identifies the selected experts in the eval callback, then loads only the needed expert strides from NVMe (75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving a 99.5% hit rate from temporal locality. Co-activation tracking predicts which experts will fire next for speculative prefetch.
- **Dense FFN weights** (gate, up, down — ~60% of model size) stream from NVMe through a dynamically-sized pool buffer while attention + norms stay GPU-resident. Prefetch lookahead depth scales automatically with available memory.
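The neuron-cache behavior described above can be pictured with a small sketch. This is an illustrative toy, not Hypura's actual code: the `NeuronCache` type, its `(layer, expert)` keying, and the closure-based loader are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Toy cache of expert slices keyed by (layer, expert).
/// A repeated expert selection costs no NVMe I/O; we track the hit rate.
struct NeuronCache {
    resident: HashMap<(usize, usize), Vec<u8>>,
    hits: u64,
    misses: u64,
}

impl NeuronCache {
    fn new() -> Self {
        Self { resident: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the expert slice, invoking `load` (a stand-in for an NVMe
    /// read) only on a miss.
    fn get(&mut self, layer: usize, expert: usize, load: impl FnOnce() -> Vec<u8>) -> &[u8] {
        if self.resident.contains_key(&(layer, expert)) {
            self.hits += 1;
        } else {
            self.misses += 1;
            self.resident.insert((layer, expert), load());
        }
        &self.resident[&(layer, expert)]
    }

    fn hit_rate(&self) -> f64 {
        self.hits as f64 / (self.hits + self.misses).max(1) as f64
    }
}

fn main() {
    let mut cache = NeuronCache::new();
    // Token 1 routes to experts 3 and 7 in layer 0; token 2 picks them again.
    for _token in 0..2 {
        for expert in [3, 7] {
            cache.get(0, expert, || vec![0u8; 16]);
        }
    }
    // Two misses on the first token, two hits on the second.
    println!("hit rate: {:.0}%", cache.hit_rate() * 100.0); // prints "hit rate: 50%"
}
```

Temporal locality is what makes this pay off: the same experts tend to fire on consecutive tokens, so after warmup nearly every lookup is a hit.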

The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.

## How it works

Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

- **RAM** — Overflow layers that don't fit in the GPU working set. Accessed via mmap.
- **NVMe** — Remaining layers loaded on-demand via direct I/O (`F_NOCACHE` + `pread`), prefetched ahead of the forward pass.
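The placement step can be pictured as a greedy ranking by access heat per byte that fills the fastest tiers first. This sketch is an assumption for illustration — Hypura's actual optimizer also weighs bandwidth costs and access patterns — and `Tensor`, `Tier`, `place`, and the byte budgets are invented names and numbers.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor {
    name: &'static str,
    bytes: u64,
    accesses_per_token: u32, // how often the forward pass touches it
}

/// Greedy placement: hottest tensors (accesses per byte) claim the
/// fastest tier that still has budget; everything else lands on NVMe.
fn place(tensors: &mut Vec<Tensor>, gpu_budget: u64, ram_budget: u64) -> Vec<(String, Tier)> {
    tensors.sort_by(|a, b| {
        let ha = a.accesses_per_token as f64 / a.bytes as f64;
        let hb = b.accesses_per_token as f64 / b.bytes as f64;
        hb.partial_cmp(&ha).unwrap()
    });
    let (mut gpu_used, mut ram_used) = (0u64, 0u64);
    tensors.iter().map(|t| {
        let tier = if gpu_used + t.bytes <= gpu_budget {
            gpu_used += t.bytes;
            Tier::Gpu
        } else if ram_used + t.bytes <= ram_budget {
            ram_used += t.bytes;
            Tier::Ram
        } else {
            Tier::Nvme
        };
        (t.name.to_string(), tier)
    }).collect()
}

fn main() {
    let mut tensors = vec![
        Tensor { name: "output_norm", bytes: 16_384, accesses_per_token: 1 },
        Tensor { name: "blk.0.ffn_gate", bytes: 500_000_000, accesses_per_token: 1 },
        Tensor { name: "token_embd", bytes: 200_000_000, accesses_per_token: 1 },
    ];
    // The tiny norm and the embeddings fit on GPU; the big FFN tensor
    // exceeds both budgets and spills to NVMe.
    println!("{:?}", place(&mut tensors, 250_000_000, 100_000_000));
}
```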

Hypura selects the best inference mode automatically based on model size, architecture, and available memory:

- **Full-resident** — Model fits in GPU+RAM. No NVMe I/O. Full Metal speed.
- **Expert-streaming** — For MoE models (Mixtral). Only non-expert tensors (~1 GB) stay on GPU. Expert tensors stream from NVMe through a pool buffer on demand, with a neuron cache (99.5% hit rate) that eliminates most I/O after warmup.
- **Dense FFN-streaming** — For dense models too large for GPU (Llama 70B). Attention + norms stay on GPU (~8 GB). FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer, with scaled prefetch lookahead.

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
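That selection can be sketched as a simple decision over the hardware profile. The `Profile` fields and budget numbers below are invented for illustration, not Hypura's actual decision code.

```rust
#[derive(Debug, PartialEq)]
enum Mode { FullResident, ExpertStreaming, DenseFfnStreaming }

struct Profile {
    model_bytes: u64,
    gpu_plus_ram_bytes: u64,
    is_moe: bool, // does the architecture have routed experts?
}

fn select_mode(p: &Profile) -> Mode {
    if p.model_bytes <= p.gpu_plus_ram_bytes {
        Mode::FullResident // everything fits: full Metal speed, no NVMe I/O
    } else if p.is_moe {
        Mode::ExpertStreaming // sparsity: stream only the selected experts
    } else {
        Mode::DenseFfnStreaming // pin attention + norms, stream FFN weights
    }
}

fn main() {
    const GIB: u64 = 1 << 30;
    // A 31 GB Mixtral with ~28 GB of usable GPU+RAM: stream experts.
    let mixtral = Profile { model_bytes: 31 * GIB, gpu_plus_ram_bytes: 28 * GIB, is_moe: true };
    assert_eq!(select_mode(&mixtral), Mode::ExpertStreaming);
    // A 40 GB dense Llama 70B on the same machine: stream the FFN weights.
    let llama = Profile { model_bytes: 40 * GIB, gpu_plus_ram_bytes: 28 * GIB, is_moe: false };
    assert_eq!(select_mode(&llama), Mode::DenseFfnStreaming);
    println!("ok");
}
```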

## Performance

@@ -30,9 +61,9 @@ All benchmarks on **M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read

| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | **0.3 tok/s** | **OOM** | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |

**Key takeaway:** For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.

## Install

@@ -155,3 +186,7 @@ Hypura is a Cargo workspace with two crates:

## License

MIT

## Ethics

I feel morally obligated to say that I did *not* write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks based on my direction. The majority of the prompts I used to get here were derived through the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized despite being a (slow but) perfectly valid form of memory.