Commit 9f3cf9e

alyssapowell and claude committed
add README with examples, usage guide, and hardware notes
Covers: quick start, configurable checks, pain calculator, system monitor, GPU memory management.
Examples: model swap, adaptive selection, embedding conflict guard, continuous monitoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 9f19948 commit 9f3cf9e

6 files changed

Lines changed: 270 additions & 0 deletions

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
__pycache__/
*.pyc
.pytest_cache/
*.egg-info/
dist/
build/

README.md

Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
# mlx-halo

Pre-flight safety checks for MLX models on Apple Silicon. Prevents kernel panics.

Named after F1's Halo cockpit protection device: invisible in normal operation, life-saving when things go wrong.

## The Problem

Loading MLX models on Apple Silicon can cause kernel panics. The failure modes are undocumented:

- **PyTorch and MLX cannot share the Metal GPU simultaneously.** Loading a sentence-transformers model (PyTorch/Metal) while an MLX model is active corrupts the Metal heap.
- **Metal frees memory lazily.** After unloading a model, GPU memory takes 5-10 seconds to actually release. Loading the next model during this window causes overlapping allocations.
- **Thermal state affects drain time.** Hot silicon is slower to release memory. A settling time that works at 60°C fails at 90°C.
- **No error, just a kernel panic.** There's no exception, no warning. The machine reboots.

mlx-halo catches these conditions before they crash your system.

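
To make the interaction between lazy frees and temperature concrete, here is a minimal sketch of a settling time that stretches as the chip gets hotter. It is not mlx-halo's implementation: `settle_seconds`, its parameters, and the linear scaling are assumptions for illustration. Only the 5-10 second range comes from the list above.

```python
def settle_seconds(thermal_pain: float, base: float = 5.0, ceiling: float = 10.0) -> float:
    """Hypothetical helper: scale the wait from `base` to `ceiling` seconds
    as thermal pain goes from 0.0 (cool) to 1.0 (hot)."""
    clamped = min(max(thermal_pain, 0.0), 1.0)  # tolerate out-of-range inputs
    return base + (ceiling - base) * clamped
```

A cool machine would wait the base time, a hot one up to the ceiling, so a single fixed delay is never trusted across temperatures.
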
## Install

```bash
pip install mlx-halo
```

MLX is an optional dependency (for GPU memory checks):

```bash
pip install "mlx-halo[mlx]"
```

## Quick Start

```python
from mlx_halo import preflight

# Before loading any MLX model:
result = preflight(model_size_gb=8.0)
# Returns HaloResult if safe
# Raises MemoryError if unsafe
```

## What It Checks

Five sequential safety gates, fail-fast:

| Check | What | Why |
|-------|------|-----|
| **Conflict** | Is a conflicting framework (e.g. PyTorch) holding the Metal GPU? | PyTorch/Metal and MLX/Metal corrupt each other's heap |
| **VRAM Drain** | Has the previous model's memory fully released? | Metal lazy deallocation: overlapping allocations panic |
| **Zombie** | Are there stale model references keeping memory pinned? | Prevents ghost allocations that block new loads |
| **Pain** | Is the system under thermal/memory pressure? | High pressure + model load = panic territory |
| **Headroom** | Is there enough free VRAM for this model + safety margin? | Loading into insufficient space corrupts the allocator |

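
The "sequential, fail-fast" behaviour in the table can be pictured with a small sketch. This is an illustration under assumptions, not the library's code: `run_gates` and the `Gate` alias are hypothetical names, and each gate is modelled as a callable that returns `True` when it passes.

```python
from typing import Callable, List, Tuple

# (gate name, callable returning True when the gate passes)
Gate = Tuple[str, Callable[[], bool]]

def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order and raise MemoryError at the first failure,
    so later (possibly more expensive) gates are never evaluated."""
    passed = []
    for name, check in gates:
        if not check():
            raise MemoryError(f"{name} check failed")
        passed.append(name)
    return passed
```

Ordering matters in a design like this: a cheap check such as a framework conflict can stop the sequence before any memory measurement is taken.
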
## Usage

### Basic: One Call

```python
from mlx_halo import preflight

try:
    result = preflight(model_size_gb=8.0)
    print(f"Safe to load (pain={result.pain_score:.2f})")
    # proceed with mlx_lm.load(...)
except MemoryError as e:
    print(f"Unsafe: {e}")
    # fall back to API model
```

### Configurable

```python
from mlx_halo import HaloCheck

# Placeholders for your application's own state (not part of mlx-halo)
pytorch_model = None
stale_ref = None

halo = HaloCheck(
    total_vram_gb=32,        # Auto-detected if omitted
    safety_margin_gb=3.0,    # Extra headroom beyond model size
    pain_threshold=0.7,      # Max acceptable pain score
    conflict_check=lambda: pytorch_model is not None,
    zombie_check=lambda: stale_ref is not None,
)

result = halo.check_all(estimated_model_gb=18.0)
```

### Pain Score

The pain calculator quantifies system stress as a single 0.0-1.0 score:

```python
from mlx_halo import get_current_pain

pain = get_current_pain()
print(f"Pain: {pain.pain_score:.2f}")
print(f"  Thermal: {pain.thermal_pain:.2f}  Crisis: {pain.thermal_crisis}")
print(f"  RAM:     {pain.ram_pain:.2f}  Crisis: {pain.ram_crisis}")
print(f"  VRAM:    {pain.vram_pain:.2f}  Crisis: {pain.vram_crisis}")
```

| Range | Status | Recommendation |
|-------|--------|----------------|
| 0.0-0.3 | GREEN | Safe for large models |
| 0.3-0.7 | YELLOW | Use medium models, monitor closely |
| 0.7-1.0 | RED | Refuse local loads, use API/cloud |

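
The table can be applied in code with a small helper. This is a sketch: `pain_status` is a hypothetical name, not an mlx-halo function. The ranges in the table share their endpoints, so the handling of exactly 0.3 and 0.7 below is an assumption, chosen to match the `< 0.3` / `< 0.7` comparisons used in the monitoring loop example later in this README.

```python
def pain_status(score: float) -> str:
    """Map a 0.0-1.0 pain score to the GREEN/YELLOW/RED bands in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"pain score out of range: {score}")
    if score < 0.3:
        return "GREEN"
    if score < 0.7:
        return "YELLOW"
    return "RED"
```
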
Custom thresholds for your hardware:

```python
from mlx_halo import PainCalculator

calc = PainCalculator(
    thermal_comfort=60.0,    # °C where pain starts (default 70)
    thermal_max=95.0,        # °C at max pain (default 100)
    vram_comfort_gb=8.0,     # GB where VRAM pain starts (default 12)
    vram_max_gb=28.0,        # GB at max VRAM pain (default 20)
    thermal_weight=0.5,      # Weight in overall score (default 0.4)
    ram_weight=0.3,          # (default 0.3)
    vram_weight=0.2,         # (default 0.3)
)
```

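
One plausible reading of these parameters is a linear ramp from "comfort" to "max" for each resource, combined by the weights. The sketch below illustrates that reading only. `ramp` and `combined_pain` are hypothetical helpers, and mlx-halo may compute its score differently.

```python
def ramp(value: float, comfort: float, maximum: float) -> float:
    """0.0 at or below `comfort`, 1.0 at or above `maximum`, linear in between."""
    if maximum <= comfort:
        raise ValueError("maximum must exceed comfort")
    return min(max((value - comfort) / (maximum - comfort), 0.0), 1.0)

def combined_pain(thermal: float, ram: float, vram: float,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of per-resource pain values, each already in 0.0-1.0.
    The default weights mirror the custom values in the snippet above."""
    w_thermal, w_ram, w_vram = weights
    return w_thermal * thermal + w_ram * ram + w_vram * vram

# Example: 77.5 °C and 18 GB each sit halfway up their ramps.
thermal = ramp(77.5, comfort=60.0, maximum=95.0)
vram = ramp(18.0, comfort=8.0, maximum=28.0)
score = combined_pain(thermal, ram=0.0, vram=vram)
```

With weights that sum to 1.0, the combined score stays in the same 0.0-1.0 range as its inputs, which keeps a single `pain_threshold` meaningful.
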
### System Monitor

Raw hardware metrics without the pain abstraction:

```python
from mlx_halo import get_monitor

monitor = get_monitor()
print(f"CPU: {monitor.get_cpu_usage():.1f}%")
print(f"Temp: {monitor.get_cpu_temperature():.1f}°C")
print(f"RAM: {monitor.get_ram_usage()['percent']:.1f}%")
print(f"VRAM: {monitor.get_gpu_vram():.2f} GB")
print(f"Throttling: {monitor.is_thermal_throttling()}")
```

### GPU Memory Management

Direct control over Metal GPU memory:

```python
from mlx_halo import get_gpu_memory_status, clear_gpu_cache, wait_for_memory_drain

# Check current state
status = get_gpu_memory_status()
print(f"Active: {status.active_gb:.2f} GB")
print(f"Cache: {status.cache_gb:.2f} GB")
print(f"Available: {status.available_gb:.2f} GB")

# After unloading a model, wait for Metal to actually free the memory
clear_gpu_cache()
drained = wait_for_memory_drain(
    baseline_gb=2.0,       # Target memory level
    settling_time=5.0,     # Seconds to hold below baseline
    verbose=True,
)
```

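
The idea of "holding below a baseline for a settling time" can be sketched as a polling loop. This is a minimal illustration, not the body of `wait_for_memory_drain`: `wait_until_drained`, the `timeout` and `poll_interval` parameters, and the injected `read_active_gb` callable are all assumptions made so the sketch runs without MLX.

```python
import time
from typing import Callable

def wait_until_drained(read_active_gb: Callable[[], float],
                       baseline_gb: float = 2.0,
                       settling_time: float = 5.0,
                       timeout: float = 30.0,
                       poll_interval: float = 0.5) -> bool:
    """Return True once memory has stayed at or below `baseline_gb` for
    `settling_time` seconds; return False if `timeout` elapses first."""
    start = time.monotonic()
    below_since = None
    while True:
        now = time.monotonic()
        if read_active_gb() <= baseline_gb:
            if below_since is None:
                below_since = now  # first reading under the baseline
            if now - below_since >= settling_time:
                return True
        else:
            below_since = None  # memory rose again, restart the hold timer
        if now - start >= timeout:
            return False
        time.sleep(poll_interval)
```

Requiring a sustained hold, not a single low reading, is what guards against loading during a brief dip while Metal is still releasing memory. With MLX installed, the reader could wrap whatever active-memory query your MLX version provides, converted from bytes to GB.
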
## Examples

### Model Swap (Unload → Drain → Check → Load)

```python
from mlx_lm import load
from mlx_halo import (
    preflight,
    clear_gpu_cache,
    get_current_pain,
    wait_for_memory_drain,
)

# Unload current model (assumes `model` and `tokenizer` were loaded earlier)
del model
del tokenizer
clear_gpu_cache()

# Wait for Metal to release memory (thermal-adaptive)
pain = get_current_pain()
wait_for_memory_drain(thermal_pain=pain.thermal_pain, verbose=True)

# Safety check before loading next model (raises MemoryError if unsafe)
preflight(model_size_gb=8.0)

# Safe to load
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
```

### Adaptive Model Selection

```python
from mlx_halo import get_current_pain, get_gpu_memory_status

def select_model():
    pain = get_current_pain()
    mem = get_gpu_memory_status()

    if pain.pain_score > 0.7 or pain.thermal_crisis:
        return None  # Use API, system too stressed

    if mem.available_gb > 20:
        return "mlx-community/Qwen2.5-32B-Instruct-4bit"  # ~18GB
    elif mem.available_gb > 10:
        return "mlx-community/Qwen2.5-7B-Instruct-4bit"   # ~5GB
    elif mem.available_gb > 5:
        return "mlx-community/Phi-4-mini-instruct-4bit"   # ~3GB
    else:
        return None  # Not enough room
```

### Embedding Model Conflict Guard

```python
from sentence_transformers import SentenceTransformer
from mlx_halo import HaloCheck

# Track whether PyTorch embeddings are loaded
embedder = None

def load_embedder():
    global embedder
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

def unload_embedder():
    global embedder
    del embedder
    embedder = None

# Halo knows to check for the conflict
halo = HaloCheck(
    conflict_check=lambda: embedder is not None,
)

# This will raise MemoryError if embedder is still loaded:
halo.check_all(estimated_model_gb=8.0)
```

### Continuous Monitoring Loop

```python
import time
from mlx_halo import get_current_pain

while True:
    pain = get_current_pain()
    status = "OK" if pain.pain_score < 0.3 else "WARN" if pain.pain_score < 0.7 else "CRIT"
    print(f"[{status}] pain={pain.pain_score:.2f} "
          f"thermal={pain.thermal_pain:.2f} "
          f"ram={pain.ram_pain:.2f} "
          f"vram={pain.vram_pain:.2f}")

    if pain.thermal_crisis:
        print("  THERMAL CRISIS: unload models immediately")
    time.sleep(10)  # poll every 10 seconds
```

## Hardware Compatibility

Tested on Apple Silicon M1-M4 (MacBook Air, MacBook Pro, Mac Mini, Mac Studio). Thermal monitoring uses `powermetrics`, which requires sudo. Without it, temperature is estimated from CPU load (less accurate but functional).

For accurate thermal monitoring, configure passwordless sudo for powermetrics:

```bash
echo "$USER ALL=(ALL) NOPASSWD: /usr/bin/powermetrics" | sudo tee /etc/sudoers.d/powermetrics
```

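
To see whether the sudoers rule took effect, a script can ask `sudo` non-interactively whether the command is permitted. This is a sketch under assumptions: `powermetrics_without_password` is a hypothetical helper, not part of mlx-halo. It relies on `sudo -n -l <command>` exiting with status 0 only when the command is allowed without a prompt, and it will also return `True` while cached sudo credentials are still valid.

```python
import subprocess

def powermetrics_without_password(path: str = "/usr/bin/powermetrics") -> bool:
    """Return True if `sudo` reports it can run `path` without prompting."""
    try:
        proc = subprocess.run(
            ["sudo", "-n", "-l", path],   # -n: never prompt, -l: list/check only
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # sudo missing, not executable, or unresponsive
    return proc.returncode == 0
```

The check never executes `powermetrics` itself, so it is safe to run at startup to decide between real temperature readings and the CPU-load estimate.
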
## License

[Liberation License v1.0](LICENSE.md)
-13.6 KB
Binary file not shown.
-26.3 KB
Binary file not shown.
-15.9 KB
Binary file not shown.
-9.54 KB
Binary file not shown.
