# mlx-halo

Pre-flight safety checks for MLX models on Apple Silicon. Prevents kernel panics.

Named after F1's Halo cockpit protection device: invisible in normal operation, life-saving when things go wrong.

## The Problem

Loading MLX models on Apple Silicon can cause kernel panics. The failure modes are undocumented:

- **PyTorch and MLX cannot share the Metal GPU simultaneously.** Loading a sentence-transformers model (PyTorch/Metal) while an MLX model is active corrupts the Metal heap.
- **Metal frees memory lazily.** After a model is unloaded, GPU memory takes 5-10 seconds to actually release. Loading the next model during this window causes overlapping allocations.
- **Thermal state affects drain time.** Hot silicon is slower to release memory. A settling time that works at 60°C fails at 90°C.
- **No error, just a kernel panic.** There is no exception and no warning. The machine reboots.

mlx-halo catches these conditions before they crash your system.

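
The lazy-free and thermal points interact: the hotter the machine, the longer the wait after an unload should be. As a minimal sketch of that idea, the hypothetical helper below stretches the wait across the 5-10 second range quoted above. The name `settling_seconds`, its parameters, and the linear scaling are assumptions made for illustration, not mlx-halo's implementation.

```python
def settling_seconds(thermal_pain: float, base: float = 5.0, worst: float = 10.0) -> float:
    """Hypothetical helper: scale the post-unload wait with thermal stress.

    thermal_pain is a 0.0-1.0 score; 0.0 waits `base` seconds, 1.0 waits `worst`.
    Values outside that range are clamped.
    """
    clamped = min(max(thermal_pain, 0.0), 1.0)
    return base + (worst - base) * clamped


print(settling_seconds(0.0))  # cool machine: shortest wait
print(settling_seconds(0.8))  # hot machine: wait close to the upper bound
```
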
## Install

```bash
pip install mlx-halo
```

MLX is an optional dependency (for GPU memory checks):

```bash
pip install "mlx-halo[mlx]"
```

## Quick Start

```python
from mlx_halo import preflight

# Before loading any MLX model:
result = preflight(model_size_gb=8.0)
# Returns HaloResult if safe
# Raises MemoryError if unsafe
```

## What It Checks

Five sequential safety gates, fail-fast:

| Check | What | Why |
|-------|------|-----|
| **Conflict** | Is a conflicting framework (e.g. PyTorch) holding the Metal GPU? | PyTorch/Metal and MLX/Metal corrupt each other's heap |
| **VRAM Drain** | Has the previous model's memory fully released? | Metal deallocates lazily; overlapping allocations panic |
| **Zombie** | Are there stale model references keeping memory pinned? | Prevents ghost allocations that block new loads |
| **Pain** | Is the system under thermal/memory pressure? | High pressure + model load = panic territory |
| **Headroom** | Is there enough free VRAM for this model + safety margin? | Loading into insufficient space corrupts the allocator |

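
"Sequential, fail-fast" means the gates run in a fixed order and the first failure stops the run, so later gates are never evaluated. The sketch below illustrates that pattern in plain Python. `run_gates`, the `Gate` type, and the toy `record` gates are hypothetical names for this example and say nothing about how mlx-halo is structured internally; only the use of `MemoryError` is taken from the Quick Start above.

```python
from typing import Callable, List, Optional, Tuple

# Each gate returns None when it passes, or a short reason string when it fails.
Gate = Tuple[str, Callable[[], Optional[str]]]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order; raise MemoryError at the first failure (fail-fast)."""
    passed = []
    for name, check in gates:
        reason = check()
        if reason is not None:
            raise MemoryError(f"{name} check failed: {reason}")
        passed.append(name)
    return passed


calls = []


def record(name, reason=None):
    """Build a toy gate that records when it runs and returns a fixed outcome."""
    def check():
        calls.append(name)
        return reason
    return check


gates = [
    ("conflict", record("conflict")),
    ("vram_drain", record("vram_drain", "previous model still resident")),
    ("headroom", record("headroom")),  # never reached: the gate before it fails
]

try:
    run_gates(gates)
except MemoryError as exc:
    print(exc)
print(calls)  # only the gates that actually ran
```

Ordering cheap checks first keeps the common failure cases fast and avoids touching the GPU when an earlier condition already rules out the load.
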
## Usage

### Basic: One Call

```python
from mlx_halo import preflight

try:
    result = preflight(model_size_gb=8.0)
    print(f"Safe to load (pain={result.pain_score:.2f})")
    # proceed with mlx_lm.load(...)
except MemoryError as e:
    print(f"Unsafe: {e}")
    # fall back to API model
```

### Configurable

```python
from mlx_halo import HaloCheck

# pytorch_model and stale_ref stand in for references held by your own application.
halo = HaloCheck(
    total_vram_gb=32,            # Auto-detected if omitted
    safety_margin_gb=3.0,        # Extra headroom beyond model size
    pain_threshold=0.7,          # Max acceptable pain score
    conflict_check=lambda: pytorch_model is not None,
    zombie_check=lambda: stale_ref is not None,
)

result = halo.check_all(estimated_model_gb=18.0)
```

### Pain Score

The pain calculator quantifies system stress as a single 0.0-1.0 score:

```python
from mlx_halo import get_current_pain

pain = get_current_pain()
print(f"Pain: {pain.pain_score:.2f}")
print(f"  Thermal: {pain.thermal_pain:.2f}  Crisis: {pain.thermal_crisis}")
print(f"  RAM:     {pain.ram_pain:.2f}  Crisis: {pain.ram_crisis}")
print(f"  VRAM:    {pain.vram_pain:.2f}  Crisis: {pain.vram_crisis}")
```

| Range | Status | Recommendation |
|-------|--------|----------------|
| 0.0-0.3 | GREEN | Safe for large models |
| 0.3-0.7 | YELLOW | Use medium models, monitor closely |
| 0.7-1.0 | RED | Refuse local loads, use API/cloud |

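
The bands in the table translate directly into a small lookup. The sketch below assumes the shared boundaries resolve upward (0.3 counts as YELLOW, 0.7 as RED), which matches the `< 0.3` / `< 0.7` comparisons used in the monitoring example later in this README. `pain_status` is a hypothetical helper, not part of the mlx-halo API.

```python
def pain_status(score: float) -> str:
    """Map a 0.0-1.0 pain score to the GREEN / YELLOW / RED bands.

    Boundary handling is an assumption: 0.3 counts as YELLOW and 0.7 as RED.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"pain score out of range: {score}")
    if score < 0.3:
        return "GREEN"
    if score < 0.7:
        return "YELLOW"
    return "RED"


for sample in (0.1, 0.5, 0.9):
    print(sample, pain_status(sample))
```
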
Custom thresholds for your hardware:

```python
from mlx_halo import PainCalculator

calc = PainCalculator(
    thermal_comfort=60.0,   # °C where pain starts (default 70)
    thermal_max=95.0,       # °C at max pain (default 100)
    vram_comfort_gb=8.0,    # GB where VRAM pain starts (default 12)
    vram_max_gb=28.0,       # GB at max VRAM pain (default 20)
    thermal_weight=0.5,     # Weight in overall score (default 0.4)
    ram_weight=0.3,         # (default 0.3)
    vram_weight=0.2,        # (default 0.3)
)
```

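
To build intuition for what the comfort/max pairs and the weights do, here is one plausible way such a score could be assembled: each metric ramps linearly from 0 at its comfort point to 1 at its maximum, and the three results are combined as a weighted average. This is a sketch under that assumption only. `ramp` and `combined_pain` are hypothetical names, and the formula `PainCalculator` actually uses may differ.

```python
def ramp(value: float, comfort: float, maximum: float) -> float:
    """Linear 0.0-1.0 ramp: 0 at or below `comfort`, 1 at or above `maximum`."""
    if maximum <= comfort:
        raise ValueError("maximum must be greater than comfort")
    return min(max((value - comfort) / (maximum - comfort), 0.0), 1.0)


def combined_pain(thermal: float, ram: float, vram: float,
                  weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted average of three component scores, normalised by the weight sum."""
    w_t, w_r, w_v = weights
    total = w_t + w_r + w_v
    return (w_t * thermal + w_r * ram + w_v * vram) / total


# Defaults quoted in the comments above: 70-100 °C and 12-20 GB.
thermal = ramp(85.0, comfort=70.0, maximum=100.0)  # halfway up the thermal ramp
vram = ramp(16.0, comfort=12.0, maximum=20.0)      # halfway up the VRAM ramp
print(round(combined_pain(thermal, ram=0.0, vram=vram), 2))
```

Under this model, lowering `thermal_comfort` makes the score start rising earlier, while raising `thermal_weight` makes temperature dominate the overall result.
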
### System Monitor

Raw hardware metrics without the pain abstraction:

```python
from mlx_halo import get_monitor

monitor = get_monitor()
print(f"CPU: {monitor.get_cpu_usage():.1f}%")
print(f"Temp: {monitor.get_cpu_temperature():.1f}°C")
print(f"RAM: {monitor.get_ram_usage()['percent']:.1f}%")
print(f"VRAM: {monitor.get_gpu_vram():.2f} GB")
print(f"Throttling: {monitor.is_thermal_throttling()}")
```

### GPU Memory Management

Direct control over Metal GPU memory:

```python
from mlx_halo import get_gpu_memory_status, clear_gpu_cache, wait_for_memory_drain

# Check current state
status = get_gpu_memory_status()
print(f"Active: {status.active_gb:.2f} GB")
print(f"Cache: {status.cache_gb:.2f} GB")
print(f"Available: {status.available_gb:.2f} GB")

# After unloading a model, wait for Metal to actually free the memory
clear_gpu_cache()
drained = wait_for_memory_drain(
    baseline_gb=2.0,        # Target memory level
    settling_time=5.0,      # Seconds to hold below baseline
    verbose=True,
)
```

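
The `baseline_gb` / `settling_time` pair describes a "stay below the target for N seconds" condition, not a single reading. A minimal sketch of that kind of polling loop follows. `wait_until_settled` and its `timeout`, `poll`, `clock`, and `sleep` parameters are hypothetical and exist only for this illustration; the memory reader is injected so the example runs without a GPU, and a fake clock makes it finish instantly.

```python
import time
from typing import Callable


def wait_until_settled(read_gb: Callable[[], float], baseline_gb: float,
                       settling_time: float, timeout: float = 30.0,
                       poll: float = 0.5,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once read_gb() has stayed at or below baseline_gb for
    settling_time seconds in a row; return False if timeout elapses first."""
    start = clock()
    below_since = None
    while clock() - start < timeout:
        now = clock()
        if read_gb() <= baseline_gb:
            if below_since is None:
                below_since = now
            if now - below_since >= settling_time:
                return True
        else:
            below_since = None  # memory rose again: restart the settling window
        sleep(poll)
    return False


# Simulated run: memory falls from 6 GB to 1.5 GB, then stays there.
fake_now = [0.0]
readings = iter([6.0, 4.0, 1.5, 1.5, 1.5, 1.5, 1.5])


def fake_sleep(seconds):
    fake_now[0] += seconds


drained = wait_until_settled(lambda: next(readings, 1.5), baseline_gb=2.0,
                             settling_time=1.0, poll=0.5,
                             clock=lambda: fake_now[0], sleep=fake_sleep)
print(drained)
```

Resetting the window whenever a reading rises above the baseline is what separates a settled state from a momentary dip.
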
## Examples

### Model Swap (Unload → Drain → Check → Load)

```python
from mlx_lm import load
from mlx_halo import preflight, clear_gpu_cache, get_current_pain, wait_for_memory_drain

# Unload current model (assumes `model` and `tokenizer` were loaded earlier)
del model
del tokenizer
clear_gpu_cache()

# Wait for Metal to release memory (thermal-adaptive)
pain = get_current_pain()
wait_for_memory_drain(thermal_pain=pain.thermal_pain, verbose=True)

# Safety check before loading next model (raises MemoryError if unsafe)
preflight(model_size_gb=8.0)

# Safe to load
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
```

### Adaptive Model Selection

```python
from mlx_halo import get_current_pain, get_gpu_memory_status

def select_model():
    pain = get_current_pain()
    mem = get_gpu_memory_status()

    if pain.pain_score > 0.7 or pain.thermal_crisis:
        return None  # Use API, system too stressed

    if mem.available_gb > 20:
        return "mlx-community/Qwen2.5-32B-Instruct-4bit"  # ~18GB
    elif mem.available_gb > 10:
        return "mlx-community/Qwen2.5-7B-Instruct-4bit"   # ~5GB
    elif mem.available_gb > 5:
        return "mlx-community/Phi-4-mini-instruct-4bit"   # ~3GB
    else:
        return None  # Not enough room
```

### Embedding Model Conflict Guard

```python
from sentence_transformers import SentenceTransformer
from mlx_halo import HaloCheck

# Track whether PyTorch embeddings are loaded
embedder = None

def load_embedder():
    global embedder
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

def unload_embedder():
    global embedder
    del embedder
    embedder = None

# Halo knows to check for the conflict
halo = HaloCheck(
    conflict_check=lambda: embedder is not None,
)

# This will raise MemoryError if embedder is still loaded:
halo.check_all(estimated_model_gb=8.0)
```

### Continuous Monitoring Loop

```python
import time
from mlx_halo import get_current_pain

while True:
    pain = get_current_pain()
    status = "OK" if pain.pain_score < 0.3 else "WARN" if pain.pain_score < 0.7 else "CRIT"
    print(f"[{status}] pain={pain.pain_score:.2f} "
          f"thermal={pain.thermal_pain:.2f} "
          f"ram={pain.ram_pain:.2f} "
          f"vram={pain.vram_pain:.2f}")

    if pain.thermal_crisis:
        print("  THERMAL CRISIS: unload models immediately")
    time.sleep(10)
```

## Hardware Compatibility

Tested on Apple Silicon M1-M4 (MacBook Air, MacBook Pro, Mac Mini, Mac Studio). Thermal monitoring uses `powermetrics`, which requires sudo. Without sudo, temperature is estimated from CPU load (less accurate but functional).

For accurate thermal monitoring, configure passwordless sudo for powermetrics:

```bash
echo "$USER ALL=(ALL) NOPASSWD: /usr/bin/powermetrics" | sudo tee /etc/sudoers.d/powermetrics
```

## License

[Liberation License v1.0](LICENSE.md)