Commit 0042d83

Harper authored and committed
feat(v0.11.0): 7 Harper-private ports + KNOWN_PANIC_MODELS upgrade + PEP 639 hotfix
Combined release: install hotfix + second-wave defence ports informed by the 2026-04-27 community sweep (mlx-lm#1185 / #1206 / mlx-vlm#1064 / omlx#578 / #862 / #902).

Hotfix: removed `License :: OSI Approved :: MIT License` legacy classifier from pyproject.toml. Modern setuptools (≥80) rejects the SPDX expression + classifier combo with `InvalidConfigError: License classifiers have been superseded by license expressions`. Every CI run since v0.9.0 (2026-04-24) was failing on `pip install -e .` and `pip install git+https://...@v0.10.0` for this reason.

New layers:

- error_classifier — central regex table for 6 severities (kernel_panic / process_abort / command_buffer_oom / gpu_hang / gpu_page_fault / descriptor_leak). SubprocessCrashError now exposes `error_class` + `recovery_hint` for caller routing.
- L10b scan_recent_aborts — sibling to scan_recent_panics. Reads `~/Library/Logs/DiagnosticReports/*.ips` (where macOS writes process aborts) plus cascade-into-panic files. Default 24h window vs panics' 72h. CooldownVerdict.abort_count_24h informational, does NOT influence exit_code (lockout reserved for kernel panics).
- L13b apple_gpu_family — reads `mx.device_info()` and maps applegpu_g13/g14/g15/g16/g17 → M1/M2/M3/M4/M5. Surfaces `resource_limit` (mlx-lm#1185 descriptor cap, 499000 on M1 Ultra).
- L14 ResourceTracker — descriptor-leak heuristic: counts inferences since cold-restart, threshold 4000 (env override), kill switch. Subprocess respawn is the only known way to release accumulated Metal descriptors per mlx-lm#1185.
- breadcrumb_with_meta — structured breadcrumb format `[ts] TAG: payload | k=v k=v` for richer postmortem forensics. Fixed L11 `_BREADCRUMB_LINE_RE` greedy `.*$` regex that would have swallowed meta into payload (FIFO-pairing bug Harper-local critic caught earlier).
KNOWN_PANIC_MODELS schema upgrade:

- Added `tier` (panic / abort / degradation), `error_classes[]` (with type/signature/first_seen_via/hardware/gpu_family/workload/mitigation), `verified_safe_alternative`. Legacy fields preserved so v0.10 callers continue to work.
- 4 new entries: Qwen3.5-27B-4bit (descriptor leak), Qwen3.5-35B-A3B-8bit (descriptor leak + streaming abort), Qwen3.6-35B-A3B-8bit (DFlash abort), Qwen3-VL-2B-Instruct (M5 Max GPU hang + page fault).
- New helpers: check_known_panic_model_for_gpu(model, gpu_family), models_by_tier(tier), models_affecting_gpu_family(family).

Critic R1 caught 1 P0 (scan_recent_aborts only globbed *.panic, missing real *.ips abort reports — 2 regression tests added) + 4 P1 (3 fixed inline: pipe sanitization in breadcrumb_with_meta, schema sanity test locking tier vocab + signature drift detection, signature drift fix on Qwen3.6 entry; 1 deferred: tier secondary-class lookup). R2 GO 0 P0/P1 blocking.

207 tests pass (44 v0.11 layer tests + 163 v0.10 baseline, excluding 18 pre-existing test_metal_guard.py failures unrelated to this release).
1 parent 49de0b7 commit 0042d83

8 files changed

Lines changed: 1576 additions & 29 deletions

CHANGELOG.md

Lines changed: 144 additions & 0 deletions
@@ -5,6 +5,150 @@ All notable changes to **metal-guard** are documented here.
The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
and this project adheres to [Semantic Versioning](https://semver.org/).

## [0.11.0] — 2026-04-28

Release combining the v0.10.1 install hotfix with second-wave
Harper-private feature ports informed by the 2026-04-27 community sweep
(mlx-lm#1185, mlx-lm#1206, mlx-vlm#1064, omlx#578/#862/#902).

### Fixed (was v0.10.1 hotfix)

- **PEP 639 conflict in `pyproject.toml`** preventing editable install
  on modern setuptools (`License classifiers have been superseded by
  license expressions`). v0.10.0 declared both
  `license = "MIT"` (SPDX expression) AND
  `License :: OSI Approved :: MIT License` (classifier) — modern
  setuptools (≥80) rejected the conflict with `InvalidConfigError`,
  blocking every `pip install -e .` and `pip install
  git+https://github.com/Harperbot/metal-guard.git@v0.10.0`. **Every
  CI run since v0.9.0 (2026-04-24) failed for this reason**, and the
  README Option A install path documented in v0.10.0 was actually
  broken on modern Python toolchains. SPDX expression is now the
  single source of truth.

- **L11 orphan-monitor regex over-greedy** — `_BREADCRUMB_LINE_RE`
  used `(?P<payload>.*)$` which swallowed any trailing
  ` | k=v ...` metadata into the payload group. FIFO pairing in
  `scan_orphan_subproc_pre` keys by the full string, so PRE/POST
  written via `breadcrumb_with_meta()` (new in v0.11.0) with different
  meta would never match → false-positive orphan storm. Regex now
  lazy-stops at the optional ` | <meta>` separator.

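The greedy-versus-lazy difference behind the L11 fix can be reproduced in a few lines. This is a minimal sketch: both patterns are hypothetical stand-ins rather than the actual `_BREADCRUMB_LINE_RE`, and the `SUBPROC_PRE` / `SUBPROC_POST` tags, timestamps, and the `payload_of` helper are invented for illustration.

```python
import re

# Hypothetical "before" pattern: the greedy payload group runs to end of line.
GREEDY_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] (?P<tag>\w+): (?P<payload>.*)$")

# Hypothetical "after" pattern: lazy payload plus an optional ` | meta` tail.
LAZY_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<tag>\w+): (?P<payload>.*?)(?: \| (?P<meta>.*))?$"
)

pre = "[2026-04-28T10:00:00] SUBPROC_PRE: model-a | ctx=4096"
post = "[2026-04-28T10:00:05] SUBPROC_POST: model-a | elapsed_ms=5000"
legacy = "[2026-04-28T10:00:00] SUBPROC_PRE: model-a"


def payload_of(pattern, line):
    """Return the payload group, or None when the line does not match."""
    match = pattern.match(line)
    return match.group("payload") if match else None


# Greedy: the metadata leaks into the payload, so PRE and POST keys differ.
greedy_keys = (payload_of(GREEDY_RE, pre), payload_of(GREEDY_RE, post))

# Lazy: both lines reduce to the same payload and can be paired.
lazy_keys = (payload_of(LAZY_RE, pre), payload_of(LAZY_RE, post))

# Lines without metadata still parse, with the meta group left as None.
legacy_match = LAZY_RE.match(legacy)
```

With the greedy pattern the two keys differ, which is the condition that would make a pairing step report orphans; with the lazy pattern they agree, and legacy lines still parse.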
### Added — `error_classifier` (informed by 2026-04 community sweep)

Central regex table (`classify_mlx_error(text) -> ErrorClass | None`)
covering 7 distinct MLX-related error signatures across 6 severity
classes:

| Severity | Recovery hint | Source signal |
|---|---|---|
| `kernel_panic` | `wait_lockout` | `prepare_count_underflow` + `IOGPUMemory.cpp` |
| `kernel_panic` | `wait_lockout` | `IOGPUGroupMemory.cpp:219` `fPendingMemorySet` |
| `command_buffer_oom` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorOutOfMemory` (mlx-lm#1206) |
| `gpu_hang` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorHang` (mlx-vlm#1064) |
| `gpu_page_fault` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorPageFault` |
| `descriptor_leak` | `force_reload` | `[metal::malloc] Resource limit (N) exceeded` (mlx-lm#1185) |
| `process_abort` | `respawn_now` | MetalStream SIGABRT, generic command buffer failure |

INVARIANT: kernel-panic entries are first in the priority table; when
both kernel + abort signatures appear in one log, kernel wins so the
abort counter doesn't double-count machines that already rebooted.

`SubprocessCrashError` now auto-classifies `detail` on construction
and exposes `error_class` + `recovery_hint` for caller routing.

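The priority invariant is easiest to see as an ordered table scanned top to bottom. The sketch below is an assumption-laden illustration, not the shipped `classify_mlx_error`: the `ErrorClass` fields, the `classify_sketch` name, and the simplified single-token patterns are all hypothetical, and only the severity and hint strings come from the table above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorClass:
    # Hypothetical shape; the real ErrorClass may carry more fields.
    severity: str
    recovery_hint: str


# Order matters: kernel-panic patterns come first so they win over aborts.
# Patterns are simplified to one token each for illustration.
_PRIORITY_TABLE = [
    (re.compile(r"prepare_count_underflow"),
     ErrorClass("kernel_panic", "wait_lockout")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorOutOfMemory"),
     ErrorClass("command_buffer_oom", "respawn_now")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorHang"),
     ErrorClass("gpu_hang", "respawn_now")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorPageFault"),
     ErrorClass("gpu_page_fault", "respawn_now")),
    (re.compile(r"Resource limit \(\d+\) exceeded"),
     ErrorClass("descriptor_leak", "force_reload")),
    (re.compile(r"SIGABRT"),
     ErrorClass("process_abort", "respawn_now")),
]


def classify_sketch(text: str) -> Optional[ErrorClass]:
    """Return the first matching class in priority order, else None."""
    for pattern, error_class in _PRIORITY_TABLE:
        if pattern.search(text):
            return error_class
    return None


# A log that contains both an abort and a kernel-panic signature.
mixed_log = "MetalStream SIGABRT\npanic: prepare_count_underflow in IOGPUMemory.cpp"
mixed_result = classify_sketch(mixed_log)
leak_result = classify_sketch("[metal::malloc] Resource limit (499000) exceeded")
```

Because the scan stops at the first hit, the mixed log classifies as `kernel_panic` even though it also contains `SIGABRT`. A caller could then branch on `recovery_hint` to decide between waiting out a lockout and respawning the subprocess.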
### Added — L10b: process-abort scanner

- `scan_recent_aborts(hours=24.0)` — sibling to `scan_recent_panics`
  but for non-rebooting failures (default 24h vs 72h window since
  aborts decay quicker).
- `AbortRecord` dataclass with `error_class` field.
- `CooldownVerdict.abort_count_24h` — informational only, exposed for
  dashboard surface but **does NOT** influence `exit_code`. The
  staircase lockout remains reserved for kernel panics that actually
  rebooted the machine.

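A time-windowed report scan of the kind described above might look like the following sketch. It assumes file modification time is a good enough proxy for when the abort happened, and the `recent_reports` helper is hypothetical rather than the body of `scan_recent_aborts`. The two glob patterns follow the commit message, which notes that abort reports are written as `*.ips` and that an earlier draft only globbed `*.panic`.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Optional


def recent_reports(report_dir: Path, hours: float = 24.0,
                   now: Optional[float] = None) -> List[Path]:
    """Return report files modified within the last `hours` hours."""
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    hits: List[Path] = []
    # Glob both extensions so *.ips abort reports are not missed.
    for pattern in ("*.ips", "*.panic"):
        hits.extend(p for p in report_dir.glob(pattern)
                    if p.stat().st_mtime >= cutoff)
    return sorted(hits)


# Demo against a throwaway directory instead of the real report folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "fresh.ips").write_text("abort")
    (root / "stale.ips").write_text("abort")
    (root / "notes.txt").write_text("ignored")
    now = time.time()
    two_days_ago = now - 48 * 3600
    # Backdate one report so it falls outside the 24h window.
    os.utime(root / "stale.ips", (two_days_ago, two_days_ago))
    found_names = [p.name for p in recent_reports(root, hours=24.0, now=now)]
```

Passing `now` explicitly keeps the window deterministic in tests; in real use the directory would be `~/Library/Logs/DiagnosticReports`.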
### Added — L13b: Apple GPU family detection

- `apple_gpu_family() -> dict` reads `mx.device_info()`:
  `architecture`, `resource_limit`, `max_buffer_length`,
  `max_recommended_working_set_size`, `memory_size`. Maps to family
  `M1` / `M2` / `M3` / `M4` / `M5` via `applegpu_g13` / `g14` / `g15`
  / `g16` / `g17` prefix. mlx-lm#1206 hypothesises that
  `applegpu_g17s` (M5 Max) has command-buffer limits independent of
  RAM, so per-family classification feeds `KNOWN_PANIC_MODELS`
  filtering.

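The prefix mapping itself is a small lookup. The sketch below takes the architecture string as a plain argument so it runs without MLX installed; `family_from_architecture` is a hypothetical helper name, and only the prefix-to-family pairs come from the entry above.

```python
from typing import Optional

# Prefix-to-family pairs as listed in the changelog entry.
_PREFIX_TO_FAMILY = {
    "applegpu_g13": "M1",
    "applegpu_g14": "M2",
    "applegpu_g15": "M3",
    "applegpu_g16": "M4",
    "applegpu_g17": "M5",
}


def family_from_architecture(architecture: str) -> Optional[str]:
    """Map an architecture string such as 'applegpu_g17s' to its family."""
    for prefix, family in _PREFIX_TO_FAMILY.items():
        if architecture.startswith(prefix):
            return family
    # Unknown or future architectures are reported as None, not guessed.
    return None
```

Matching on the prefix rather than the full string means suffixed variants such as `applegpu_g17s` resolve to the same family as the base name.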
### Added — L14: descriptor-leak heuristic

- `ResourceTracker(cold_restart_after=4000)` — thread-safe inference
  counter targeting mlx-lm#1185 descriptor leak (Resource limit
  exceeded). Caller calls `record_inference()` after each generate;
  `should_cold_restart()` returns True at threshold so caller can
  shutdown + spawn new subprocess to release accumulated descriptors.
  `mx.clear_cache()` releases buffers but descriptor handles
  accumulate independently — only subprocess respawn fully releases.
- Env knobs: `METALGUARD_COLD_RESTART_AFTER_N`,
  `METALGUARD_COLD_RESTART_DISABLED=1` (kill switch).

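The counter-plus-threshold pattern can be sketched as below. `InferenceCounter` is a hypothetical stand-in for `ResourceTracker`; the `env` parameter, the `reset()` method, and the choice that the environment variable overrides the constructor argument are assumptions made for illustration.

```python
import os
import threading
from typing import Mapping, Optional


class InferenceCounter:
    """Hypothetical stand-in for the ResourceTracker described above."""

    def __init__(self, cold_restart_after: int = 4000,
                 env: Optional[Mapping[str, str]] = None):
        env = os.environ if env is None else env
        override = env.get("METALGUARD_COLD_RESTART_AFTER_N")
        # Assumption: the env knob, when set, wins over the argument.
        self.threshold = int(override) if override else cold_restart_after
        self.disabled = env.get("METALGUARD_COLD_RESTART_DISABLED") == "1"
        self._count = 0
        self._lock = threading.Lock()

    def record_inference(self) -> int:
        """Increment the counter after one generate call."""
        with self._lock:
            self._count += 1
            return self._count

    def should_cold_restart(self) -> bool:
        """True once the threshold is reached, unless the kill switch is on."""
        if self.disabled:
            return False
        with self._lock:
            return self._count >= self.threshold

    def reset(self) -> None:
        """Call after respawning the subprocess."""
        with self._lock:
            self._count = 0
```

A serving loop would call `record_inference()` after each generation and, when `should_cold_restart()` turns true, shut the worker down, spawn a new one, and call `reset()`.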
### Added — `breadcrumb_with_meta()`

- `metal_guard.breadcrumb_with_meta(tag, payload, **meta)` — structured
  breadcrumb format `[ts] TAG: payload | k1=v1 k2=v2`. Lets caller
  attach `ctx`, `kv_bytes`, `elapsed_ms`, `tok_out`, `error_class`,
  `descriptor_used` for richer postmortem forensics.
- L11 `_BREADCRUMB_LINE_RE` updated to lazy regex with optional `meta`
  capture group — backward-compatible with legacy `breadcrumb()`
  callers.

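The line format is simple enough to sketch directly. `format_breadcrumb` below is a hypothetical formatter, not `breadcrumb_with_meta()` itself: it returns the string instead of writing it anywhere, and replacing `|` with `/` is only one plausible way to do the pipe sanitization mentioned in the commit message.

```python
from datetime import datetime, timezone
from typing import Optional


def _clean(value) -> str:
    """Strip characters that would break the ` | ` separator or the line."""
    return str(value).replace("|", "/").replace("\n", " ")


def format_breadcrumb(tag: str, payload: str, ts: Optional[str] = None,
                      **meta) -> str:
    """Build a `[ts] TAG: payload | k=v k=v` line (hypothetical helper)."""
    ts = ts or datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"[{ts}] {tag}: {_clean(payload)}"
    if meta:
        pairs = " ".join(f"{key}={_clean(val)}" for key, val in meta.items())
        line += f" | {pairs}"
    return line


# Fixed timestamp so the output is reproducible.
with_meta = format_breadcrumb("SUBPROC_PRE", "model-a", ts="T0",
                              ctx=4096, elapsed_ms=12)
without_meta = format_breadcrumb("SUBPROC_PRE", "model-a", ts="T0")
sanitized = format_breadcrumb("SUBPROC_PRE", "a|b", ts="T0")
```

When no metadata is passed the line has no ` | ` tail at all, which is what keeps the format compatible with a parser that treats the metadata group as optional.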
### Changed — `KNOWN_PANIC_MODELS` schema

Schema upgrade adds three optional fields to each entry (legacy fields
preserved for backward-compat with v0.9 / v0.10 callers):

- `tier`: `"panic"` (kernel-level, reboots Mac) /
  `"abort"` (process-level SIGABRT or hang) /
  `"degradation"` (slow descriptor leak, no abort).
- `error_classes[]`: list of distinct failure modes per model. Each
  entry has `type` / `signature` / `first_seen_via` / `hardware` /
  `gpu_family` / `workload` / `mitigation`. Multiple modes per model
  (e.g. mlx-vlm#1064 has both `Hang` and `PageFault` variants).
- `verified_safe_alternative`: known-safe pivot model_id.

New helper functions:

- `check_known_panic_model_for_gpu(model_id, gpu_family="M5")`
  filters `error_classes` by GPU family. Returns None when the model
  is in registry but no error_classes apply to your hardware.
- `models_by_tier(tier)` — query by severity tier.
- `models_affecting_gpu_family(family)` — list models confirmed on
  family.

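To illustrate how the new fields and the per-GPU filter fit together, here is a minimal sketch over a two-entry toy registry. The dictionary layout, the value formats (for example `gpu_family` stored as `"M5"`), the `None` placeholders, and the `check_for_gpu` / `by_tier` names are assumptions; only the model ids, tiers, issue numbers, and hardware families are taken from this release's notes.

```python
from typing import List, Optional

# Toy registry with a subset of the schema fields (layout is assumed).
REGISTRY = {
    "mlx-community/Qwen3-VL-2B-Instruct": {
        "tier": "abort",
        "error_classes": [
            {"type": "gpu_hang", "gpu_family": "M5",
             "first_seen_via": "mlx-vlm#1064"},
            {"type": "gpu_page_fault", "gpu_family": "M5",
             "first_seen_via": "mlx-vlm#1064"},
        ],
        "verified_safe_alternative": None,  # placeholder, not real data
    },
    "mlx-community/Qwen3.5-27B-4bit": {
        "tier": "degradation",
        "error_classes": [
            {"type": "descriptor_leak", "gpu_family": "M4",
             "first_seen_via": "mlx-lm#1185"},
        ],
        "verified_safe_alternative": None,  # placeholder, not real data
    },
}


def check_for_gpu(model_id: str, gpu_family: str) -> Optional[dict]:
    """Return the entry narrowed to one GPU family, or None if none apply."""
    entry = REGISTRY.get(model_id)
    if entry is None:
        return None
    matching = [ec for ec in entry["error_classes"]
                if ec["gpu_family"] == gpu_family]
    if not matching:
        return None
    return {**entry, "error_classes": matching}


def by_tier(tier: str) -> List[str]:
    """List model ids whose entry carries the given tier."""
    return sorted(m for m, e in REGISTRY.items() if e["tier"] == tier)
```

Returning `None` for a registered model with no matching family mirrors the behaviour described for `check_known_panic_model_for_gpu`: the caller can treat `None` as "nothing known for this hardware" without distinguishing the two cases.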
### Added — 4 new `KNOWN_PANIC_MODELS` entries

- `mlx-community/Qwen3.5-27B-4bit` — degradation (LoRA descriptor leak,
  M4 Max, mlx-lm#1185).
- `mlx-community/Qwen3.5-35B-A3B-8bit` — degradation + abort (LoRA
  leak #1185 + long-context streaming abort omlx#578).
- `mlx-community/Qwen3.6-35B-A3B-8bit` — abort (DFlash drafter,
  omlx#902). Mitigation: disable DFlash.
- `mlx-community/Qwen3-VL-2B-Instruct` — abort (M5 Max GPU hang +
  page fault, mlx-vlm#1064). Mitigation: avoid M5 Max; M1-M4 untested.

The original `gemma-4-31b-it-8bit` entry retains its legacy fields and
adds the new schema fields.

### Notes

The earliest test of the registry's value: **v0.11.0 ships data on
five distinct (model × hardware × workload) combinations, not just
one**. If a user on M5 Max hits Qwen3-VL hang, they can now query
metal-guard before debugging upstream. If a user on M4 Max starts a
LoRA on Qwen3.5-27B, they can wire `ResourceTracker` from day one
instead of waiting for their first `Resource limit exceeded` crash.

Bump `pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"`.

## [0.10.0] — 2026-04-27

Promotes four Harper-private defence layers (`L10`-`L13`) to the public

README.ja.md

Lines changed: 5 additions & 5 deletions
@@ -6,7 +6,7 @@ for running [MLX](https://github.com/ml-explore/mlx) on Apple Silicon

 Prevents kernel panics and OOM crashes caused by Metal driver bugs during MLX inference, with multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling particularly in mind.

-**Current version: v0.10.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.
+**Current version: v0.11.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.

 ### What was added in v0.10

@@ -124,13 +124,13 @@ panic(cpu 4 caller 0xfffffe0032a550f8):
 Install from a tagged release. Besides the `metal-guard` and `mlx-safe-python` console scripts, this installs the `metal_guard` Python module:

 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```

 After install:

 ```bash
-metal-guard --version # → metal-guard 0.10.0
+metal-guard --version # → metal-guard 0.11.0
 metal-guard panic-gate # L10 cooldown verdict
 metal-guard status # full snapshot
 mlx-safe-python -c "import torch" # interactive shell guard
@@ -145,7 +145,7 @@ mlx-safe-python -c "import torch" # interactive shell guard
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```

 In your code:
@@ -179,7 +179,7 @@ $ metal-guard panic-gate
 🟢 PROCEED no recent IOGPU panics
 24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0 🟢 OK
+metal-guard 0.11.0 🟢 OK
 mode defensive — defensive mode (default)
 panics 0 in last 72h
 ...

README.md

Lines changed: 19 additions & 6 deletions
@@ -6,7 +6,20 @@ GPU safety layer for [MLX](https://github.com/ml-explore/mlx) on Apple Silicon.

 Prevents kernel panics and OOM crashes caused by Metal driver bugs when running MLX inference — especially multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling.

-**Current version: v0.10.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and per-feature rationale.
+**Current version: v0.11.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and per-feature rationale.
+
+### What's in v0.11
+
+Built on the 2026-04-27 community sweep (mlx-lm#1185 / #1206 / mlx-vlm#1064 / omlx#578 / #862 / #902):
+
+- **`error_classifier`** — central regex table for **6 distinct error severities**: `kernel_panic` / `process_abort` / `command_buffer_oom` / `gpu_hang` / `gpu_page_fault` / `descriptor_leak`. `SubprocessCrashError` now exposes `.error_class` + `.recovery_hint` for caller routing.
+- **L10b — process-abort scanner** — `scan_recent_aborts(24h)` sibling to `scan_recent_panics(72h)`. Aborts are non-rebooting failures; counted separately so they don't trip the kernel-panic lockout. `CooldownVerdict.abort_count_24h` exposed for dashboards.
+- **L13b — Apple GPU family detection** — `apple_gpu_family()` reads `mx.device_info()` and maps `applegpu_g13`/`g14`/`g15`/`g16`/`g17` → `M1`/`M2`/`M3`/`M4`/`M5`. Surfaces `resource_limit` (mlx-lm#1185 descriptor cap, 499000 on M1 Ultra).
+- **L14 — descriptor-leak heuristic** — `ResourceTracker(cold_restart_after=4000)` tracks inferences-since-cold-restart so callers can pre-emptively `shutdown()` + spawn new subprocess before hitting the descriptor limit. `mx.clear_cache()` doesn't release descriptor handles; only subprocess respawn does.
+- **`breadcrumb_with_meta(tag, payload, **meta)`** — structured breadcrumb format `[ts] TAG: payload | k=v k=v` for richer postmortem forensics. L11 orphan parser updated to lazy regex (backward-compat with legacy `breadcrumb()`).
+- **`KNOWN_PANIC_MODELS` schema upgrade** — adds `tier` (panic / abort / degradation), `error_classes[]` (multiple modes per model + per-GPU-family confirmation), `verified_safe_alternative`. New helpers: `check_known_panic_model_for_gpu(model, gpu_family="M5")` / `models_by_tier()` / `models_affecting_gpu_family()`.
+- **4 new registry entries** covering Qwen3.5/Qwen3.6/Qwen3-VL family across M4 / M5 hardware.
+- **Hotfix**: PEP 639 license-classifier conflict in `pyproject.toml` that blocked every `pip install` since v0.9.0.

 ### What's in v0.10

@@ -127,13 +140,13 @@ This affects any workflow that loads and unloads multiple MLX models in sequence
 Installs from a tagged release — gives you the `metal-guard` and `mlx-safe-python` console scripts plus the `metal_guard` Python module:

 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```

 After install:

 ```bash
-metal-guard --version # → metal-guard 0.10.0
+metal-guard --version # → metal-guard 0.11.0
 metal-guard panic-gate # L10 cooldown verdict
 metal-guard status # full snapshot
 mlx-safe-python -c "import torch" # interactive shell guard
@@ -148,7 +161,7 @@ To upgrade to a future release: `pip install --upgrade "git+https://github.com/H
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```

 Then in your code:
@@ -182,7 +195,7 @@ $ metal-guard panic-gate
 🟢 PROCEED no recent IOGPU panics
 24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0 🟢 OK
+metal-guard 0.11.0 🟢 OK
 mode defensive — defensive mode (default)
 panics 0 in last 72h
 ...
@@ -238,7 +251,7 @@ metal_guard.start_kv_cache_monitor(headroom_gb=config["kv_headroom_gb"])
 MetalGuard is organised as **defence layers (L1–L13)** plus a set of
 **preventive helpers (R-series)** and the **`KNOWN_PANIC_MODELS` registry**.
 Every feature is available from the single `metal_guard` module — install via
-`pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"` or
+`pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"` or
 drop `metal_guard.py` in your `PYTHONPATH` (see [Installation](#installation) above). See [CHANGELOG.md](CHANGELOG.md) for when
 each layer landed and the incident that motivated it.

README.zh-TW.md

Lines changed: 5 additions & 5 deletions
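@@ -6,7 +6,7 @@ GPU safety layer for [MLX](https://github.com/ml-explore/mlx) on Apple Silicon.

 Prevents kernel panics and OOM crashes caused by Metal driver bugs during MLX inference, especially in multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling.

-**Current version: v0.10.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature
+**Current version: v0.11.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature

 ### What v0.10 brings
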
@@ -124,13 +124,13 @@ panic(cpu 4 caller 0xfffffe0032a550f8):
 Install from a tagged release. You get the two console scripts `metal-guard` and `mlx-safe-python` plus the `metal_guard` Python module:

 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```

 After install:

 ```bash
-metal-guard --version # → metal-guard 0.10.0
+metal-guard --version # → metal-guard 0.11.0
 metal-guard panic-gate # L10 cooldown verdict
 metal-guard status # full snapshot
 mlx-safe-python -c "import torch" # interactive shell guard
@@ -145,7 +145,7 @@ mlx-safe-python -c "import torch" # interactive shell guard
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```

 In your code:
@@ -179,7 +179,7 @@ $ metal-guard panic-gate
 🟢 PROCEED no recent IOGPU panics
 24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0 🟢 OK
+metal-guard 0.11.0 🟢 OK
 mode defensive — defensive mode (default)
 panics 0 in last 72h
 ...
