Harperbot
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 144 additions & 0 deletions
diff --git a/README.ja.md b/README.ja.md
Lines changed: 5 additions & 5 deletions
diff --git a/README.md b/README.md
Lines changed: 19 additions & 6 deletions
diff --git a/README.zh-TW.md b/README.zh-TW.md
Lines changed: 5 additions & 5 deletions
@@ -5,6 +5,150 @@ All notable changes to **metal-guard** are documented here.
 The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/).
 
+## [0.11.0] — 2026-04-28
+
+Combines the v0.10.1 install hotfix with a second wave of Harper-private
+feature ports informed by the 2026-04-27 community sweep
+(mlx-lm#1185, mlx-lm#1206, mlx-vlm#1064, omlx#578/#862/#902).
+
+### Fixed (was v0.10.1 hotfix)
+
+- **PEP 639 conflict in `pyproject.toml`** prevented editable installs
+  on modern setuptools (`License classifiers have been superseded by
+  license expressions`). v0.10.0 declared both
+  `license = "MIT"` (SPDX expression) and the
+  `License :: OSI Approved :: MIT License` classifier. Modern
+  setuptools (≥80) rejected the combination with `InvalidConfigError`,
+  which blocked every `pip install -e .` and `pip install
+  git+https://github.com/Harperbot/metal-guard.git@v0.10.0`. **Every
+  CI run since v0.9.0 (2026-04-24) failed for this reason**, and the
+  README Option A install path documented in v0.10.0 was
+  broken on modern Python toolchains. The SPDX expression is now the
+  single source of truth.
+
+- **L11 orphan-monitor regex over-greedy** — `_BREADCRUMB_LINE_RE`
+  used `(?P<payload>.*)$`, which swallowed any trailing
+  ` | k=v ...` metadata into the payload group. FIFO pairing in
+  `scan_orphan_subproc_pre` keys on the full captured string, so PRE/POST
+  breadcrumbs written via `breadcrumb_with_meta()` (new in v0.11.0) with
+  different meta would never match, producing a false-positive orphan
+  storm. The regex now matches the payload lazily and stops at the
+  optional ` | <meta>` separator.
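
The lazy-payload fix can be illustrated with a minimal sketch. The pattern below is a hypothetical stand-in, not the project's actual `_BREADCRUMB_LINE_RE`. The tag names, the tag character set, and the `k=v` meta grammar are assumptions derived from the `[ts] TAG: payload | k1=v1 k2=v2` format described in this release.

```python
import re

# Hypothetical stand-in for the real pattern; names and grammar are assumed.
SKETCH_LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<tag>[A-Z0-9_]+):\s+"
    r"(?P<payload>.*?)"                              # lazy: stop as early as possible
    r"(?:\s\|\s(?P<meta>\S+=\S*(?:\s+\S+=\S*)*))?$"  # optional " | k=v k=v" tail
)


def parse_breadcrumb(line):
    """Return the named groups of a breadcrumb line, or None if it does not match."""
    match = SKETCH_LINE_RE.match(line)
    return match.groupdict() if match else None


pre = parse_breadcrumb("[2026-04-28T10:00:00] SUBPROC_PRE: model-a | ctx=4096")
post = parse_breadcrumb("[2026-04-28T10:00:05] SUBPROC_POST: model-a | elapsed_ms=5000")
legacy = parse_breadcrumb("[2026-04-28T10:00:09] SUBPROC_PRE: model-b")

# PRE and POST share the same pairing key even though their meta differs.
print(pre["payload"] == post["payload"])
# A legacy line without meta still parses; its meta group is simply None.
print(legacy["payload"], legacy["meta"])
```

With a greedy `.*` in place of `.*?`, the two payloads would include their differing meta tails and would no longer compare equal, which is the mismatch described above.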
+
+### Added — `error_classifier` (informed by 2026-04 community sweep)
+
+Central regex table (`classify_mlx_error(text) -> ErrorClass | None`)
+covering 7 distinct MLX-related error signatures across 6 severity
+classes:
+
+| Severity | Recovery hint | Source signal |
+|---|---|---|
+| `kernel_panic` | `wait_lockout` | `prepare_count_underflow` + `IOGPUMemory.cpp` |
+| `kernel_panic` | `wait_lockout` | `IOGPUGroupMemory.cpp:219` `fPendingMemorySet` |
+| `command_buffer_oom` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorOutOfMemory` (mlx-lm#1206) |
+| `gpu_hang` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorHang` (mlx-vlm#1064) |
+| `gpu_page_fault` | `respawn_now` | `kIOGPUCommandBufferCallbackErrorPageFault` |
+| `descriptor_leak` | `force_reload` | `[metal::malloc] Resource limit (N) exceeded` (mlx-lm#1185) |
+| `process_abort` | `respawn_now` | MetalStream SIGABRT, generic command buffer failure |
+
+INVARIANT: kernel-panic entries come first in the priority table. When
+both kernel-panic and abort signatures appear in one log, the kernel-panic
+class wins, so the abort counter doesn't double-count machines that have
+already rebooted.
+
+`SubprocessCrashError` now auto-classifies `detail` on construction
+and exposes `error_class` + `recovery_hint` for caller routing.
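
A first-match-wins table is one way to get the priority invariant above. The sketch below is illustrative only: `SketchErrorClass`, `classify_sketch`, and the simplified patterns are assumptions, not the project's actual `classify_mlx_error` table. The severity and hint strings are taken from the table in this section.

```python
import re
from typing import NamedTuple, Optional


class SketchErrorClass(NamedTuple):
    severity: str
    recovery_hint: str


# Ordered table: the first matching pattern wins, so kernel-panic rows go first.
# Patterns are simplified assumptions based on the signals listed above.
_SKETCH_TABLE = [
    (re.compile(r"prepare_count_underflow"), SketchErrorClass("kernel_panic", "wait_lockout")),
    (re.compile(r"fPendingMemorySet"), SketchErrorClass("kernel_panic", "wait_lockout")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorOutOfMemory"),
     SketchErrorClass("command_buffer_oom", "respawn_now")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorHang"),
     SketchErrorClass("gpu_hang", "respawn_now")),
    (re.compile(r"kIOGPUCommandBufferCallbackErrorPageFault"),
     SketchErrorClass("gpu_page_fault", "respawn_now")),
    (re.compile(r"\[metal::malloc\] Resource limit \(\d+\) exceeded"),
     SketchErrorClass("descriptor_leak", "force_reload")),
    (re.compile(r"SIGABRT"), SketchErrorClass("process_abort", "respawn_now")),
]


def classify_sketch(text: str) -> Optional[SketchErrorClass]:
    """Return the highest-priority class whose pattern appears in text, else None."""
    for pattern, error_class in _SKETCH_TABLE:
        if pattern.search(text):
            return error_class
    return None


mixed_log = "SIGABRT in worker ... IOGPUGroupMemory.cpp:219 fPendingMemorySet"
print(classify_sketch(mixed_log).severity)
print(classify_sketch("[metal::malloc] Resource limit (499000) exceeded").recovery_hint)
```

Because the list is scanned in order, a log containing both an abort and a kernel-panic signature resolves to `kernel_panic` without any extra tie-breaking logic.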
+
+### Added — L10b: process-abort scanner
+
+- `scan_recent_aborts(hours=24.0)` — sibling to `scan_recent_panics`
+  for non-rebooting failures (the default window is 24h rather than 72h,
+  since aborts decay faster).
+- `AbortRecord` dataclass with `error_class` field.
+- `CooldownVerdict.abort_count_24h` — informational only, exposed for
+  dashboards; it **does NOT** influence `exit_code`. The
+  staircase lockout remains reserved for kernel panics that actually
+  rebooted the machine.
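
The separation between the abort count and the lockout decision can be sketched as follows. Everything here is hypothetical: `SketchAbort`, `aborts_within`, `sketch_exit_code`, and the exit-code values are illustrative names and numbers, and the real staircase lockout is more involved than a single threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class SketchAbort:
    timestamp: datetime
    error_class: str


def aborts_within(records, hours=24.0, now=None):
    """Keep only the records whose timestamp falls inside the trailing window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [record for record in records if record.timestamp >= cutoff]


def sketch_exit_code(panic_count_72h, abort_count_24h):
    """Decide the gate from panics only; the abort count is informational."""
    del abort_count_24h  # deliberately unused: aborts never trip the lockout
    return 0 if panic_count_72h == 0 else 1


now = datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc)
records = [
    SketchAbort(now - timedelta(hours=1), "gpu_hang"),
    SketchAbort(now - timedelta(hours=30), "process_abort"),
]
recent = aborts_within(records, hours=24.0, now=now)
print(len(recent), sketch_exit_code(panic_count_72h=0, abort_count_24h=len(recent)))
```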
+
+### Added — L13b: Apple GPU family detection
+
+- `apple_gpu_family() -> dict` reads `mx.device_info()`:
+  `architecture`, `resource_limit`, `max_buffer_length`,
+  `max_recommended_working_set_size`, `memory_size`. Maps to family
+  `M1` / `M2` / `M3` / `M4` / `M5` via `applegpu_g13` / `g14` / `g15`
+  / `g16` / `g17` prefix. mlx-lm#1206 hypothesises that
+  `applegpu_g17s` (M5 Max) has command-buffer limits independent of
+  RAM, so per-family classification feeds `KNOWN_PANIC_MODELS`
+  filtering.
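
The prefix-to-family mapping described above can be sketched as a plain function. To stay runnable without MLX installed, this sketch takes the architecture string as input instead of reading `mx.device_info()`. The function name `family_from_architecture` is hypothetical, and the sample suffixes other than `applegpu_g17s` are made up for illustration.

```python
from typing import Optional

# Prefix table taken from the changelog entry above.
_PREFIX_TO_FAMILY = {
    "applegpu_g13": "M1",
    "applegpu_g14": "M2",
    "applegpu_g15": "M3",
    "applegpu_g16": "M4",
    "applegpu_g17": "M5",
}


def family_from_architecture(architecture: str) -> Optional[str]:
    """Map an architecture string such as 'applegpu_g17s' to its chip family."""
    for prefix, family in _PREFIX_TO_FAMILY.items():
        if architecture.startswith(prefix):
            return family
    return None  # unknown or non-Apple GPU


print(family_from_architecture("applegpu_g17s"))
print(family_from_architecture("unknown_gpu"))
```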
+
+### Added — L14: descriptor-leak heuristic
+
+- `ResourceTracker(cold_restart_after=4000)` — thread-safe inference
+  counter targeting the mlx-lm#1185 descriptor leak (`Resource limit
+  exceeded`). The caller invokes `record_inference()` after each generate;
+  `should_cold_restart()` returns `True` at the threshold so the caller can
+  shut down and spawn a new subprocess to release accumulated descriptors.
+  `mx.clear_cache()` releases buffers, but descriptor handles
+  accumulate independently; only a subprocess respawn fully releases them.
+- Env knobs: `METALGUARD_COLD_RESTART_AFTER_N`,
+  `METALGUARD_COLD_RESTART_DISABLED=1` (kill switch).
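
A minimal sketch of such a counter follows, assuming the environment variable overrides the constructor argument and that an unparsable value falls back to the argument. Those precedence rules, the class name `ColdRestartCounter`, and the `reset()` method are assumptions; only the method names `record_inference()` / `should_cold_restart()` and the two env knobs come from this changelog.

```python
import os
import threading


class ColdRestartCounter:
    """Hypothetical stand-in for ResourceTracker: counts inferences under a lock."""

    def __init__(self, cold_restart_after=4000):
        raw = os.environ.get("METALGUARD_COLD_RESTART_AFTER_N")
        try:
            # Assumed precedence: a valid env value overrides the argument.
            self._threshold = int(raw) if raw else cold_restart_after
        except ValueError:
            self._threshold = cold_restart_after
        self._disabled = os.environ.get("METALGUARD_COLD_RESTART_DISABLED") == "1"
        self._count = 0
        self._lock = threading.Lock()

    def record_inference(self):
        """Increment the counter; call once after each generate."""
        with self._lock:
            self._count += 1
            return self._count

    def should_cold_restart(self):
        """True once the threshold is reached, unless the kill switch is set."""
        if self._disabled:
            return False
        with self._lock:
            return self._count >= self._threshold

    def reset(self):
        """Call after the subprocess has been respawned."""
        with self._lock:
            self._count = 0
```

A caller would typically check `should_cold_restart()` between requests, respawn the worker subprocess when it returns `True`, and then call `reset()`.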
+
+### Added — `breadcrumb_with_meta()`
+
+- `metal_guard.breadcrumb_with_meta(tag, payload, **meta)` — structured
+  breadcrumb format `[ts] TAG: payload | k1=v1 k2=v2`. Lets the caller
+  attach `ctx`, `kv_bytes`, `elapsed_ms`, `tok_out`, `error_class`, and
+  `descriptor_used` for richer postmortem forensics.
+- L11 `_BREADCRUMB_LINE_RE` updated to lazy regex with optional `meta`
+  capture group — backward-compatible with legacy `breadcrumb()`
+  callers.
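
The line format can be sketched with a small formatter. This is not the project's `breadcrumb_with_meta()` implementation: the function name `format_breadcrumb`, the explicit `ts` parameter, and the ISO timestamp default are assumptions, and the real function presumably also writes the line somewhere, which this sketch does not do.

```python
from datetime import datetime, timezone


def format_breadcrumb(tag, payload, ts=None, **meta):
    """Build '[ts] TAG: payload | k1=v1 k2=v2'; the meta tail is omitted when empty."""
    ts = ts or datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"[{ts}] {tag}: {payload}"
    if meta:
        # Keyword arguments keep their call order, so the tail is deterministic.
        line += " | " + " ".join(f"{key}={value}" for key, value in meta.items())
    return line


print(format_breadcrumb("SUBPROC_PRE", "model-a", ts="T0", ctx=4096, tok_out=12))
print(format_breadcrumb("SUBPROC_POST", "model-a", ts="T1"))
```

Omitting the separator entirely when no meta is given keeps the output identical to the legacy format, which is what makes the optional capture group backward-compatible.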
+
+### Changed — `KNOWN_PANIC_MODELS` schema
+
+Schema upgrade adds three optional fields to each entry (legacy fields
+preserved for backward-compat with v0.9 / v0.10 callers):
+
+- `tier`: `"panic"` (kernel-level, reboots Mac) /
+  `"abort"` (process-level SIGABRT or hang) /
+  `"degradation"` (slow descriptor leak, no abort).
+- `error_classes[]`: list of distinct failure modes per model. Each
+  entry has `type` / `signature` / `first_seen_via` / `hardware` /
+  `gpu_family` / `workload` / `mitigation`. A model can have multiple
+  modes (e.g. mlx-vlm#1064 has both `Hang` and `PageFault` variants).
+- `verified_safe_alternative`: `model_id` of a known-safe model to pivot to.
+
+New helper functions:
+
+- `check_known_panic_model_for_gpu(model_id, gpu_family="M5")` —
+  filters `error_classes` by GPU family. Returns `None` when the model
+  is in the registry but no `error_classes` apply to your hardware.
+- `models_by_tier(tier)` — query by severity tier.
+- `models_affecting_gpu_family(family)` — list models confirmed on
+  the given family.
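
The filtering behaviour of these helpers can be sketched against a toy registry. The model ids, the registry contents, the helper names `check_for_gpu` / `by_tier`, and the assumption that `gpu_family` is a single string per error class are all hypothetical; the real `KNOWN_PANIC_MODELS` entries carry more fields.

```python
from typing import Optional

# Toy registry with placeholder model ids; not real KNOWN_PANIC_MODELS data.
SKETCH_REGISTRY = {
    "example-org/model-a": {
        "tier": "abort",
        "error_classes": [
            {"type": "gpu_hang", "gpu_family": "M5"},
            {"type": "gpu_page_fault", "gpu_family": "M5"},
        ],
        "verified_safe_alternative": None,
    },
    "example-org/model-b": {
        "tier": "degradation",
        "error_classes": [{"type": "descriptor_leak", "gpu_family": "M4"}],
        "verified_safe_alternative": None,
    },
}


def check_for_gpu(model_id: str, gpu_family: str) -> Optional[dict]:
    """Return the entry narrowed to the given family, or None if nothing applies."""
    entry = SKETCH_REGISTRY.get(model_id)
    if entry is None:
        return None
    matching = [ec for ec in entry["error_classes"] if ec["gpu_family"] == gpu_family]
    if not matching:
        return None  # model is registered, but not for this hardware
    return {**entry, "error_classes": matching}


def by_tier(tier: str) -> list:
    """List the model ids registered under the given severity tier."""
    return sorted(mid for mid, entry in SKETCH_REGISTRY.items() if entry["tier"] == tier)


print(check_for_gpu("example-org/model-a", "M1"))
print(by_tier("degradation"))
```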
+
+### Added — 4 new `KNOWN_PANIC_MODELS` entries
+
+- `mlx-community/Qwen3.5-27B-4bit` — degradation (LoRA descriptor leak,
+  M4 Max, mlx-lm#1185).
+- `mlx-community/Qwen3.5-35B-A3B-8bit` — degradation + abort (LoRA
+  leak #1185 + long-context streaming abort omlx#578).
+- `mlx-community/Qwen3.6-35B-A3B-8bit` — abort (DFlash drafter,
+  omlx#902). Mitigation: disable DFlash.
+- `mlx-community/Qwen3-VL-2B-Instruct` — abort (M5 Max GPU hang +
+  page fault, mlx-vlm#1064). Mitigation: avoid M5 Max; M1-M4 untested.
+
+The original `gemma-4-31b-it-8bit` entry retains its legacy fields and
+adds the new schema fields.
+
+### Notes
+
+The first real test of the registry's value: **v0.11.0 ships data on
+five distinct (model × hardware × workload) combinations, not just
+one**. A user on M5 Max who hits the Qwen3-VL hang can now query
+metal-guard before debugging upstream. A user on M4 Max starting a
+LoRA run on Qwen3.5-27B can wire in `ResourceTracker` from day one
+instead of waiting for their first `Resource limit exceeded` crash.
+
+To upgrade: `pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"`.
+
 ## [0.10.0] — 2026-04-27
 
 Promotes four Harper-private defence layers (`L10`-`L13`) to the public
 
@@ -6,7 +6,7 @@ Apple Silicon 上で [MLX](https://github.com/ml-explore/mlx) を動かすため
 
 Prevents kernel panics and OOM crashes caused by Metal driver bugs during MLX inference, especially in multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling.
 
-**Current version: v0.10.0.** See [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.
+**Current version: v0.11.0.** See [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.
 
 ### What's new in v0.10
 
@@ -124,13 +124,13 @@ panic(cpu 4 caller 0xfffffe0032a550f8):
 Install from a tagged release. This gives you the `metal-guard` and `mlx-safe-python` console scripts plus the `metal_guard` Python module:
 
 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```
 
 After install:
 
 ```bash
-metal-guard --version          # → metal-guard 0.10.0
+metal-guard --version          # → metal-guard 0.11.0
 metal-guard panic-gate         # L10 cooldown verdict
 metal-guard status             # full snapshot
 mlx-safe-python -c "import torch"   # interactive shell guard
@@ -145,7 +145,7 @@ mlx-safe-python -c "import torch"   # interactive shell guard
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```
 
 In your code:
@@ -179,7 +179,7 @@ $ metal-guard panic-gate
 🟢 PROCEED  no recent IOGPU panics
   24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0  🟢 OK
+metal-guard 0.11.0  🟢 OK
   mode        defensive — defensive mode (default)
   panics      0 in last 72h
   ...
 
@@ -6,7 +6,20 @@ GPU safety layer for [MLX](https://github.com/ml-explore/mlx) on Apple Silicon.
 
 Prevents kernel panics and OOM crashes caused by Metal driver bugs when running MLX inference — especially multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling.
 
-**Current version: v0.10.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and per-feature rationale.
+**Current version: v0.11.0** — see [CHANGELOG.md](CHANGELOG.md) for release history and per-feature rationale.
+
+### What's in v0.11
+
+Built on the 2026-04-27 community sweep (mlx-lm#1185 / #1206 / mlx-vlm#1064 / omlx#578 / #862 / #902):
+
+- **`error_classifier`** — central regex table for **6 distinct error severities**: `kernel_panic` / `process_abort` / `command_buffer_oom` / `gpu_hang` / `gpu_page_fault` / `descriptor_leak`. `SubprocessCrashError` now exposes `.error_class` + `.recovery_hint` for caller routing.
+- **L10b — process-abort scanner** — `scan_recent_aborts(24h)` sibling to `scan_recent_panics(72h)`. Aborts are non-rebooting failures; counted separately so they don't trip the kernel-panic lockout. `CooldownVerdict.abort_count_24h` exposed for dashboards.
+- **L13b — Apple GPU family detection** — `apple_gpu_family()` reads `mx.device_info()` and maps `applegpu_g13`/`g14`/`g15`/`g16`/`g17` → `M1`/`M2`/`M3`/`M4`/`M5`. Surfaces `resource_limit` (mlx-lm#1185 descriptor cap, 499000 on M1 Ultra).
+- **L14 — descriptor-leak heuristic** — `ResourceTracker(cold_restart_after=4000)` tracks inferences-since-cold-restart so callers can pre-emptively `shutdown()` + spawn new subprocess before hitting the descriptor limit. `mx.clear_cache()` doesn't release descriptor handles; only subprocess respawn does.
+- **`breadcrumb_with_meta(tag, payload, **meta)`** — structured breadcrumb format `[ts] TAG: payload | k=v k=v` for richer postmortem forensics. L11 orphan parser updated to lazy regex (backward-compat with legacy `breadcrumb()`).
+- **`KNOWN_PANIC_MODELS` schema upgrade** — adds `tier` (panic / abort / degradation), `error_classes[]` (multiple modes per model + per-GPU-family confirmation), `verified_safe_alternative`. New helpers: `check_known_panic_model_for_gpu(model, gpu_family="M5")` / `models_by_tier()` / `models_affecting_gpu_family()`.
+- **4 new registry entries** covering Qwen3.5/Qwen3.6/Qwen3-VL family across M4 / M5 hardware.
+- **Hotfix**: PEP 639 license-classifier conflict in `pyproject.toml` that blocked every `pip install` since v0.9.0.
 
 ### What's in v0.10
 
@@ -127,13 +140,13 @@ This affects any workflow that loads and unloads multiple MLX models in sequence
 Installs from a tagged release — gives you the `metal-guard` and `mlx-safe-python` console scripts plus the `metal_guard` Python module:
 
 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```
 
 After install:
 
 ```bash
-metal-guard --version          # → metal-guard 0.10.0
+metal-guard --version          # → metal-guard 0.11.0
 metal-guard panic-gate         # L10 cooldown verdict
 metal-guard status             # full snapshot
 mlx-safe-python -c "import torch"   # interactive shell guard
@@ -148,7 +161,7 @@ To upgrade to a future release: `pip install --upgrade "git+https://github.com/H
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```
 
 Then in your code:
@@ -182,7 +195,7 @@ $ metal-guard panic-gate
 🟢 PROCEED  no recent IOGPU panics
   24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0  🟢 OK
+metal-guard 0.11.0  🟢 OK
   mode        defensive — defensive mode (default)
   panics      0 in last 72h
   ...
@@ -238,7 +251,7 @@ metal_guard.start_kv_cache_monitor(headroom_gb=config["kv_headroom_gb"])
 MetalGuard is organised as **defence layers (L1–L13)** plus a set of
 **preventive helpers (R-series)** and the **`KNOWN_PANIC_MODELS` registry**.
 Every feature is available from the single `metal_guard` module — install via
-`pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"` or
+`pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"` or
 drop `metal_guard.py` in your `PYTHONPATH` (see [Installation](#installation) above). See [CHANGELOG.md](CHANGELOG.md) for when
 each layer landed and the incident that motivated it.
 
 
@@ -6,7 +6,7 @@ GPU safety layer for [MLX](https://github.com/ml-explore/mlx) on Apple Silicon.
 
 Prevents kernel panics and OOM crashes caused by Metal driver bugs during MLX inference, especially in multi-model pipelines, long-running servers, and agent frameworks with heavy tool calling.
 
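-**Current version: v0.10.0.** See [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.
+**Current version: v0.11.0.** See [CHANGELOG.md](CHANGELOG.md) for release history and the background of each feature.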
 
 ### What's new in v0.10
 
@@ -124,13 +124,13 @@ panic(cpu 4 caller 0xfffffe0032a550f8):
 Install from a tagged release. This gives you the `metal-guard` and `mlx-safe-python` console scripts plus the `metal_guard` Python module:
 
 ```bash
-pip install "git+https://github.com/Harperbot/metal-guard.git@v0.10.0"
+pip install "git+https://github.com/Harperbot/metal-guard.git@v0.11.0"
 ```
 
 After install:
 
 ```bash
-metal-guard --version          # → metal-guard 0.10.0
+metal-guard --version          # → metal-guard 0.11.0
 metal-guard panic-gate         # L10 cooldown verdict
 metal-guard status             # full snapshot
 mlx-safe-python -c "import torch"   # interactive shell guard
@@ -145,7 +145,7 @@ mlx-safe-python -c "import torch"   # interactive shell guard
 ```bash
 mkdir -p ~/lib/metal-guard
 curl -L -o ~/lib/metal-guard/metal_guard.py \
-  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.10.0/metal_guard.py
+  https://raw.githubusercontent.com/Harperbot/metal-guard/v0.11.0/metal_guard.py
 ```
 
 In your code:
@@ -179,7 +179,7 @@ $ metal-guard panic-gate
 🟢 PROCEED  no recent IOGPU panics
   24h=0 72h=0
 $ metal-guard status
-metal-guard 0.10.0  🟢 OK
+metal-guard 0.11.0  🟢 OK
   mode        defensive — defensive mode (default)
   panics      0 in last 72h
   ...