docs: surface KNOWN_PANIC_MODELS as community-curated registry

Harper · Harper · commit 205f6ab21366 · 2026-04-27T20:40:34.000+08:00
Make the panic registry discoverable. v0.9.0 shipped the data structure + API but the README didn't surface it as a contribution path — users landed on the repo for "kernel panic IOGPUMemory" and found defensive layers but not the registry their report could feed.

Changes:
- README.md / README.zh-TW.md / README.ja.md: prominent section above "The Problem" framing the registry as community-curated, with quick example + per-field schema summary + how to contribute
- .github/ISSUE_TEMPLATE/known-panic-report.yml: structured form (model_id / hardware dropdown / RAM / macOS build / panic_signature / workload / time_to_panic / metal_guard_state / verified workaround / cross_references / 3 confirmation checkboxes)
- CONTRIBUTING.md: full schema docs (required vs optional fields, quality bar, full example entry, submission workflow)

Why now: organic adoption signals are appearing — mlx#3186 cited "@Harperbot's two-trigger-path hypothesis", omlx#862 user attempting to use metal-guard. The registry needs a frictionless contribution path before the next wave of users hit panics on different model × hardware combos.
diff --git a/.github/ISSUE_TEMPLATE/known-panic-report.yml b/.github/ISSUE_TEMPLATE/known-panic-report.yml
@@ -0,0 +1,156 @@
+name: Known Panic Model Report
+description: Report an MLX model that kernel-panics Apple Silicon — proposes addition to KNOWN_PANIC_MODELS
+title: "[panic] <model-id> on <hardware>"
+labels: ["known-panic-models", "panic-report"]
+body:
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for contributing to the [Community Panic Registry](https://github.com/Harperbot/metal-guard#-community-panic-registry--known_panic_models)!
+
+        **Before filing**: please confirm the panic happened **with metal-guard's defensive layers engaged** (cross-process lock, cross-model cadence, cleanup helpers). The registry tracks panics that survive metal-guard mitigation — not panics that metal-guard already prevents.
+
+        Also check existing entries in [`KNOWN_PANIC_MODELS`](../blob/main/metal_guard.py) — if your model is already there, please add your data point as a comment on the existing tracking issue rather than filing a new one.
+
+  - type: input
+    id: model_id
+    attributes:
+      label: Model ID
+      description: Full HuggingFace path. Use the exact ID `mlx_lm.load()` / `mlx_vlm.load()` accepts.
+      placeholder: "mlx-community/gemma-4-31b-it-8bit"
+    validations:
+      required: true
+
+  - type: dropdown
+    id: hardware
+    attributes:
+      label: Hardware
+      options:
+        - "M1 (base)"
+        - "M1 Pro"
+        - "M1 Max"
+        - "M1 Ultra"
+        - "M2 (base)"
+        - "M2 Pro"
+        - "M2 Max"
+        - "M2 Ultra"
+        - "M3 (base)"
+        - "M3 Pro"
+        - "M3 Max"
+        - "M3 Ultra"
+        - "M4 (base)"
+        - "M4 Pro"
+        - "M4 Max"
+        - "M4 Ultra"
+        - "M5 (base)"
+        - "M5 Pro"
+        - "M5 Max"
+        - "Other (specify in workload field)"
+    validations:
+      required: true
+
+  - type: input
+    id: ram
+    attributes:
+      label: Unified memory (GB)
+      placeholder: "64"
+    validations:
+      required: true
+
+  - type: input
+    id: macos
+    attributes:
+      label: macOS version (build)
+      description: Run `sw_vers` to get the build number.
+      placeholder: "26.4.1 (25E253)"
+    validations:
+      required: true
+
+  - type: textarea
+    id: panic_signature
+    attributes:
+      label: Panic signature
+      description: |
+        Open `/Library/Logs/DiagnosticReports/Retired/panic-full-*.panic` (Full Disk Access required) and paste the `panic(...)` line and the IOGPUMemory.cpp:NNN reference. Redact serial numbers / UUIDs as needed.
+      placeholder: |
+        panic(cpu 4 caller 0xfffffe0032a550f8):
+          "completeMemory() prepare count underflow" @IOGPUMemory.cpp:492
+      render: text
+    validations:
+      required: true
+
+  - type: textarea
+    id: workload
+    attributes:
+      label: Workload pattern
+      description: |
+        What was the model doing when it panicked? Examples:
+        - Single-shot generate (prompt size, max tokens)
+        - Long-running server (uptime, request rate, KV cache size)
+        - Multi-model pipeline (which models cycled, in what order)
+        - Concurrent calls (how many threads / processes)
+      placeholder: |
+        - mlx_lm.server, ~8 min uptime
+        - 1× 23k-token prefill (completed OK), then 3× isolated short-prompt generate
+        - Panic ~30s after 3rd short-prompt generate
+        - Sequential through HTTP, no concurrent client requests
+        - --prompt-cache-bytes 8589934592 set
+    validations:
+      required: true
+
+  - type: input
+    id: time_to_panic
+    attributes:
+      label: Time from worker-ready to panic
+      description: How long after the model finished loading did the panic occur? Use "first call" if it panicked on the first generate.
+      placeholder: "8 min 16 s"
+    validations:
+      required: true
+
+  - type: textarea
+    id: metal_guard_state
+    attributes:
+      label: metal-guard configuration in effect
+      description: Which of the L1–L9 layers were active? Output of `metal_guard.describe_mode()` if available.
+      placeholder: |
+        - L1 thread tracking: yes (3 threads registered)
+        - L7 subprocess isolation: no (in-process)
+        - L8 cross-process lock: yes
+        - L9 CadenceGuard: yes (180s)
+        - L9 CircuitBreaker: yes (24h, panic_count=0 prior)
+        - METALGUARD_MODE=defensive
+    validations:
+      required: true
+
+  - type: textarea
+    id: workaround
+    attributes:
+      label: Verified workaround (if any)
+      description: |
+        What did you switch to that does NOT panic on the same hardware + workload? Concrete data points are most valuable — "switched to Ollama and ran X requests over Y days without panic".
+      placeholder: |
+        Switched to llama.cpp on same hardware, same model family — ran 24 hours under same workload, 0 panics.
+        OR
+        Pivoted to mlx-community/gemma-4-26b-a4b-it-4bit (MoE variant) — same panic-free pattern.
+
+  - type: textarea
+    id: cross_references
+    attributes:
+      label: Cross-references
+      description: |
+        Other places this panic has been reported — GitHub issue threads, lmstudio bugs, forum posts, Discord screenshots. Helps establish reproducibility across users.
+      placeholder: |
+        - ml-explore/mlx#3186 (similar M4 base 32GB report)
+        - lmstudio bug #1740
+
+  - type: checkboxes
+    id: confirmations
+    attributes:
+      label: Confirmations
+      options:
+        - label: "I confirmed metal-guard's defensive layers were engaged when the panic occurred"
+          required: true
+        - label: "I checked existing `KNOWN_PANIC_MODELS` entries and this model is not already listed"
+          required: true
+        - label: "I'm willing to share the redacted panic-full-*.panic file if maintainers ask for it"
+          required: false
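A minimal sketch, not part of this commit, of how a maintainer might fold the form's answers into one `reproductions` string. It assumes the one-line format given in CONTRIBUTING.md (hardware and RAM, date, duration from worker-ready, workload one-liner). `format_reproduction` and its parameter names are hypothetical, not metal-guard API. The form has no date field, so the observation date is passed in separately here.

```python
def format_reproduction(hardware: str, ram_gb: int, observed: str,
                        time_to_panic: str, workload: str) -> str:
    """Join issue-form answers into the one-line `reproductions` format."""
    # Hypothetical helper: field order follows the CONTRIBUTING.md format string.
    parts = [
        f"{hardware} {ram_gb}GB",
        observed,
        f"{time_to_panic} from worker-ready",
        # Collapse the multi-line workload textarea into a single line.
        " ".join(workload.split()),
    ]
    # The documented format separates fields with an em dash (U+2014).
    return " \u2014 ".join(parts)


line = format_reproduction("M1 Ultra", 64, "2026-04-23", "~6 min",
                           "sequential HTTP requests,\n  no concurrency")
print(line)
```

Collapsing whitespace keeps each data point on one line, which is what the `reproductions` list in the example entry appears to expect.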
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,90 @@
+# Contributing to metal-guard
+
+Thanks for considering a contribution. This document covers the two highest-leverage contribution paths: **panic reports for the registry** and **code / docs PRs**.
+
+## Known Panic Models — schema
+
+`KNOWN_PANIC_MODELS` is a community-curated dict of MLX model IDs that kernel-panic Apple Silicon Macs *even with metal-guard's defensive layers engaged*. The schema is intentionally rich so the entry stays useful when others read it months later.
+
+### Required fields
+
+| Field | Type | Description |
+|---|---|---|
+| `panic_signature` | str | Exact `IOGPUMemory.cpp:NNN` line + keyword. Match the C++ source location, not just the panic string — Apple sometimes renames the human-readable text but keeps the line number. |
+| `first_observed` | str (ISO date `YYYY-MM-DD`) | First reproduction. |
+| `last_observed` | str (ISO date `YYYY-MM-DD`) | Most recent reproduction. Bump on each new data point. |
+| `reproductions` | list[str] | Production data points. Each entry must include hardware + RAM + time-to-panic + workload summary. Format: `"<hardware> <ram>GB — <date> — <duration> from worker-ready — <workload one-liner>"`. |
+| `recommendation` | str | Actionable workaround. Specific (backend / model / config) is more useful than generic ("be careful"). Cite the metal-guard version that was tried — recommendations age. |
+| `upstream` | list[str] | URLs of upstream tracking issues (mlx / mlx-lm / mlx-vlm GitHub). At least one. |
+
+### Optional fields
+
+| Field | Type | Description |
+|---|---|---|
+| `community` | list[str] | External cross-references (GitHub comments by other users, lmstudio bugs, forum threads). Strengthens "this isn't just one user". |
+| `panic_by_hardware` | dict | Reserved for v0.10+ schema upgrade — per-hardware observation matrix. Don't add yet. |
+| `notes` | str | Caveats, environmental specifics, anything that would surprise the next reader. |
+
+### Quality bar
+
+Entries are conservative by design. We accept either:
+
+1. **A clean production reproduction** — same hardware reproducing the same panic signature on the same model, with metal-guard's L7/L8/L9 layers active. One-shot anecdotes go in `community` not `reproductions`.
+2. **A confirmed upstream issue** with the same panic signature where the model is named in the issue body or a maintainer comment.
+
+We **do not** accept:
+- "Sometimes panics, sometimes doesn't" without a reproduction recipe
+- Models that only panic without metal-guard engaged (those go in the README's "who is affected" section, not the registry)
+- Models whose failure clearly has a different root cause (OOM-on-load, transformers ImportError, etc.) — those are handled separately in `_VERSION_ADVISORIES`
+
+### Example entry
+
+```python
+"mlx-community/gemma-4-31b-it-8bit": {
+    "panic_signature": "IOGPUMemory.cpp:492 prepare_count_underflow",
+    "first_observed": "2026-04-23",
+    "last_observed": "2026-04-24",
+    "reproductions": [
+        "M1 Ultra 64GB — 2026-04-23 03:14 local — ~6 min from worker-ready — "
+        "subprocess worker, pre-cross-model-cadence, gemma-4 first-gen flush absent",
+        "M1 Ultra 64GB — 2026-04-24 03:14 local — ~1.5 min from worker-ready — "
+        "same pipeline, post-fix attempt, panicked sooner",
+    ],
+    "community": [
+        "Hannecke (M4 Max 64GB) — ml-explore/mlx#3186 — pivoted to "
+        "Qwen3-Coder-30B-A3B MoE",
+        "lmstudio bug #1740 — hybrid attention (50 sliding + 10 global) "
+        "KV cache 8-bit weights 34GB + full ctx KV 20GB+ > 54GB",
+        "ml-explore/mlx-lm#883 (M3 Ultra 96GB)",
+    ],
+    "recommendation": (
+        "metal-guard v0.9.0 narrows the race window via cross-model cadence "
+        "(C5) + gemma4_generation_flush (C7) + subprocess_inference_guard "
+        "(B1), but does NOT eliminate panic on this model in production "
+        "workloads. Switch backend (Ollama / llama.cpp) or pivot to MoE "
+        "variant (e.g. mlx-community/gemma-4-26b-a4b-it-4bit)."
+    ),
+    "upstream": [
+        "https://github.com/ml-explore/mlx/issues/3186",
+        "https://github.com/ml-explore/mlx-lm/issues/883",
+        "https://github.com/ml-explore/mlx/issues/3346",
+    ],
+},
+```
+
+### How to submit
+
+1. **File a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml)** — issue template walks through the schema. Maintainers will draft the dict entry from your report.
+2. **OR** open a PR directly modifying `KNOWN_PANIC_MODELS` in `metal_guard.py`. Include the number of the issue you opened first so reviewers can cross-check.
+
+Maintainers may ask for additional data — typically the redacted panic-full-*.panic file (Full Disk Access on macOS required to read) — to confirm the signature before merging.
+
+## Code / docs PRs
+
+Standard GitHub flow. Run `pytest` before submitting. CHANGELOG.md update is required for behavioural changes; not required for typo fixes / docs polish.
+
+If your PR adds a new defence layer (L10+), please also extend the test matrix to cover the new layer's failure modes.
+
+## License
+
+By contributing you agree your contribution is licensed under the same MIT license as the project.
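A minimal sketch, not part of this commit, of a pre-merge check for the schema that CONTRIBUTING.md describes. It assumes only what the tables state: six required fields with the listed types, `YYYY-MM-DD` dates, at least one `upstream` URL, and `panic_by_hardware` reserved for later. `validate_entry` is a hypothetical helper, not something metal-guard is known to ship.

```python
import re
from datetime import date

# Required fields and their container types, as listed in the schema table.
REQUIRED_FIELDS = {
    "panic_signature": str,
    "first_observed": str,
    "last_observed": str,
    "reproductions": list,
    "recommendation": str,
    "upstream": list,
}
_ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _parse_date(value):
    """Return a date for a strict YYYY-MM-DD string, else None."""
    if not isinstance(value, str) or not _ISO_DATE.fullmatch(value):
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:  # e.g. "2026-02-31"
        return None


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing required field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} must be {expected.__name__}")
    first = _parse_date(entry.get("first_observed"))
    last = _parse_date(entry.get("last_observed"))
    for name, parsed in (("first_observed", first), ("last_observed", last)):
        if name in entry and parsed is None:
            problems.append(f"{name} must be a YYYY-MM-DD date")
    if first and last and first > last:
        problems.append("first_observed is later than last_observed")
    upstream = entry.get("upstream")
    if isinstance(upstream, list) and not upstream:
        problems.append("upstream needs at least one URL")
    if "panic_by_hardware" in entry:
        problems.append("panic_by_hardware is reserved for v0.10+")
    return problems


good = {
    "panic_signature": "IOGPUMemory.cpp:492 prepare_count_underflow",
    "first_observed": "2026-04-23",
    "last_observed": "2026-04-24",
    "reproductions": [],
    "recommendation": "Switch backend or pivot to the MoE variant.",
    "upstream": ["https://github.com/ml-explore/mlx/issues/3186"],
}
print(validate_entry(good))
```

Returning a list of messages rather than raising on the first failure lets a reviewer see every problem in one pass.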
diff --git a/README.ja.md b/README.ja.md
@@ -32,6 +32,41 @@ MLX を動かしている Mac が panic / 再起動 / クラッシュし、以
 
 Related upstream tracking: `ml-explore/mlx#3186` / `#3346` / `#3348` / `#3350` / `#3384` / `#3390`, `ml-explore/mlx-lm#883` / `#854` / `#897` / `#1015` / `#1047`, `Blaizzy/mlx-vlm#943` / `#967` / `#999` / `#1011` / `#1016`. metal-guard watches these via `check_version_advisories()` and warns at startup if the installed versions are affected.
 
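+## 📋 Community Panic Model Registry — `KNOWN_PANIC_MODELS`
+
+**A list, curated collaboratively by users, of MLX models that trigger kernel panics on Apple Silicon Macs.** It includes hardware context, root-cause hypotheses, and verified workarounds.
+
+There is no fix in sight for Apple's IOGPUFamily driver bug. The bug itself is an upstream problem, but **which models trigger it under which workloads is something the community can know** — yet today that knowledge is scattered across GitHub issue threads, lmstudio bug reports, Discord screenshots, and `panic-full-*.panic` files that nobody publishes.
+
+metal-guard provides a structured home for that knowledge: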
+
+```python
+from metal_guard import check_known_panic_model, warn_if_known_panic_model
+
+# Check before loading
+advisory = check_known_panic_model("mlx-community/gemma-4-31b-it-8bit")
+if advisory is not None:
+    print(advisory["recommendation"])
+
+# Or a fire-and-forget warning at load time (once per process per model_id)
+warn_if_known_panic_model(model_id)
+```
+
+Each entry contains:
+- **`panic_signature`** — the exact `IOGPUMemory.cpp:NNN` line + keyword to match against your `panic-full-*.panic` log
+- **`reproductions`** — production data points (hardware / RAM / time to panic / workload)
+- **`community`** — cross-references to GitHub issues / lmstudio bugs / forum threads from others who hit the same panic
+- **`recommendation`** — actionable workaround (backend switch / model change / cadence config)
+- **`upstream`** — links to the GitHub issues tracking the underlying driver bug
+
+### How to contribute
+
+If you hit a kernel panic on a specific MLX model **while metal-guard's defensive layers were enabled**, your data point is valuable. Please file a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml) — the issue template walks you through the schema (model ID / hardware / panic signature / workload / time to panic / verified workaround). Schema docs: [CONTRIBUTING.md](CONTRIBUTING.md#known-panic-models--schema).
+
+The registry is conservative by design — an entry is accepted on either "a clear reproduction in production" or "an upstream issue with an explicit signature", because we want to avoid mistakenly blacklisting models that work fine for many users.
+
+**Why isn't reading the mlx#3186 comments enough?** Because that thread mixes hardware reports, hypotheses, fix attempts, and unrelated discussion. The registry distils it into structured advisories that can be queried with `check_known_panic_model()` — and your panic report won't be buried under 50 comments.
+
 ## The Problem
 
 Apple's Metal GPU driver on Apple Silicon has a bug: when GPU memory management fails, **the kernel panics and takes down the whole machine instead of cleanly killing the process**.
diff --git a/README.md b/README.md
@@ -32,6 +32,44 @@ If your Mac is panicking / rebooting / crashing while running MLX and you search
 
 Related upstream tracking: `ml-explore/mlx#3186` / `#3346` / `#3348` / `#3350` / `#3384` / `#3390`, `ml-explore/mlx-lm#883` / `#854` / `#897` / `#1015` / `#1047`, `Blaizzy/mlx-vlm#943` / `#967` / `#999` / `#1011` / `#1016`. metal-guard watches these via `check_version_advisories()` and warns at startup if the installed versions are affected.
 
+## 📋 Community Panic Registry — `KNOWN_PANIC_MODELS`
+
+**A user-curated list of MLX models that kernel-panic Apple Silicon Macs in production, with hardware contexts, root-cause hypotheses, and verified workarounds.**
+
+Apple's IOGPUFamily driver bug has no fix timeline. While the bug is upstream, **which models trigger it under which workloads is something the community can know** — but that knowledge is currently scattered across GitHub issue threads, lmstudio bug reports, Discord screenshots, and individual `panic-full-*.panic` files nobody publishes.
+
+metal-guard provides a structured home for this knowledge:
+
+```python
+from metal_guard import check_known_panic_model, warn_if_known_panic_model
+
+# Check before loading
+advisory = check_known_panic_model("mlx-community/gemma-4-31b-it-8bit")
+if advisory is not None:
+    print(advisory["recommendation"])
+    # → "metal-guard v0.9.0 narrows the race window... but does NOT eliminate
+    #    panic on this model. Switch backend (Ollama / llama.cpp) or pivot
+    #    to MoE variant (e.g. mlx-community/gemma-4-26b-a4b-it-4bit)."
+
+# Or fire-and-forget warning at load time (per-process dedup)
+warn_if_known_panic_model(model_id)
+```
+
+Each entry carries:
+- **`panic_signature`** — the exact `IOGPUMemory.cpp:NNN` line + keyword to match against your `panic-full-*.panic` log
+- **`reproductions`** — production data points (hardware, RAM, time-to-panic, workload)
+- **`community`** — cross-references to GitHub issues / lmstudio bugs / forum threads where others hit the same panic
+- **`recommendation`** — actionable workaround (backend switch / model pivot / cadence config)
+- **`upstream`** — links to the GitHub issues tracking the underlying driver bug
+
+### How to contribute
+
+If you've hit a kernel panic on a specific MLX model **with metal-guard's defensive layers fully engaged**, your data point is valuable. Open a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml) — the template walks you through the schema (model ID / hardware / panic signature / workload / time-to-panic / verified workaround). The full schema is documented in [CONTRIBUTING.md](CONTRIBUTING.md#known-panic-models--schema).
+
+The registry is intentionally conservative — entries require either a confirmed production reproduction or a clear upstream issue with a reproducible signature. We don't want false positives blacklisting models that work fine for most users.
+
+**Why not just read mlx#3186 comments?** Because that thread mixes hardware reports, hypotheses, attempted fixes, and unrelated discussion. The registry distils it into structured advisory data that your code can query with `check_known_panic_model()` — and your panic report doesn't disappear into a 50-comment thread.
+
 ## The Problem
 
 Apple's Metal GPU driver on Apple Silicon has a bug: when GPU memory management fails, **the kernel panics the entire machine** instead of gracefully killing the process.
diff --git a/README.zh-TW.md b/README.zh-TW.md