Commit 205f6ab

Harper authored and committed
docs: surface KNOWN_PANIC_MODELS as community-curated registry
Make the panic registry discoverable. v0.9.0 shipped the data structure and API, but the README didn't surface it as a contribution path: users landed on the repo searching for "kernel panic IOGPUMemory" and found the defensive layers but not the registry their report could feed.

Changes:
- README.md / README.zh-TW.md / README.ja.md: prominent section above "The Problem" framing the registry as community-curated, with a quick example, a per-field schema summary, and how to contribute
- .github/ISSUE_TEMPLATE/known-panic-report.yml: structured form (model_id / hardware dropdown / RAM / macOS build / panic_signature / workload / time_to_panic / metal_guard_state / verified workaround / cross_references / 3 confirmation checkboxes)
- CONTRIBUTING.md: full schema docs (required vs optional fields, quality bar, full example entry, submission workflow)

Why now: organic adoption signals are appearing. mlx#3186 cited "@Harperbot's two-trigger-path hypothesis", and a user in omlx#862 is attempting to use metal-guard. The registry needs a frictionless contribution path before the next wave of users hits panics on different model × hardware combos.
1 parent 05387d2 commit 205f6ab

5 files changed

Lines changed: 354 additions & 0 deletions

.github/ISSUE_TEMPLATE/known-panic-report.yml

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
name: Known Panic Model Report
description: Report an MLX model that kernel-panics Apple Silicon — proposes addition to KNOWN_PANIC_MODELS
title: "[panic] <model-id> on <hardware>"
labels: ["known-panic-models", "panic-report"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for contributing to the [Community Panic Registry](https://github.com/Harperbot/metal-guard#-community-panic-registry--known_panic_models)!

        **Before filing**: please confirm the panic happened **with metal-guard's defensive layers engaged** (cross-process lock, cross-model cadence, cleanup helpers). The registry tracks panics that survive metal-guard mitigation — not panics that metal-guard already prevents.

        Also check existing entries in [`KNOWN_PANIC_MODELS`](../blob/main/metal_guard.py) — if your model is already there, please add your data point as a comment on the existing tracking issue rather than filing a new one.

  - type: input
    id: model_id
    attributes:
      label: Model ID
      description: Full HuggingFace path. Use the exact ID `mlx_lm.load()` / `mlx_vlm.load()` accepts.
      placeholder: "mlx-community/gemma-4-31b-it-8bit"
    validations:
      required: true

  - type: dropdown
    id: hardware
    attributes:
      label: Hardware
      options:
        - "M1 (base)"
        - "M1 Pro"
        - "M1 Max"
        - "M1 Ultra"
        - "M2 (base)"
        - "M2 Pro"
        - "M2 Max"
        - "M2 Ultra"
        - "M3 (base)"
        - "M3 Pro"
        - "M3 Max"
        - "M3 Ultra"
        - "M4 (base)"
        - "M4 Pro"
        - "M4 Max"
        - "M4 Ultra"
        - "M5 (base)"
        - "M5 Pro"
        - "M5 Max"
        - "Other (specify in workload field)"
    validations:
      required: true

  - type: input
    id: ram
    attributes:
      label: Unified memory (GB)
      placeholder: "64"
    validations:
      required: true

  - type: input
    id: macos
    attributes:
      label: macOS version (build)
      description: Run `sw_vers` to get the build number.
      placeholder: "26.4.1 (25E253)"
    validations:
      required: true

  - type: textarea
    id: panic_signature
    attributes:
      label: Panic signature
      description: |
        Open `/Library/Logs/DiagnosticReports/Retired/panic-full-*.panic` (Full Disk Access required) and paste the `panic(...)` line and the IOGPUMemory.cpp:NNN reference. Redact serial numbers / UUIDs as needed.
      placeholder: |
        panic(cpu 4 caller 0xfffffe0032a550f8):
        "completeMemory() prepare count underflow" @IOGPUMemory.cpp:492
      render: text
    validations:
      required: true

  - type: textarea
    id: workload
    attributes:
      label: Workload pattern
      description: |
        What was the model doing when it panicked? Examples:
        - Single-shot generate (prompt size, max tokens)
        - Long-running server (uptime, request rate, KV cache size)
        - Multi-model pipeline (which models cycled, in what order)
        - Concurrent calls (how many threads / processes)
      placeholder: |
        - mlx_lm.server, ~8 min uptime
        - 1× 23k-token prefill (completed OK), then 3× isolated short-prompt generate
        - Panic ~30s after 3rd short-prompt generate
        - Sequential through HTTP, no concurrent client requests
        - --prompt-cache-bytes 8589934592 set
    validations:
      required: true

  - type: input
    id: time_to_panic
    attributes:
      label: Time from worker-ready to panic
      description: How long after the model finished loading did the panic occur? Use "first call" if it panicked on the first generate.
      placeholder: "8 min 16 s"
    validations:
      required: true

  - type: textarea
    id: metal_guard_state
    attributes:
      label: metal-guard configuration in effect
      description: Which of the L1–L9 layers were active? Output of `metal_guard.describe_mode()` if available.
      placeholder: |
        - L1 thread tracking: yes (3 threads registered)
        - L7 subprocess isolation: no (in-process)
        - L8 cross-process lock: yes
        - L9 CadenceGuard: yes (180s)
        - L9 CircuitBreaker: yes (24h, panic_count=0 prior)
        - METALGUARD_MODE=defensive
    validations:
      required: true

  - type: textarea
    id: workaround
    attributes:
      label: Verified workaround (if any)
      description: |
        What did you switch to that does NOT panic on the same hardware + workload? Concrete data points are most valuable — "switched to Ollama and ran X requests over Y days without panic".
      placeholder: |
        Switched to llama.cpp on same hardware, same model family — ran 24 hours under same workload, 0 panics.
        OR
        Pivoted to mlx-community/gemma-4-26b-a4b-it-4bit (MoE variant) — same panic-free pattern.

  - type: textarea
    id: cross_references
    attributes:
      label: Cross-references
      description: |
        Other places this panic has been reported — GitHub issue threads, lmstudio bugs, forum posts, Discord screenshots. Helps establish reproducibility across users.
      placeholder: |
        - ml-explore/mlx#3186 (similar M4 base 32GB report)
        - lmstudio bug #1740

  - type: checkboxes
    id: confirmations
    attributes:
      label: Confirmations
      options:
        - label: "I confirmed metal-guard's defensive layers were engaged when the panic occurred"
          required: true
        - label: "I checked existing `KNOWN_PANIC_MODELS` entries and this model is not already listed"
          required: true
        - label: "I'm willing to share the redacted panic-full-*.panic file if maintainers ask for it"
          required: false

CONTRIBUTING.md

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Contributing to metal-guard

Thanks for considering a contribution. This document covers the two highest-leverage contribution paths: **panic reports for the registry** and **code / docs PRs**.

## Known Panic Models — schema

`KNOWN_PANIC_MODELS` is a community-curated dict of MLX model IDs that kernel-panic Apple Silicon Macs *even with metal-guard's defensive layers engaged*. The schema is intentionally rich so the entry stays useful when others read it months later.

### Required fields

| Field | Type | Description |
|---|---|---|
| `panic_signature` | str | Exact `IOGPUMemory.cpp:NNN` line + keyword. Match the C++ source location, not just the panic string — Apple sometimes renames the human-readable text but keeps the line number. |
| `first_observed` | str (ISO date `YYYY-MM-DD`) | First reproduction. |
| `last_observed` | str (ISO date `YYYY-MM-DD`) | Most recent reproduction. Bump on each new data point. |
| `reproductions` | list[str] | Production data points. Each entry must include hardware + RAM + time-to-panic + workload summary. Format: `"<hardware> <ram>GB — <date> — <duration> from worker-ready — <workload one-liner>"`. |
| `recommendation` | str | Actionable workaround. Specific (backend / model / config) is more useful than generic ("be careful"). Cite the metal-guard version that was tried — recommendations age. |
| `upstream` | list[str] | URLs of upstream tracking issues (mlx / mlx-lm / mlx-vlm GitHub). At least one. |

### Optional fields

| Field | Type | Description |
|---|---|---|
| `community` | list[str] | External cross-references (GitHub comments by other users, lmstudio bugs, forum threads). Strengthens "this isn't just one user". |
| `panic_by_hardware` | dict | Reserved for v0.10+ schema upgrade — per-hardware observation matrix. Don't add yet. |
| `notes` | str | Caveats, environmental specifics, anything that would surprise the next reader. |
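
The field tables above can be checked mechanically before a PR is opened. The following is a minimal sketch and is not part of metal-guard: `validate_entry`, `REQUIRED_FIELDS`, and `OPTIONAL_FIELDS` are hypothetical names introduced for illustration, and the checks cover only what the tables state (field presence, basic types, ISO dates, at least one `upstream` URL, and the reserved `panic_by_hardware` field).

```python
from datetime import date

# Hypothetical helper names; field names and types are taken from the tables above.
REQUIRED_FIELDS = {
    "panic_signature": str,
    "first_observed": str,
    "last_observed": str,
    "reproductions": list,
    "recommendation": str,
    "upstream": list,
}
OPTIONAL_FIELDS = {"community": list, "panic_by_hardware": dict, "notes": str}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entry passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing required field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, value in entry.items():
        if name in OPTIONAL_FIELDS and not isinstance(value, OPTIONAL_FIELDS[name]):
            problems.append(f"{name} should be {OPTIONAL_FIELDS[name].__name__}")
        elif name not in REQUIRED_FIELDS and name not in OPTIONAL_FIELDS:
            problems.append(f"unknown field: {name}")
    # Both dates must parse as ISO dates and be in chronological order.
    parsed = {}
    for name in ("first_observed", "last_observed"):
        value = entry.get(name)
        if isinstance(value, str):
            try:
                parsed[name] = date.fromisoformat(value)
            except ValueError:
                problems.append(f"{name} is not an ISO date (YYYY-MM-DD)")
    if len(parsed) == 2 and parsed["first_observed"] > parsed["last_observed"]:
        problems.append("first_observed is later than last_observed")
    # The table asks for at least one upstream tracking URL.
    if isinstance(entry.get("upstream"), list) and not entry["upstream"]:
        problems.append("upstream needs at least one URL")
    if "panic_by_hardware" in entry:
        problems.append("panic_by_hardware is reserved for v0.10+; do not add yet")
    return problems
```

Returning a list of problems rather than raising on the first one lets a contributor fix everything in a single pass.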

### Quality bar

Entries are conservative by design. We accept either:

1. **A clean production reproduction** — same hardware reproducing the same panic signature on the same model, with metal-guard's L7/L8/L9 layers active. One-shot anecdotes go in `community`, not `reproductions`.
2. **A confirmed upstream issue** with the same panic signature where the model is named in the issue body or a maintainer comment.

We **do not** accept:
- "Sometimes panics, sometimes doesn't" without a reproduction recipe
- Models that only panic without metal-guard engaged (those go in the README's "who is affected" section, not the registry)
- Models whose panic was clearly a different root cause (OOM-on-load, transformers ImportError, etc.) — those have separate handling in `_VERSION_ADVISORIES`

### Example entry

```python
KNOWN_PANIC_MODELS = {
    "mlx-community/gemma-4-31b-it-8bit": {
        "panic_signature": "IOGPUMemory.cpp:492 prepare_count_underflow",
        "first_observed": "2026-04-23",
        "last_observed": "2026-04-24",
        "reproductions": [
            "M1 Ultra 64GB — 2026-04-23 03:14 local — ~6 min from worker-ready — "
            "subprocess worker, pre-cross-model-cadence, gemma-4 first-gen flush absent",
            "M1 Ultra 64GB — 2026-04-24 03:14 local — ~1.5 min from worker-ready — "
            "same pipeline, post-fix attempt, panicked sooner",
        ],
        "community": [
            "Hannecke (M4 Max 64GB) — ml-explore/mlx#3186 — pivoted to "
            "Qwen3-Coder-30B-A3B MoE",
            "lmstudio bug #1740 — hybrid attention (50 sliding + 10 global) "
            "KV cache 8-bit weights 34GB + full ctx KV 20GB+ > 54GB",
            "ml-explore/mlx-lm#883 (M3 Ultra 96GB)",
        ],
        "recommendation": (
            "metal-guard v0.9.0 narrows the race window via cross-model cadence "
            "(C5) + gemma4_generation_flush (C7) + subprocess_inference_guard "
            "(B1), but does NOT eliminate panic on this model in production "
            "workloads. Switch backend (Ollama / llama.cpp) or pivot to MoE "
            "variant (e.g. mlx-community/gemma-4-26b-a4b-it-4bit)."
        ),
        "upstream": [
            "https://github.com/ml-explore/mlx/issues/3186",
            "https://github.com/ml-explore/mlx-lm/issues/883",
            "https://github.com/ml-explore/mlx/issues/3346",
        ],
    },
}
```

### How to submit

1. **File a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml)**. The issue template walks through the schema, and maintainers will draft the dict entry from your report.
2. **OR** open a PR that modifies `KNOWN_PANIC_MODELS` in `metal_guard.py` directly. Include the number of the issue you opened first so reviewers can cross-check.

Maintainers may ask for additional data, typically the redacted panic-full-*.panic file (reading it requires Full Disk Access on macOS), to confirm the signature before merging.
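
Because the issue template's fields map closely onto the schema, drafting an entry from a report is largely mechanical. The sketch below is illustrative only and is not metal-guard tooling: `draft_entry` is a hypothetical helper, its keyword arguments mirror the template's field ids, and the caller supplies the observation date and recommendation because the template does not collect them in schema form. It builds a `reproductions` line in the documented format.

```python
SEP = " \u2014 "  # the separator used in the documented reproductions format


def draft_entry(*, hardware: str, ram: str, time_to_panic: str, workload: str,
                panic_signature: str, observed: str, upstream: list[str],
                recommendation: str) -> dict:
    """Build a draft KNOWN_PANIC_MODELS value from issue-form answers (hypothetical helper)."""
    # Collapse the multi-line workload textarea into a one-line summary.
    summary = "; ".join(
        line.strip().removeprefix("- ")
        for line in workload.splitlines()
        if line.strip()
    )
    reproduction = SEP.join([
        f"{hardware} {ram}GB",
        observed,
        f"{time_to_panic} from worker-ready",
        summary,
    ])
    return {
        "panic_signature": panic_signature,
        "first_observed": observed,
        "last_observed": observed,  # a single report is both first and latest
        "reproductions": [reproduction],
        "recommendation": recommendation,
        "upstream": list(upstream),
    }
```

A draft produced this way still needs a maintainer's judgement against the quality bar; the helper only removes the formatting work.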

## Code / docs PRs

Standard GitHub flow. Run `pytest` before submitting. CHANGELOG.md update is required for behavioural changes; not required for typo fixes / docs polish.

If your PR adds a new defence layer (L10+), please also extend the test matrix to cover the new layer's failure modes.

## License

By contributing you agree your contribution is licensed under the same MIT license as the project.

README.ja.md

Lines changed: 35 additions & 0 deletions
@@ -32,6 +32,41 @@ MLX を動かしている Mac が panic / 再起動 / クラッシュし、以
Related upstream tracking: `ml-explore/mlx#3186` / `#3346` / `#3348` / `#3350` / `#3384` / `#3390`, `ml-explore/mlx-lm#883` / `#854` / `#897` / `#1015` / `#1047`, `Blaizzy/mlx-vlm#943` / `#967` / `#999` / `#1011` / `#1016`. metal-guard watches these via `check_version_advisories()` and warns at startup if the installed versions are affected.

## 📋 Community Panic Model Registry — `KNOWN_PANIC_MODELS`

**A list of MLX models that trigger kernel panics on Apple Silicon Macs, curated jointly by users.** It includes hardware context, root-cause hypotheses, and verified workarounds.

There is no fix in sight for Apple's IOGPUFamily driver bug. The bug itself is an upstream problem, but **which models trigger it under which workloads is something the community can know**. For now, though, that knowledge is scattered across GitHub issue threads, lmstudio bug reports, Discord screenshots, and `panic-full-*.panic` files that nobody publishes.

metal-guard provides a structured home for that knowledge:

```python
from metal_guard import check_known_panic_model, warn_if_known_panic_model

model_id = "mlx-community/gemma-4-31b-it-8bit"

# Check before loading
advisory = check_known_panic_model(model_id)
if advisory is not None:
    print(advisory["recommendation"])

# Or a fire-and-forget warning at load time (only once per process per model_id)
warn_if_known_panic_model(model_id)
```

Each entry has:
- **`panic_signature`**: the exact `IOGPUMemory.cpp:NNN` line + keyword to match against your `panic-full-*.panic` log
- **`reproductions`**: production data points (hardware / RAM / time to panic / workload)
- **`community`**: cross-references to GitHub issues / lmstudio bugs / forum threads from others who hit the same panic
- **`recommendation`**: actionable workaround (backend switch / model change / cadence config)
- **`upstream`**: links to the GitHub issues tracking the underlying driver bug

### How to contribute

If you hit a kernel panic with a specific MLX model **while metal-guard's defensive layers were enabled**, your data point is valuable. Please file a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml): the issue template walks you through the schema (model ID / hardware / panic signature / workload / time to panic / verified workaround). Schema docs: [CONTRIBUTING.md](CONTRIBUTING.md#known-panic-models--schema)

The registry is conservative by design: an entry requires either a clear reproduction in production or an upstream issue with an explicit signature. We want to avoid wrongly blacklisting models that work fine for many users.

**Why isn't reading the mlx#3186 comments enough?** Because that thread mixes hardware reports, hypotheses, attempted fixes, and unrelated discussion. The registry distils it into structured advisories that can be queried with `check_known_panic_model()`, and your panic report won't get buried under 50 comments.

## The Problem

The Metal GPU driver on Apple Silicon has a bug: when GPU memory management fails, **the kernel panics and takes the whole machine down instead of cleanly killing the process**.

README.md

Lines changed: 38 additions & 0 deletions
@@ -32,6 +32,44 @@ If your Mac is panicking / rebooting / crashing while running MLX and you search
Related upstream tracking: `ml-explore/mlx#3186` / `#3346` / `#3348` / `#3350` / `#3384` / `#3390`, `ml-explore/mlx-lm#883` / `#854` / `#897` / `#1015` / `#1047`, `Blaizzy/mlx-vlm#943` / `#967` / `#999` / `#1011` / `#1016`. metal-guard watches these via `check_version_advisories()` and warns at startup if the installed versions are affected.

## 📋 Community Panic Registry — `KNOWN_PANIC_MODELS`

**A user-curated list of MLX models that kernel-panic Apple Silicon Macs in production, with hardware contexts, root-cause hypotheses, and verified workarounds.**

Apple's IOGPUFamily driver bug has no fix timeline. The bug itself is upstream, but **which models trigger it under which workloads is something the community can know**. Today that knowledge is scattered across GitHub issue threads, lmstudio bug reports, Discord screenshots, and individual `panic-full-*.panic` files nobody publishes.

metal-guard provides a structured home for this knowledge:

```python
from metal_guard import check_known_panic_model, warn_if_known_panic_model

model_id = "mlx-community/gemma-4-31b-it-8bit"

# Check before loading
advisory = check_known_panic_model(model_id)
if advisory is not None:
    print(advisory["recommendation"])
    # → "metal-guard v0.9.0 narrows the race window... but does NOT eliminate
    #    panic on this model. Switch backend (Ollama / llama.cpp) or pivot
    #    to MoE variant (e.g. mlx-community/gemma-4-26b-a4b-it-4bit)."

# Or fire-and-forget warning at load time (per-process dedup)
warn_if_known_panic_model(model_id)
```
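
The comment on `warn_if_known_panic_model()` mentions per-process dedup. One way such behaviour could work is sketched below. This is an assumption about the general technique, not metal-guard's actual implementation: `REGISTRY`, `lookup`, and `warn_once` are hypothetical stand-ins for `KNOWN_PANIC_MODELS`, `check_known_panic_model()`, and `warn_if_known_panic_model()`, and the registry content is abbreviated to a single field.

```python
import warnings

# Hypothetical stand-in for KNOWN_PANIC_MODELS, abbreviated to one field.
REGISTRY = {
    "mlx-community/gemma-4-31b-it-8bit": {
        "recommendation": "Switch backend (Ollama / llama.cpp) or pivot to MoE variant.",
    },
}
_already_warned: set[str] = set()  # module-level, so it lives for the whole process


def lookup(model_id: str):
    """Return the advisory dict for model_id, or None if the model is not listed."""
    return REGISTRY.get(model_id)


def warn_once(model_id: str) -> bool:
    """Warn the first time a listed model is seen; return True only if a warning was emitted."""
    advisory = lookup(model_id)
    if advisory is None or model_id in _already_warned:
        return False
    _already_warned.add(model_id)
    warnings.warn(
        f"{model_id} is a known panic model: {advisory['recommendation']}",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

Keeping the seen-set at module level means a long-running server that reloads the same model many times logs the advisory once rather than on every load.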

Each entry carries:
- **`panic_signature`** — the exact `IOGPUMemory.cpp:NNN` line + keyword to match against your `panic-full-*.panic` log
- **`reproductions`** — production data points (hardware, RAM, time-to-panic, workload)
- **`community`** — cross-references to GitHub issues / lmstudio bugs / forum threads where others hit the same panic
- **`recommendation`** — actionable workaround (backend switch / model pivot / cadence config)
- **`upstream`** — links to the GitHub issues tracking the underlying driver bug
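
Matching a log against an entry's `panic_signature` comes down to finding the `IOGPUMemory.cpp:NNN` reference. A minimal sketch follows, assuming the log contains that reference as literal text (as in the issue template's placeholder). `extract_source_location` and `matches_entry` are hypothetical helpers, not metal-guard functions.

```python
import re

# Matches the source-location part of a signature, e.g. "IOGPUMemory.cpp:492".
_LOCATION = re.compile(r"IOGPUMemory\.cpp:(\d+)")


def extract_source_location(text: str):
    """Return the first 'IOGPUMemory.cpp:NNN' reference found in text, or None."""
    match = _LOCATION.search(text)
    return match.group(0) if match else None


def matches_entry(panic_log: str, entry: dict) -> bool:
    """True if the log and the entry's panic_signature name the same source line."""
    wanted = extract_source_location(entry.get("panic_signature", ""))
    found = extract_source_location(panic_log)
    return wanted is not None and wanted == found
```

Comparing on the source location rather than the panic string follows the CONTRIBUTING.md guidance that the human-readable text may be renamed while the line number stays the same.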

### How to contribute

If you've hit a kernel panic on a specific MLX model **with metal-guard's defensive layers fully engaged**, your data point is valuable. Open a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml). The template walks you through the schema (model ID / hardware / panic signature / workload / time-to-panic / verified workaround). Schema docs are in [CONTRIBUTING.md](CONTRIBUTING.md#known-panic-models--schema).

The registry is intentionally conservative: entries require either a confirmed production reproduction or a clear upstream issue with a reproducible signature. We don't want false positives blacklisting models that work fine for most users.

**Why not just read mlx#3186 comments?** Because that thread mixes hardware reports, hypotheses, attempted fixes, and unrelated discussion. The registry distils it into structured advisory data your code can query with `check_known_panic_model()`, and your panic report doesn't disappear into a 50-comment thread.

## The Problem

Apple's Metal GPU driver on Apple Silicon has a bug: when GPU memory management fails, **the kernel panics the entire machine** instead of gracefully killing the process.
