| 1 | +# Contributing to metal-guard |
| 2 | + |
| 3 | +Thanks for considering a contribution. This document covers the two highest-leverage contribution paths: **panic reports for the registry** and **code / docs PRs**. |
| 4 | + |
| 5 | +## Known Panic Models — schema |
| 6 | + |
| 7 | +`KNOWN_PANIC_MODELS` is a community-curated dict of MLX model IDs that kernel-panic Apple Silicon Macs *even with metal-guard's defensive layers engaged*. The schema is intentionally rich so that each entry stays useful when others read it months later. |
| 8 | + |
| 9 | +### Required fields |
| 10 | + |
| 11 | +| Field | Type | Description | |
| 12 | +|---|---|---| |
| 13 | +| `panic_signature` | str | Exact `IOGPUMemory.cpp:NNN` line + keyword. Match the C++ source location, not just the panic string — Apple sometimes renames the human-readable text but keeps the line number. | |
| 14 | +| `first_observed` | str (ISO date `YYYY-MM-DD`) | First reproduction. | |
| 15 | +| `last_observed` | str (ISO date `YYYY-MM-DD`) | Most recent reproduction. Bump on each new data point. | |
| 16 | +| `reproductions` | list[str] | Production data points. Each entry must include hardware + RAM + time-to-panic + workload summary. Format: `"<hardware> <ram>GB — <date> — <duration> from worker-ready — <workload one-liner>"`. | |
| 17 | +| `recommendation` | str | Actionable workaround. A specific suggestion (backend / model / config) is more useful than a generic one ("be careful"). Cite the metal-guard version that was tried, because recommendations age. |
| 18 | +| `upstream` | list[str] | URLs of upstream tracking issues (mlx / mlx-lm / mlx-vlm GitHub). At least one. | |
| 19 | + |
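The `reproductions` format above can be checked mechanically before submitting. The following is a minimal sketch, not part of metal-guard: `parse_reproduction`, `SEP`, `HARDWARE_RE`, and `SUFFIX` are hypothetical names introduced here, and the sketch assumes the four fields are separated by space, em dash, space exactly as in the format string.

```python
import re

SEP = " \u2014 "  # assumed separator: space, em dash, space
HARDWARE_RE = re.compile(r"^(?P<hardware>.+?)\s+(?P<ram_gb>\d+)\s*GB$")
SUFFIX = " from worker-ready"


def parse_reproduction(line: str) -> dict:
    """Split one reproductions string into its four documented fields."""
    # maxsplit=3 keeps any later separators inside the workload summary.
    parts = line.split(SEP, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    machine, when, duration, workload = (p.strip() for p in parts)

    match = HARDWARE_RE.match(machine)
    if match is None:
        raise ValueError(f"cannot read hardware and RAM from {machine!r}")
    if not duration.endswith(SUFFIX):
        raise ValueError(f"duration should end with {SUFFIX.strip()!r}")

    return {
        "hardware": match["hardware"],
        "ram_gb": int(match["ram_gb"]),
        "date": when,
        "duration": duration[: -len(SUFFIX)],
        "workload": workload,
    }


sample = SEP.join([
    "M1 Ultra 64GB",
    "2026-04-23 03:14 local",
    "~6 min from worker-ready",
    "subprocess worker, pre-cross-model-cadence",
])
parsed = parse_reproduction(sample)
print(parsed)
```

A string that fails to parse this way is probably missing one of the required parts (hardware, RAM, time-to-panic, or workload summary).
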
| 20 | +### Optional fields |
| 21 | + |
| 22 | +| Field | Type | Description | |
| 23 | +|---|---|---| |
| 24 | +| `community` | list[str] | External cross-references (GitHub comments by other users, lmstudio bugs, forum threads). Strengthens "this isn't just one user". | |
| 25 | +| `panic_by_hardware` | dict | Reserved for v0.10+ schema upgrade — per-hardware observation matrix. Don't add yet. | |
| 26 | +| `notes` | str | Caveats, environmental specifics, anything that would surprise the next reader. | |
| 27 | + |
| 28 | +### Quality bar |
| 29 | + |
| 30 | +Entries are conservative by design. We accept either: |
| 31 | + |
| 32 | +1. **A clean production reproduction**: the same hardware reproducing the same panic signature on the same model, with metal-guard's L7/L8/L9 layers active. One-shot anecdotes go in `community`, not `reproductions`. |
| 33 | +2. **A confirmed upstream issue** with the same panic signature where the model is named in the issue body or a maintainer comment. |
| 34 | + |
| 35 | +We **do not** accept: |
| 36 | +- "Sometimes panics, sometimes doesn't" without a reproduction recipe |
| 37 | +- Models that only panic without metal-guard engaged (those go in the README's "who is affected" section, not the registry) |
| 38 | +- Models whose failure clearly had a different root cause (OOM-on-load, transformers ImportError, etc.); those are handled separately in `_VERSION_ADVISORIES` |
| 39 | + |
| 40 | +### Example entry |
| 41 | + |
| 42 | +```python |
| 43 | +"mlx-community/gemma-4-31b-it-8bit": { |
| 44 | + "panic_signature": "IOGPUMemory.cpp:492 prepare_count_underflow", |
| 45 | + "first_observed": "2026-04-23", |
| 46 | + "last_observed": "2026-04-24", |
| 47 | + "reproductions": [ |
| 48 | + "M1 Ultra 64GB — 2026-04-23 03:14 local — ~6 min from worker-ready — " |
| 49 | + "subprocess worker, pre-cross-model-cadence, gemma-4 first-gen flush absent", |
| 50 | + "M1 Ultra 64GB — 2026-04-24 03:14 local — ~1.5 min from worker-ready — " |
| 51 | + "same pipeline, post-fix attempt, panicked sooner", |
| 52 | + ], |
| 53 | + "community": [ |
| 54 | + "Hannecke (M4 Max 64GB) — ml-explore/mlx#3186 — pivoted to " |
| 55 | + "Qwen3-Coder-30B-A3B MoE", |
| 56 | + "lmstudio bug #1740 — hybrid attention (50 sliding + 10 global) " |
| 57 | + "KV cache 8-bit weights 34GB + full ctx KV 20GB+ > 54GB", |
| 58 | + "ml-explore/mlx-lm#883 (M3 Ultra 96GB)", |
| 59 | + ], |
| 60 | + "recommendation": ( |
| 61 | + "metal-guard v0.9.0 narrows the race window via cross-model cadence " |
| 62 | + "(C5) + gemma4_generation_flush (C7) + subprocess_inference_guard " |
| 63 | + "(B1), but does NOT eliminate panic on this model in production " |
| 64 | + "workloads. Switch backend (Ollama / llama.cpp) or pivot to MoE " |
| 65 | + "variant (e.g. mlx-community/gemma-4-26b-a4b-it-4bit)." |
| 66 | + ), |
| 67 | + "upstream": [ |
| 68 | + "https://github.com/ml-explore/mlx/issues/3186", |
| 69 | + "https://github.com/ml-explore/mlx-lm/issues/883", |
| 70 | + "https://github.com/ml-explore/mlx/issues/3346", |
| 71 | + ], |
| 72 | +}, |
| 73 | +``` |
| 74 | + |
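Before opening an issue or PR, an entry like the one above can be checked against the schema tables. This is a rough sketch under stated assumptions: `validate_entry`, `REQUIRED_FIELDS`, and `OPTIONAL_FIELDS` are hypothetical names, not metal-guard APIs, and the checks cover only what the tables state (field presence, types, ISO dates, at least one upstream URL, and the reserved `panic_by_hardware` key).

```python
from datetime import date

REQUIRED_FIELDS = {
    "panic_signature": str,
    "first_observed": str,
    "last_observed": str,
    "reproductions": list,
    "recommendation": str,
    "upstream": list,
}
OPTIONAL_FIELDS = {"community": list, "panic_by_hardware": dict, "notes": str}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing required field: {name}")
        elif not isinstance(entry[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name in entry:
        if name not in REQUIRED_FIELDS and name not in OPTIONAL_FIELDS:
            problems.append(f"unknown field: {name}")
    if "panic_by_hardware" in entry:
        problems.append("panic_by_hardware is reserved; do not add it yet")

    # Both dates must parse, and the first must not be later than the last.
    parsed = {}
    for name in ("first_observed", "last_observed"):
        value = entry.get(name)
        if isinstance(value, str):
            try:
                parsed[name] = date.fromisoformat(value)
            except ValueError:
                problems.append(f"{name} is not an ISO date: {value!r}")
    if len(parsed) == 2 and parsed["first_observed"] > parsed["last_observed"]:
        problems.append("first_observed is later than last_observed")

    upstream = entry.get("upstream")
    if isinstance(upstream, list) and not upstream:
        problems.append("upstream needs at least one URL")
    return problems


# A deliberately broken entry: dates swapped and no upstream URL.
draft = {
    "panic_signature": "IOGPUMemory.cpp:492 prepare_count_underflow",
    "first_observed": "2026-04-24",
    "last_observed": "2026-04-23",
    "reproductions": [],
    "recommendation": "Switch backend (Ollama / llama.cpp).",
    "upstream": [],
}
problems = validate_entry(draft)
for problem in problems:
    print(problem)
```

The sketch does not judge the quality bar itself (for example, whether a reproduction is clean); that still needs a human reviewer.
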
| 75 | +### How to submit |
| 76 | + |
| 77 | +1. **File a [Known Panic Model report](https://github.com/Harperbot/metal-guard/issues/new?template=known-panic-report.yml)**. The issue template walks through the schema, and maintainers will draft the dict entry from your report. |
| 78 | +2. **OR** open a PR that modifies `KNOWN_PANIC_MODELS` in `metal_guard.py` directly. Include the number of the issue you opened first so reviewers can cross-check. |
| 79 | + |
| 80 | +Maintainers may ask for additional data to confirm the signature before merging, typically the redacted `panic-full-*.panic` file (reading it requires Full Disk Access on macOS). |
| 81 | + |
| 82 | +## Code / docs PRs |
| 83 | + |
| 84 | +Standard GitHub flow. Run `pytest` before submitting. A `CHANGELOG.md` update is required for behavioural changes, but not for typo fixes or docs polish. |
| 85 | + |
| 86 | +If your PR adds a new defence layer (L10+), please also extend the test matrix to cover the new layer's failure modes. |
| 87 | + |
| 88 | +## License |
| 89 | + |
| 90 | +By contributing you agree your contribution is licensed under the same MIT license as the project. |