levnikolaevich
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 4 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/README.md b/docs/README.md
Lines changed: 3 additions & 0 deletions
diff --git a/docs/architecture/SKILL_ARCHITECTURE_GUIDE.md b/docs/architecture/SKILL_ARCHITECTURE_GUIDE.md
Lines changed: 8 additions & 0 deletions
diff --git a/docs/best-practice/MCP_OUTPUT_CONTRACT_GUIDE.md b/docs/best-practice/MCP_OUTPUT_CONTRACT_GUIDE.md
Lines changed: 15 additions & 7 deletions
diff --git a/docs/best-practice/MCP_TOOL_DESIGN_GUIDE.md b/docs/best-practice/MCP_TOOL_DESIGN_GUIDE.md
Lines changed: 8 additions & 4 deletions
diff --git a/docs/plugins/agile-workflow.md b/docs/plugins/agile-workflow.md
Lines changed: 2 additions & 2 deletions
@@ -70,6 +70,8 @@ Use `hex-line` first for repo file reads/search/edits on code, config, scripts,
 | Architecture patterns (L0-L3) | `cat docs/architecture/SKILL_ARCHITECTURE_GUIDE.md` |
 | Agent delegation runtime | `cat docs/architecture/AGENT_DELEGATION_PLATFORM_GUIDE.md` |
 | Tool configuration | `cat skills-catalog/shared/references/environment_state_contract.md` |
+| Loop health model | `cat skills-catalog/shared/references/loop_health_contract.md` |
+| Procedural SOP/TWI guide | `cat skills-catalog/shared/references/procedural_skill_sop_guide.md` |
 | Key workflow | `ln-700 -> ln-100 -> ln-200 -> ln-1000` |
 | Skill metadata | `head -20 {ln-NNN}/SKILL.md` |
 | Reference files for a skill | `ls {ln-NNN}/references/` |
@@ -86,6 +88,8 @@ Use `hex-line` first for repo file reads/search/edits on code, config, scripts,
 | Agent instructions writing guide | `skills-catalog/shared/references/agent_instructions_writing_guide.md` |
 | Writing guidelines | `docs/architecture/SKILL_ARCHITECTURE_GUIDE.md` |
 | Environment State | `skills-catalog/shared/references/environment_state_contract.md` |
+| Loop Health | `skills-catalog/shared/references/loop_health_contract.md` |
+| Procedural SOP/TWI | `skills-catalog/shared/references/procedural_skill_sop_guide.md` |
 | Risk-Based Testing | `skills-catalog/shared/references/risk_based_testing_guide.md` |
 | Frontmatter fields | `skills-catalog/shared/references/frontmatter_reference.md` |
 | Questions format | `skills-catalog/shared/references/questions_format.md` |
 
@@ -112,11 +112,13 @@ ln-200-scope-decomposer         # 2. Scope -> Epics -> Stories
 ln-1000-pipeline-orchestrator   # 3. Full artifact-driven pipeline: 300 → 310 → 400 → 500 → Done
 ```
 
+Coordinators keep lifecycle status separate from Loop Health: `status` says where the run is, artifacts/checkpoints prove completion, and `loop_health` decides whether another retry is useful. Procedural skills use SOP/TWI-style point-of-use checklists so risky steps carry action, key point, why, evidence, exception, and guard close to the moment of use.
+
 ---
 
 ## MCP Servers (Optional)
 
-Bundled MCP servers extend agent capabilities — hash-verified editing, code intelligence, and remote access. All skills work without MCP (fallback to built-in tools), but MCP servers improve accuracy and save tokens.
+Bundled MCP servers extend agent capabilities — hash-verified editing, code intelligence, and remote access. All skills work without MCP (fallback to built-in tools), but MCP servers improve accuracy and save tokens. MCP errors stay as `status: "ERROR"` and include `failure_class`, `next_action`, and recovery fields so skills can feed transport/tool/auth/rate-limit signals into Loop Health without inventing a second retry loop.
 
 ### Bundled servers
 
@@ -129,7 +131,7 @@ Bundled MCP servers extend agent capabilities — hash-verified editing, code in
 Deterministic scope rule: `hex-line` and `hex-graph` keep `path` as the project anchor. In normal use the agent fills it automatically from the active file or project root, so users usually do not need to type it manually. `hex-ssh` runs on Windows/macOS/Linux hosts; remote shell tools stay POSIX-oriented, while SFTP transfers support platform-aware remote paths.
 
 <!-- GENERATED:HEX_GRAPH_MCP_STATUS:START -->
-`hex-graph-mcp` quality snapshot: `103/103` tests passing, `1` curated corpus, `1` pinned external corpora, parser-first `green`.
+`hex-graph-mcp` quality snapshot: `106/106` tests passing, `1` curated corpus, `1` pinned external corpora, parser-first `green`.
 <!-- GENERATED:HEX_GRAPH_MCP_STATUS:END -->
 
 ### External servers
 
@@ -11,6 +11,7 @@ docs/
 |   `-- AGENT_DELEGATION_PLATFORM_GUIDE.md # Skill vs subagent runtime, recovery, Windows
 |-- best-practice/                   # Claude Code usage guidance
 |   |-- COMPONENT_SELECTION.md
+|   |-- MCP_TOOL_DESIGN_GUIDE.md     # MCP naming, bounded output, clean-cut migration, error classes
 |   |-- MCP_OUTPUT_CONTRACT_GUIDE.md # Canonical MCP status/reason/next_action vocabulary
 |   `-- WORKFLOW_TIPS.md
 |-- plugins/                         # Per-plugin landing pages
@@ -55,3 +56,5 @@ docs/
 |-------|------|
 | MCP tool design | [best-practice/MCP_TOOL_DESIGN_GUIDE.md](best-practice/MCP_TOOL_DESIGN_GUIDE.md) |
 | MCP output contract | [best-practice/MCP_OUTPUT_CONTRACT_GUIDE.md](best-practice/MCP_OUTPUT_CONTRACT_GUIDE.md) |
+| Loop Health contract | [../skills-catalog/shared/references/loop_health_contract.md](../skills-catalog/shared/references/loop_health_contract.md) |
+| Procedural SOP/TWI guide | [../skills-catalog/shared/references/procedural_skill_sop_guide.md](../skills-catalog/shared/references/procedural_skill_sop_guide.md) |
@@ -44,6 +44,8 @@ Rule of thumb:
 | **Single source of truth** | Put enforceable rules in shared refs, not scattered prose |
 | **Top-down ownership** | Coordinators know workers; workers do not encode ownership hierarchy back upward |
 | **Token efficiency** | Remove duplicated prose and keep only action-relevant detail |
+| **Loop-aware retries** | Keep lifecycle status separate from `loop_health`; repeated attempts need new evidence |
+| **SOP/TWI execution** | Procedural steps carry action, key point, why, evidence, exception, and guard at point of use |
 
 ### Skill Layers
 
@@ -157,6 +159,8 @@ These are repo heuristics, not universal laws.
 |---------|--------------|
 | table-oriented metadata | cheaper to scan than long prose |
 | imperative workflow steps | easier for agents to execute |
+| point-of-use risk checklists | prevents skipped critical steps when progressive disclosure hides later sections |
+| step -> key point -> why | makes risky instructions harder to bypass or reinterpret |
 | short direct sentences | lowers context cost and ambiguity |
 | `MANDATORY READ` only for execution-critical files | keeps context minimal and intentional |
 | detail in `references/` | prevents giant monolithic skills |
@@ -188,6 +192,8 @@ Use `skills-catalog/shared/concise_terms.md` for wording cleanup.
 | worker defines parent or coordinator | reverse coupling | remove ownership wording |
 | caller describes workers but never invokes them explicitly | agents tend to inline logic | add `Skill()` blocks and Worker Invocation section |
 | same threshold or rule repeated in many skills | drift risk | move to shared ref |
+| retries are driven only by lifecycle status | retry storms and same-error loops | use `shared/references/loop_health_contract.md` |
+| final DoD carries all safety checks | agents may miss point-of-use risks | colocate SOP/TWI checklist at the risky step |
 | giant inline instructions that are rarely needed | context waste | move to conditional shared ref |
 | stale platform/runtime references | agent confusion | update or delete immediately |
 | giant root map or giant skill manual | crowds out task context | keep map-first and route outward |
@@ -210,6 +216,7 @@ Use `skills-catalog/shared/concise_terms.md` for wording cleanup.
 - [ ] The chosen layer is appropriate
 - [ ] `Skill` vs `Agent` choice is justified
 - [ ] Shared refs reduce duplication instead of adding indirection
+- [ ] Retry loops use `loop_health` evidence instead of lifecycle status alone
 - [ ] File size and workflow shape still fit the responsibility
 
 ### Writing Check
@@ -218,6 +225,7 @@ Use `skills-catalog/shared/concise_terms.md` for wording cleanup.
 - [ ] Tables replace verbose prose where useful
 - [ ] Sentences are short and direct
 - [ ] Repeated instructions have been merged or removed
+- [ ] Procedural risky steps include point-of-use action/key point/why/evidence/exception/guard
 
 ---
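
The "loop-aware retries" rule in the hunks above (lifecycle status stays separate from `loop_health`, and repeated attempts need new evidence) could be sketched roughly as follows. This is an illustrative sketch only: `Attempt`, `LoopHealth`, `error_fingerprint`, `evidence`, and `max_same_error` are hypothetical names and do not come from `loop_health_contract.md`, whose actual schema is not shown in this diff.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass
class Attempt:
    """One attempt at a segment (task, worker, stage, or validation). Hypothetical shape."""
    segment: str
    error_fingerprint: Optional[str]          # None means the attempt succeeded
    evidence: FrozenSet[str] = frozenset()    # assumed: ids of artifacts/checkpoints produced


@dataclass
class LoopHealth:
    """Tracks attempts only; it never reads or writes the lifecycle `status`."""
    max_same_error: int = 2                   # assumed threshold, not from the contract
    attempts: List[Attempt] = field(default_factory=list)

    def record(self, attempt: Attempt) -> None:
        self.attempts.append(attempt)

    def retry_is_useful(self, segment: str) -> bool:
        history = [a for a in self.attempts if a.segment == segment]
        if not history or history[-1].error_fingerprint is None:
            return False  # nothing failed, so there is nothing to retry
        last = history[-1]
        same_error = [a for a in history
                      if a.error_fingerprint == last.error_fingerprint]
        if len(same_error) < self.max_same_error:
            return True
        # Same error repeated: retry only if the latest attempt added new evidence.
        earlier = frozenset().union(*(a.evidence for a in same_error[:-1]))
        return not last.evidence <= earlier
```

A coordinator would call `record()` after each attempt and consult `retry_is_useful()` before scheduling another one, while `status` continues to describe only where the run is.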
 
 
@@ -34,11 +34,12 @@ When a tool returns structured content, prefer this order:
 
 1. `status`
 2. `reason`
-3. `revision` / `file` / `path` / `query` identity
-4. `next_action` or `next_actions`
-5. `summary`
-6. recovery helpers such as `retry_edit`, `retry_edits`, `suggested_read_call`, `retry_plan`
-7. detailed sections such as `result`, `warnings`, `snippet`, `risk_summary`
+3. `failure_class` for classified errors
+4. `revision` / `file` / `path` / `query` identity
+5. `next_action` or `next_actions`
+6. `summary`
+7. recovery helpers such as `retry_edit`, `retry_edits`, `suggested_read_call`, `retry_plan`
+8. detailed sections such as `result`, `warnings`, `snippet`, `risk_summary`
 
 Reason: agents should see decision fields before supporting detail.
 
@@ -130,6 +131,11 @@ Rules:
 | `inspect_raw_diff` | Fall back to raw diff because semantic mode is unavailable |
 | `review_risks` | Inspect risk details before acting |
 | `no_action` | Nothing to do |
+| `fix_permissions` | Correct file, SSH, graph DB, or OS permissions |
+| `install_tool` | Install or expose a missing executable/provider |
+| `authenticate` | Configure credentials, token, OAuth, or SSH key |
+| `defer_retry` | Wait for rate-limit or quota recovery before retrying |
+| `retry_after_wait` | Retry after an idle timeout or transient busy resource clears |
 
 ### Canonical graph labels
 
@@ -202,15 +208,16 @@ Structured-output MCP errors should use this public shape in `structuredContent`
 - `summary`
 - `next_action`
 - `recovery`
+- `failure_class`
 - `error: { code, message, recovery }` (canonical sub-object)
 
-`summary` explains what failed. `next_action` names the immediate category of recovery. `recovery` gives the human-readable instruction.
+`summary` explains what failed. `next_action` names the immediate category of recovery. `recovery` gives the human-readable instruction. `failure_class` is the machine-readable transport/tool/auth/rate-limit/timeout signal consumed by Loop Health.
 
 On the MCP envelope level, ALSO set `isError: true` (see section 8).
 
 Do not return raw stack traces in public tool outputs.
 
-Text-grammar servers express errors inside their grammar. For `hex-graph`, the first line is `error <next_action> ...` and body lines carry details such as `!code=<CODE>` and `!message=<text>`.
+Text-grammar servers express errors inside their grammar. For `hex-graph`, the first line is `error <next_action> ...` and body lines carry details such as `!code=<CODE>`, `!failure_class=<CLASS>`, and `!message=<text>`.
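
A consumer of the text-grammar form described above might read it along these lines. This is a minimal sketch under assumptions: the diff only states that the first line is `error <next_action> ...` and that body lines carry `!key=value` details, so the function name `parse_text_error`, the handling of extra tokens on the first line, and the sample code `AUTH_REQUIRED` are all hypothetical.

```python
from typing import Dict


def parse_text_error(text: str) -> Dict[str, str]:
    """Extract next_action and !key=value details from a text-grammar error."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty response")
    head = lines[0].split()
    if len(head) < 2 or head[0] != "error":
        raise ValueError("not an error response")
    # Assumption: the second token of the first line is the next_action label.
    fields = {"next_action": head[1]}
    for line in lines[1:]:
        line = line.strip()
        if line.startswith("!") and "=" in line:
            key, value = line[1:].split("=", 1)  # split once so values may contain '='
            fields[key] = value
    return fields


sample = (
    "error authenticate\n"
    "!code=AUTH_REQUIRED\n"           # hypothetical code value
    "!failure_class=auth_missing\n"
    "!message=token missing"
)
parsed = parse_text_error(sample)
```

Here `parsed["failure_class"]` is the signal a skill would hand to Loop Health, and `parsed["next_action"]` is the recovery label.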
 
 ## 8. MCP Envelope
 
@@ -281,6 +288,7 @@ Before merging MCP output changes, check:
 - `next_action` / `next_actions` use labels, not prose
 - `summary` is compact and non-redundant
 - structured error outputs set BOTH `isError: true` and `structuredContent.status: "ERROR"`
+- structured and text-grammar errors include `failure_class` when a failure can affect retry usefulness
 - structured tools declare `outputSchema` that matches actual `structuredContent` shape (schema-contract tests)
 - text-grammar tools do not declare `outputSchema` and have grammar contract tests
 - large results set `_meta["anthropic/maxResultSizeChars"]`
 
@@ -39,7 +39,9 @@ Rule: if a tool can return >100 lines, it MUST support truncation or a compact m
 | `NOOP_EDIT` | Edit produced identical content | Inform file already has desired content |
 | `OUT_OF_RANGE` | Line number exceeds file length | Show boundary snippet with hashes |
 
-Anti-pattern: raw stack traces. Agents cannot act on `Error: ENOENT` -- they need recovery actions.
+Every public error also carries `failure_class`: `permission_denial`, `tool_missing`, `auth_missing`, `rate_limited`, `timeout_idle`, `timeout_productive`, or `unknown`. This lets skills feed MCP failures into Loop Health without making the MCP server own retry policy.
+
+Anti-pattern: raw stack traces. Agents cannot act on `Error: ENOENT` or `Cannot read properties of undefined` -- they need `code`, `summary`, `next_action`, `recovery`, `failure_class`, and `error`.
 
 ## 4. Tool Descriptions — WHEN to use, not WHAT it does
 
@@ -91,14 +93,15 @@ When truncating, prefer structured metadata that tells the agent how to narrow n
 
 Build a tool when: it saves tokens, adds verification, or prevents errors shell cannot catch.
 
-## 9. Evolution -- periodically review constraints
+## 9. Evolution -- clean-cut migrations for repo-owned MCP
 
 | Practice | Example |
 |----------|--------|
 | Review constraints | `TodoWrite` removed from Claude Code -- tools wrapping it become dead weight |
 | Track usage patterns | If agents never use `plain`, remove it or make it default |
-| Version schemas | Breaking input changes break cached agent behavior |
-| Deprecate before removing | "DEPRECATED: use X instead" in description, remove after one cycle |
+| Make clean cuts | Remove repo-owned obsolete names and update docs/tests in the same change |
+| Keep lifecycle stable | Preserve public `status` meanings; add evidence fields instead of a second status path |
+| Migrate by tests | Contract tests prove docs, schemas, and examples no longer mention the old path |
 
 ## 10. Tool Annotations
 
@@ -124,6 +127,7 @@ Canonical fields (match [MCP_OUTPUT_CONTRACT_GUIDE.md](./MCP_OUTPUT_CONTRACT_GUI
 - `status` -- required, from canonical vocabulary (OK, ERROR, CONFLICT, STALE, etc.)
 - `reason` -- machine-readable classifier
 - `next_action` -- canonical label from output contract
+- `failure_class` -- classified transport/tool/auth/rate-limit/timeout signal for Loop Health
 - `error: {code, message, recovery}` -- only when status is ERROR
 - domain-specific payload -- tool-owned fields (matches, outline, etc.)
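
Putting the canonical fields together, a structured error result might be assembled as in the sketch below. The field names and the `failure_class` and `next_action` vocabularies are taken from the hunks above; the helper `build_error`, the pairing of each failure class with a default `next_action`, and the position of `code` in the key order are assumptions made for illustration, not rules stated by either guide.

```python
from typing import Any, Dict, Optional

FAILURE_CLASSES = {
    "permission_denial", "tool_missing", "auth_missing", "rate_limited",
    "timeout_idle", "timeout_productive", "unknown",
}

# Assumed pairing; the guides list both vocabularies but do not map them.
DEFAULT_NEXT_ACTION = {
    "permission_denial": "fix_permissions",
    "tool_missing": "install_tool",
    "auth_missing": "authenticate",
    "rate_limited": "defer_retry",
    "timeout_idle": "retry_after_wait",
}


def build_error(code: str, message: str, recovery: str,
                failure_class: str = "unknown",
                next_action: Optional[str] = None) -> Dict[str, Any]:
    """Return an MCP tool result with decision fields ahead of detail."""
    if failure_class not in FAILURE_CLASSES:
        failure_class = "unknown"
    action = next_action or DEFAULT_NEXT_ACTION.get(failure_class)
    if action is None:
        raise ValueError("next_action is required for this failure_class")
    structured = {
        "status": "ERROR",
        "code": code,
        "failure_class": failure_class,
        "next_action": action,
        "summary": message,
        "recovery": recovery,
        "error": {"code": code, "message": message, "recovery": recovery},
    }
    # Errors set both the envelope flag and structuredContent.status.
    return {
        "isError": True,
        "content": [{"type": "text", "text": message}],
        "structuredContent": structured,
    }


result = build_error("EACCES", "Cannot write to target file",
                     "Grant write access, then retry the edit",
                     failure_class="permission_denial")
```

The server only classifies the failure here; whether another attempt is worthwhile remains a decision for the calling skill.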
 
 
@@ -14,7 +14,7 @@
 
 ## What it does
 
-Automates the full Agile delivery cycle. Coordinators advance only from machine-readable artifacts, while task-plan, execution, quality, and test-planning workers keep their own runtime state and stay standalone-capable. Integrates with Linear or works standalone with markdown files.
+Automates the full Agile delivery cycle. Coordinators advance only from machine-readable artifacts, while task-plan, execution, quality, and test-planning workers keep their own runtime state and stay standalone-capable. Loop Health prevents same-error and no-progress retry storms without changing lifecycle statuses. Integrates with Linear or works standalone with markdown files.
 
 ## Skills
 
@@ -56,7 +56,7 @@ ln-200 (scope) -> ln-300 (tasks) -> ln-310 (validate)
     -> ln-500 (quality gate)
 ```
 
-`ln-220`, `ln-300`, `ln-400`, `ln-510`, and `ln-520` keep coordinator runtime state and checkpoint child worker runs for resume. `ln-221/222`, `ln-301/302`, `ln-401..404`, `ln-511..514`, and `ln-521..523` remain standalone-capable workers with their own run-scoped state and summaries. `ln-1000` advances only from coordinator stage artifacts, while `ln-310` runs registry-configured external-agent review before execution begins.
+`ln-220`, `ln-300`, `ln-400`, `ln-510`, and `ln-520` keep coordinator runtime state and checkpoint child worker runs for resume. `ln-400` and `ln-1000` also record `loop_health` when an attempt repeats a task, worker, stage, error, or validation segment. `ln-221/222`, `ln-301/302`, `ln-401..404`, `ln-511..514`, and `ln-521..523` remain standalone-capable workers with their own run-scoped state and summaries. `ln-1000` advances only from coordinator stage artifacts, while `ln-310` runs registry-configured external-agent review before execution begins.
 
 ## Quick start