fix: harden simplification safety and add review deletion signal (#969)

tmchow · web-flow · commit 06e2eb00bc1b · 2026-06-21T02:24:58.000-07:00
diff --git a/docs/skills/ce-simplify-code.md b/docs/skills/ce-simplify-code.md
@@ -16,7 +16,7 @@ The compound-engineering ideation chain is `/ce-ideate → /ce-brainstorm → /c
 |----------|--------|
 | What does it do? | Spawns three parallel reviewer agents on the recently-changed code, applies their findings, and verifies behavior is preserved |
 | When to use it | Before opening a PR; after writing a feature; after AI generated code that works but feels heavy |
-| What it produces | Updated code (in place) + a summary of what was changed, what was good as-is, and which checks ran |
+| What it produces | Updated code (in place) + a summary of what was changed, what was good as-is, which checks ran, and a quantified impact by dimension (fixes applied per reuse/quality/efficiency, skipped count, verification result) |
 | What's next | Open the PR via `/ce-commit-push-pr` |
 
 ---
@@ -52,7 +52,7 @@ The orchestrator aggregates their findings, applies fixes, and runs typecheck +
 
 A single "review and improve" prompt collapses into the agent's most-trained directions. Three reviewers each focused on one dimension cover meaningfully more ground:
 
-- **Reuse** — searches for existing utilities and helpers; flags new functions that duplicate existing ones; flags inline logic that could use an existing utility
+- **Reuse** — searches for existing utilities and helpers; flags new functions that duplicate existing ones; flags inline logic that could use an existing utility; flags diff code that reimplements a language standard-library or runtime primitive (gated on behavior-equivalence, excluding UX-changing swaps)
 - **Quality** — redundant state, parameter sprawl, copy-paste with variation, leaky abstractions, stringly-typed code, unnecessary wrappers (in component-tree UI frameworks), deeply nested conditionals, unnecessary comments, dead code / unused imports / unused exports
 - **Efficiency** — unnecessary work (redundant computations, repeat reads), missed concurrency, hot-path bloat, recurring no-op updates, TOCTOU pre-checks, memory issues, overly broad operations
 
@@ -62,7 +62,7 @@ The skill resolves the simplification scope in priority order: explicit user-nam
 
 ### 3. Behavior preservation verification
 
-After applying fixes, the skill runs typecheck and lint over the project and runs tests scoped to the changed paths (broadening when the change has wide reach — e.g., a heavily-imported utility was rewritten). Failures are surfaced clearly with the failing check name and relevant output. **The skill refuses to relax assertions, weaken type signatures, or skip tests to make checks pass** — either fix the underlying break or revert the specific simplification that caused it.
+After applying fixes, the skill runs typecheck and lint over the project and runs tests scoped to the changed paths (broadening when the change has wide reach — e.g., a heavily-imported utility was rewritten). Failures are surfaced clearly with the failing check name and relevant output. **The skill refuses to relax assertions, weaken type signatures, or skip tests to make checks pass** — either fix the underlying break or revert the specific simplification that caused it. It also **never simplifies away a safety check** — input validation at trust boundaries, data-loss-preventing error handling, security checks, and accessibility affordances are preserved even when a finding frames them as removable boilerplate.
 
 ### 4. Mid-tier model selection — cost-aware
 
diff --git a/plugins/compound-engineering/skills/ce-code-review/SKILL.md b/plugins/compound-engineering/skills/ce-code-review/SKILL.md
@@ -550,7 +550,7 @@ Per-severity tables are **5 columns** — `Route` is not shown here (it appears
 7. **Learnings & Past Solutions.** Surface ce-learnings-researcher results: if past solutions are relevant, flag them as "Known Pattern" with links to docs/solutions/ files.
 8. **Agent-Native Gaps.** Surface ce-agent-native-reviewer results. Omit section if no gaps found.
 9. **Deployment Notes.** If ce-deployment-verification-agent ran, surface the key Go/No-Go items: blocking pre-deploy checks, the most important verification queries, rollback caveats, and monitoring focus areas. Keep the checklist actionable rather than dropping it into Coverage. Schema drift appears in the findings tables as `data-migration` P1 rows — do not add a separate Schema Drift section.
-10. **Coverage.** Applied count (when Stage 5c ran), suppressed count by anchor (e.g., "N findings suppressed at anchor 50, M at anchor 25"), mode-aware demotion count, validator drop count and reasons (when Stage 5b ran), any P0/P1 with degraded validation (kept on validator infra failure), validator over-budget drops (when the 15-cap fired), residual risks, testing gaps, failed/timed-out reviewers, and inferred-intent uncertainty when applicable.
+10. **Coverage.** Applied count (when Stage 5c ran), suppressed count by anchor (e.g., "N findings suppressed at anchor 50, M at anchor 25"), mode-aware demotion count, validator drop count and reasons (when Stage 5b ran), any P0/P1 with degraded validation (kept on validator infra failure), validator over-budget drops (when the 15-cap fired), residual risks, testing gaps, failed/timed-out reviewers, and inferred-intent uncertainty when applicable. **Removable surface (only when deletion-oriented maintainability findings exist):** one line giving the approximate net lines/files those findings would remove if applied (e.g., "Removable surface: ~120 lines / 2 files across findings #4, #7"). This is a dead-weight signal, **not** a reduction target — never lower the bar for a finding or invent deletions to grow the number, and omit the line entirely when no finding proposes a deletion.
 11. **Verdict.** Ready to merge / Ready with fixes / Not ready. Fix order if applicable. When an `explicit` plan has unaddressed requirements or implementation units, the verdict must reflect it — a PR that's code-clean but missing planned requirements is "Not ready" unless the omission is intentional. When an `inferred` plan has unaddressed requirements or implementation units, note it in the verdict reasoning but do not block on it alone.
 
 Do not include time estimates.
diff --git a/plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md b/plugins/compound-engineering/skills/ce-code-review/references/review-output-template.md
@@ -95,6 +95,7 @@ Committed: `fix(review): cover empty-format branch + tighten export perms` (work
 ### Coverage
 
 - Suppressed: 2 findings below anchor 75 (1 at anchor 50, 1 at anchor 25)
+- Removable surface: ~40 lines / 1 file across findings #5 (signal only, not a target)
 - Residual risks: No rate limiting on export endpoint
 - Testing gaps: No test for concurrent export requests
 
@@ -148,7 +149,7 @@ This fails because: no pipe-delimited tables, no severity-grouped `###` headers,
 - **Learnings & Past Solutions section** -- results from ce-learnings-researcher, with links to docs/solutions/ files
 - **Agent-Native Gaps section** -- results from ce-agent-native-reviewer. Omit if no gaps found.
 - **Deployment Notes section** -- key checklist items from ce-deployment-verification-agent. Omit if the agent did not run. Schema drift surfaces as `data-migration` findings — no separate section.
-- **Coverage section** -- suppressed count, residual risks, testing gaps, failed reviewers
+- **Coverage section** -- suppressed count, removable surface (only when deletion-oriented maintainability findings exist; approximate net lines/files removable if applied -- a dead-weight signal, never a reduction target, omit otherwise), residual risks, testing gaps, failed reviewers
 - **Summary uses blockquotes** for verdict, reasoning, and fix order
 - **Horizontal rule** (`---`) separates findings from verdict
 - **`###` headers** for each section -- never plain text headers
diff --git a/plugins/compound-engineering/skills/ce-simplify-code/SKILL.md b/plugins/compound-engineering/skills/ce-simplify-code/SKILL.md
@@ -33,6 +33,7 @@ For each change:
 1. **Search for existing utilities and helpers** that could replace newly written code. Look for similar patterns elsewhere in the codebase — common locations are utility directories, shared modules, and files adjacent to the changed ones.
 2. **Flag any new function that duplicates existing functionality.** Suggest the existing function to use instead.
 3. **Flag any inline logic that could use an existing utility** — hand-rolled string manipulation, manual path handling, custom environment checks, ad-hoc type guards, and similar patterns are common candidates.
+4. **Flag diff code that reimplements a language standard-library or runtime primitive** — a hand-written routine the built-in stdlib/runtime API already provides (e.g., a manual array-dedup loop where the language ships a set-based idiom, a hand-rolled deep-clone/deep-merge where the runtime has one). Suggest the built-in **only when it is behavior-equivalent** for the inputs actually in play. Do not propose swaps that change behavior or UX: native UI controls (e.g., a custom date picker to `<input type=date>`), locale/`Intl`-dependent formatting, sort-stability assumptions, and serialization edge cases differ from their hand-rolled versions and are out of scope for a behavior-preserving pass.
 
 ### Agent 2: Code Quality Reviewer
 
@@ -68,6 +69,8 @@ Wait for all three agents to complete. Aggregate their findings and fix each iss
 
 Before applying each fix, confirm it preserves behavior: same output for every input, same error behavior, and same side effects and ordering. If a fix can't clear that test, skip it — automated checks in Step 4 don't cover every behavior.
 
+**Never simplify away a safety check.** Input validation at trust boundaries, error handling that prevents data loss, security checks (authorization, escaping, sanitization), and accessibility affordances are not removable boilerplate — preserve them even when a finding frames them as redundant or inline-able. Code that drops one of these is not simpler, it is unfinished. If a proposed simplification would thin or remove one, skip it.
+
 ## Step 4: Verify behavior is preserved
 
 The premise of this skill is that simplification preserves exact functionality. After applying fixes:
@@ -86,3 +89,5 @@ If no test suite, lint, or typecheck is configured, state that explicitly in the
 ## Step 5: Summarize
 
 Briefly summarize what was good vs improved and fixed, including which checks were run and their results. If there were no findings to act on, confirm the code didn't require any changes.
+
+**Quantify the impact by dimension.** Report what was actually applied, not a line count: fixes applied per reviewer dimension (reuse, quality, efficiency), how many findings were skipped as false-positive or not worth addressing, and the behavior-preservation result (checks run and outcome). For example: "Applied 6 — reuse 2, quality 3, efficiency 1; skipped 2 false positives; typecheck + lint clean, 11 scoped tests pass." Do not headline a net-lines-removed figure or frame fewer lines as the win — many clarity, safety, and efficiency fixes preserve or add lines. The measure is what improved and that behavior held, not how much code shrank.