|
| 1 | +--- |
| 2 | +title: "feat: Add PR reviewability analysis and scope monitoring" |
| 3 | +type: feat |
| 4 | +status: active |
| 5 | +date: 2026-03-17 |
| 6 | +--- |
| 7 | + |
| 8 | +# Add PR Reviewability Analysis and Scope Monitoring |
| 9 | + |
| 10 | +Add a new review agent and scope monitoring to catch oversized PRs before they become unreviewable -- and prevent them from growing too large during `/ce:work`. |
| 11 | + |
| 12 | +Motivated by [PR #536](https://github.com/EveryInc/proof/pull/536): 172 files changed, ~50K lines, closed because it introduced 3 independent regressions across unrelated concerns that nobody caught in review. |
| 13 | + |
| 14 | +## Acceptance Criteria |
| 15 | + |
| 16 | +### New Agent: `pr-reviewability-analyst` |
| 17 | + |
| 18 | +- [x] Create `plugins/compound-engineering/agents/review/pr-reviewability-analyst.md` |
| 19 | +- [x] Agent analyzes PR diff for: effective line count, file count, directory spread, and concern count |
| 20 | +- [x] Weighted counting: test files at 50%, generated/lock files excluded, deletions at 10% |
| 21 | +- [x] Default thresholds: >400 effective lines, >7 files, or >2 distinct concerns triggers a flag |
| 22 | +- [x] When flagged: propose concrete split into 2-5 stacked PRs by clustering files by concern |
| 23 | +- [x] Each proposed sub-PR includes: one-sentence description, file list, estimated line count |
| 24 | +- [x] Output: reviewability score (0-100) + pass/fail + split proposal (if failing) |
| 25 | +- [x] Agent description stays under 250 chars for context budget |
| 26 | + |
| 27 | +### Integration into `/ce:review` |
| 28 | + |
| 29 | +- [x] Add reviewability check as a **pre-flight step** in `ce-review/SKILL.md` between Step 1 (setup) and the parallel agent dispatch |
| 30 | +- [x] Runs as a hardcoded system check, not part of the user-configured `review_agents` list |
| 31 | +- [x] If score < 60: present split recommendation as a P2 finding, then continue running all other review agents (do NOT gate/block) |
| 32 | +- [x] If score >= 60: proceed normally with no extra output |
| 33 | +- [x] Split recommendation appears in the findings synthesis alongside other agent results |
| 34 | + |
| 35 | +### Scope Monitor in `/ce:work` |
| 36 | + |
| 37 | +- [x] After each task completion in Phase 2 execution loop, measure total branch divergence: `git diff $(git merge-base HEAD <default_branch>)..HEAD --stat` |
| 38 | +- [x] Soft warning at 300 effective lines or 5 files: "You're at X lines across Y files. Consider checkpointing." |
| 39 | +- [x] Hard warning at 500 effective lines or 10 files: strongly recommend splitting, suggest split point based on completed vs remaining tasks |
| 40 | +- [x] After user declines a hard warning: downgrade to a single reminder after next 100 additional lines (no warning fatigue) |
| 41 | +- [x] Offer: create checkpoint commit + push + start new stacked branch for remaining tasks |
| 42 | +- [x] Scope monitor disabled in swarm mode (add note for team lead to check periodically) |
| 43 | + |
| 44 | +### Configuration |
| 45 | + |
| 46 | +- [x] Add `reviewability` section to `compound-engineering.local.md` YAML frontmatter schema (documented in ce-review SKILL.md Step 1.5 and agent defaults): |
| 47 | + ```yaml |
| 48 | + reviewability: |
| 49 | + line_threshold: 400 # effective lines to trigger review flag |
| 50 | + file_threshold: 7 # files changed to trigger review flag |
| 51 | + concern_threshold: 2 # distinct concerns to trigger review flag |
| 52 | + work_soft_threshold: 300 # soft warning during ce:work |
| 53 | + work_hard_threshold: 500 # hard warning during ce:work |
| 54 | + exclude_patterns: # files excluded from line counts |
| 55 | + - "*.lock" |
| 56 | + - "db/schema.rb" |
| 57 | + - "db/structure.sql" |
| 58 | + test_weight: 0.5 # multiplier for test file lines |
| 59 | + deletion_weight: 0.1 # multiplier for deleted lines |
| 60 | + ``` |
| 61 | +- [ ] Setup skill updated to include reviewability defaults when creating new settings files (follow-up) |
| 62 | +
|
| 63 | +### Counting Rules |
| 64 | +
|
| 65 | +- [x] **Effective lines** = (added lines x 1.0) + (deleted lines x 0.1) + (test lines x 0.5) |
| 66 | +- [x] Exclude files matching `exclude_patterns` from line counts entirely |
| 67 | +- [x] Include excluded files in the file list but not in thresholds |
| 68 | +- [x] Detect test files by path convention: `test/`, `spec/`, `__tests__/`, `*.test.*`, `*.spec.*` |
| 69 | +- [x] Detect concerns by clustering changed files into groups by: top-level directory, then by semantic similarity (model+migration = 1 concern, controller+route = 1 concern) |
| 70 | + |
| 71 | +### Documentation |
| 72 | + |
| 73 | +- [x] Update `plugins/compound-engineering/README.md` with new agent |
| 74 | +- [x] Update plugin.json description with new agent count |
| 75 | +- [x] Update marketplace.json description with new agent count |
| 76 | +- [ ] Run `/release-docs` after changes (deferred — docs site rebuild) |
| 77 | + |
| 78 | +## Context |
| 79 | + |
| 80 | +### Why This Matters |
| 81 | + |
| 82 | +PR #536 was a well-intentioned refactor ("remove macOS app surface, make Proof web-only") that grew to 172 files and ~50K lines. It was closed because: |
| 83 | + |
| 84 | +1. Removed web editor agent entry points (`keybindings.ts`, `editor/index.ts`, `context-menu.ts`) with no replacement |
| 85 | +2. Removed `@proof` comment/reply trigger plumbing that stopped existing agent workflows |
| 86 | +3. Regressed a hosted setup/help route by falling back to wrong skill |
| 87 | + |
| 88 | +All three regressions were in **unrelated areas** -- classic symptom of a PR that mixes too many concerns. No reviewer could reasonably catch all three in a single pass through 172 files. |
| 89 | + |
| 90 | +### Industry Research |
| 91 | + |
| 92 | +- **SmartBear/Cisco**: Reviews of 200-400 LOC find 70-90% of defects. Beyond 400, detection drops sharply. |
| 93 | +- **Graphite (1.5M PRs)**: Beyond 3 files, merge time more than doubles. |
| 94 | +- **Google eng-practices**: 100 lines ideal, 1000 too large. Stacking is recommended. |
| 95 | +- **GitLab**: 500 changes is the hard threshold requiring justification. |
| 96 | + |
| 97 | +### Design Decisions |
| 98 | + |
| 99 | +1. **Non-blocking**: The reviewability check is P2 (informational), not P1 (blocks merge). Teams should adopt gradually. |
| 100 | +2. **No automated splitting in v1**: Agent outputs a structured split proposal. Automated branch creation/cherry-picking is a v2 enhancement. |
| 101 | +3. **Hardcoded pre-flight**: Not part of user-configurable agent list -- always runs, like a linter. Users configure thresholds, not whether it runs. |
| 102 | +4. **Measure branch divergence in ce:work**: Total diff from default branch (not just uncommitted changes), so incremental commits don't reset the counter. |
| 103 | + |
| 104 | +## MVP |
| 105 | + |
| 106 | +### pr-reviewability-analyst.md |
| 107 | + |
| 108 | +```markdown |
| 109 | +--- |
| 110 | +name: pr-reviewability-analyst |
| 111 | +description: Analyzes PR size, concern count, and file spread. Proposes stacked PR splits when thresholds are exceeded. Run automatically as pre-flight check during /ce:review. |
| 112 | +subagent_type: general-purpose |
| 113 | +--- |
| 114 | +
|
| 115 | +# PR Reviewability Analyst |
| 116 | +
|
| 117 | +You are a PR reviewability analyst. Your job is to evaluate whether a pull request is reviewable as-is or should be split into smaller, independently shippable PRs. |
| 118 | +
|
| 119 | +## Input |
| 120 | +
|
| 121 | +You will receive: |
| 122 | +- PR metadata (title, description, files changed, additions, deletions) |
| 123 | +- The full diff or file list |
| 124 | +
|
| 125 | +## Analysis Steps |
| 126 | +
|
| 127 | +### 1. Calculate Effective Size |
| 128 | +
|
| 129 | +Measure the PR using weighted counting: |
| 130 | +
|
| 131 | +- **Added lines**: weight 1.0 |
| 132 | +- **Deleted lines**: weight 0.1 (deletions are easy to review) |
| 133 | +- **Test file lines**: weight 0.5 (tests accompany implementation) |
| 134 | +- **Excluded files**: skip entirely (lock files, generated schemas) |
| 135 | +
|
| 136 | +Detect test files by path: `test/`, `spec/`, `__tests__/`, files matching `*.test.*` or `*.spec.*` |
| 137 | + |
| 138 | +### 2. Count Distinct Concerns |
| 139 | + |
| 140 | +Cluster changed files by concern: |
| 141 | +- Group by top-level directory first |
| 142 | +- Then merge groups that are semantically one concern: |
| 143 | + - `db/migrate/` + `app/models/` + model tests = "Schema/Model change" |
| 144 | + - `app/controllers/` + `config/routes.rb` + controller tests = "API/Route change" |
| 145 | + - `app/views/` + `app/javascript/` + view tests = "UI change" |
| 146 | + - `config/` + CI files = "Configuration change" |
| 147 | +- Count the resulting groups |
| 148 | + |
| 149 | +A single concern touching multiple directories is 1 concern, not many. |
| 150 | + |
| 151 | +### 3. Score Reviewability |
| 152 | + |
| 153 | +Start at 100, subtract: |
| 154 | +- Lines over threshold: -2 per 50 lines over (max -30) |
| 155 | +- Files over threshold: -5 per file over (max -30) |
| 156 | +- Concerns over threshold: -15 per concern over (max -30) |
| 157 | +- No tests for new code: -10 |
| 158 | + |
| 159 | +Score interpretation: |
| 160 | +- 80-100: Excellent reviewability |
| 161 | +- 60-79: Acceptable, proceed with review |
| 162 | +- 40-59: Poor, recommend splitting |
| 163 | +- 0-39: Unreviewable, strongly recommend splitting |
| 164 | + |
| 165 | +### 4. Generate Split Proposal (if score < 60) |
| 166 | + |
| 167 | +For each identified concern cluster: |
| 168 | +1. List the files in that cluster |
| 169 | +2. Write a one-sentence PR description |
| 170 | +3. Estimate effective line count |
| 171 | +4. Identify dependencies on other clusters |
| 172 | +5. Order the stack (independent PRs first, dependent PRs later) |
| 173 | + |
| 174 | +Format: |
| 175 | +``` |
| 176 | +## Reviewability Score: [SCORE]/100 — [PASS/FAIL] |
| 177 | +
|
| 178 | +### Metrics |
| 179 | +- Effective lines: X (threshold: Y) |
| 180 | +- Files changed: X (threshold: Y) |
| 181 | +- Distinct concerns: X (threshold: Y) |
| 182 | +- Test coverage: [present/missing for new code] |
| 183 | +
|
| 184 | +### [If FAIL] Proposed Split |
| 185 | +
|
| 186 | +**Stack Order:** |
| 187 | +
|
| 188 | +PR 1/N: "[one-sentence description]" (base: main) |
| 189 | + Files: [list] |
| 190 | + ~X effective lines, Y files |
| 191 | +
|
| 192 | +PR 2/N: "[one-sentence description]" (base: PR 1) |
| 193 | + Files: [list] |
| 194 | + ~X effective lines, Y files |
| 195 | + Depends on: PR 1 (needs [specific thing]) |
| 196 | +
|
| 197 | +... |
| 198 | +``` |
| 199 | + |
| 200 | +### 5. If PASS |
| 201 | + |
| 202 | +Output only the score summary. No split proposal needed. |
| 203 | + |
| 204 | +## Important |
| 205 | + |
| 206 | +- A PR can be large in lines but still reviewable if it is a single concern (e.g., mechanical rename across 50 files) |
| 207 | +- Weight your judgment by concern count more than raw line count |
| 208 | +- Deletion-heavy PRs (removing dead code) are generally more reviewable than addition-heavy PRs |
| 209 | +- Generated code (migrations, lock files) should be noted but not counted |
| 210 | +``` |
| 211 | +
|
| 212 | +### ce-review SKILL.md integration point |
| 213 | +
|
| 214 | +Insert after Step 1 (Determine Review Target & Setup), before the parallel agent dispatch: |
| 215 | +
|
| 216 | +```markdown |
| 217 | +### 1.5. Pre-Flight Reviewability Check (Always Runs) |
| 218 | + |
| 219 | +Before dispatching review agents, run a quick reviewability analysis: |
| 220 | + |
| 221 | +- Task compound-engineering:review:pr-reviewability-analyst(PR metadata + file list + diff stats) |
| 222 | + |
| 223 | +**Read thresholds from `compound-engineering.local.md`** frontmatter `reviewability` section. If not configured, use defaults: 400 lines, 7 files, 2 concerns. |
| 224 | + |
| 225 | +**If score >= 60**: Log "Reviewability: [SCORE]/100 — PASS" and proceed to agent dispatch. |
| 226 | + |
| 227 | +**If score < 60**: |
| 228 | +- Log the full reviewability report (score, metrics, split proposal) |
| 229 | +- Add the split recommendation as a P2 finding during synthesis (Step 5) |
| 230 | +- **Continue running all review agents** — do not gate or block |
| 231 | +- In the summary report, add a "Reviewability" section before the findings |
| 232 | +``` |
| 233 | +
|
| 234 | +### ce-work SKILL.md integration point |
| 235 | +
|
| 236 | +Insert into Phase 2 Task Execution Loop, after "Mark task as completed" and before "Evaluate for incremental commit": |
| 237 | +
|
| 238 | +```markdown |
| 239 | + **Scope Check** — After completing each task, measure total branch size: |
| 240 | + |
| 241 | + ```bash |
| 242 | + default_branch=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@') |
| 243 | + [ -z "$default_branch" ] && default_branch=$(git rev-parse --verify origin/main >/dev/null 2>&1 && echo "main" || echo "master") |
| 244 | + merge_base=$(git merge-base HEAD origin/$default_branch) |
| 245 | +
|
| 246 | + # Get stats excluding generated files |
| 247 | + stats=$(git diff $merge_base..HEAD --stat --stat-width=999 | tail -1) |
| 248 | + # Parse: "X files changed, Y insertions(+), Z deletions(-)" |
| 249 | + ``` |
| 250 | + |
| 251 | + Read thresholds from `compound-engineering.local.md` frontmatter `reviewability` section. Defaults: soft=300, hard=500. |
| 252 | + |
| 253 | + | Effective Lines | Files | Action | |
| 254 | + |----------------|-------|--------| |
| 255 | + | < soft threshold | < 5 | Continue silently | |
| 256 | + | >= soft threshold | >= 5 | Warn: "Scope check: ~X effective lines across Y files. Consider checkpointing before continuing." | |
| 257 | + | >= hard threshold | >= 10 | Strongly recommend: "This branch has grown to ~X lines across Y files. Recommend creating a checkpoint PR now." Offer: (1) Create checkpoint commit + push + new stacked branch, (2) Continue anyway | |
| 258 | + |
| 259 | + After user declines a hard warning, suppress until 100+ additional effective lines accumulate. |
| 260 | + |
| 261 | + In swarm mode: skip scope checks (team lead should monitor manually). |
| 262 | +``` |
| 263 | +
|
| 264 | +## Sources |
| 265 | +
|
| 266 | +- PR #536: https://github.com/EveryInc/proof/pull/536 |
| 267 | +- SmartBear/Cisco study on code review effectiveness |
| 268 | +- Graphite research on 1.5M PRs: file count strongly correlates with merge time |
| 269 | +- Google eng-practices: https://google.github.io/eng-practices/review/developer/small-cls.html |
| 270 | +- GitLab contribution guide: 500-change hard threshold |
| 271 | +- Martin Fowler: Feature Toggles, Branch by Abstraction |
| 272 | +- Existing plugin: `plugins/compound-engineering/skills/ce-review/SKILL.md` |
| 273 | +- Existing plugin: `plugins/compound-engineering/skills/ce-work/SKILL.md` |
0 commit comments