Skip to content

Commit f1c2cf2

Browse files
kieranklaassenclaude
authored andcommitted
feat: add PR reviewability analysis and scope monitoring
Add pr-reviewability-analyst agent that scores PRs on size, concern count, and file spread, proposing stacked PR splits when thresholds are exceeded. Integrate as a pre-flight check in /ce:review (Step 1.5) and add scope monitoring to /ce:work's task execution loop with soft/hard warnings and checkpoint flow. Motivated by EveryInc/proof#536 — a 172-file PR closed because it mixed too many concerns and introduced 3 independent regressions. 🤖 Generated with Claude Opus 4.6 via [Claude Code](https://claude.com/claude-code) + Compound Engineering v2.41.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4aa50e1 commit f1c2cf2

5 files changed

Lines changed: 500 additions & 2 deletions

File tree

Lines changed: 273 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,273 @@
1+
---
2+
title: "feat: Add PR reviewability analysis and scope monitoring"
3+
type: feat
4+
status: active
5+
date: 2026-03-17
6+
---
7+
8+
# Add PR Reviewability Analysis and Scope Monitoring
9+
10+
Add a new review agent and scope monitoring to catch oversized PRs before they become unreviewable -- and prevent them from growing too large during `/ce:work`.
11+
12+
Motivated by [PR #536](https://github.com/EveryInc/proof/pull/536): 172 files changed, ~50K lines, closed because it introduced 3 independent regressions across unrelated concerns that nobody caught in review.
13+
14+
## Acceptance Criteria
15+
16+
### New Agent: `pr-reviewability-analyst`
17+
18+
- [x] Create `plugins/compound-engineering/agents/review/pr-reviewability-analyst.md`
19+
- [x] Agent analyzes PR diff for: effective line count, file count, directory spread, and concern count
20+
- [x] Weighted counting: test files at 50%, generated/lock files excluded, deletions at 10%
21+
- [x] Default thresholds: >400 effective lines, >7 files, or >2 distinct concerns triggers a flag
22+
- [x] When flagged: propose concrete split into 2-5 stacked PRs by clustering files by concern
23+
- [x] Each proposed sub-PR includes: one-sentence description, file list, estimated line count
24+
- [x] Output: reviewability score (0-100) + pass/fail + split proposal (if failing)
25+
- [x] Agent description stays under 250 chars for context budget
26+
27+
### Integration into `/ce:review`
28+
29+
- [x] Add reviewability check as a **pre-flight step** in `ce-review/SKILL.md` between Step 1 (setup) and the parallel agent dispatch
30+
- [x] Runs as a hardcoded system check, not part of the user-configured `review_agents` list
31+
- [x] If score < 60: present split recommendation as a P2 finding, then continue running all other review agents (do NOT gate/block)
32+
- [x] If score >= 60: proceed normally with no extra output
33+
- [x] Split recommendation appears in the findings synthesis alongside other agent results
34+
35+
### Scope Monitor in `/ce:work`
36+
37+
- [x] After each task completion in Phase 2 execution loop, measure total branch divergence: `git diff $(git merge-base HEAD <default_branch>)..HEAD --stat`
38+
- [x] Soft warning at 300 effective lines or 5 files: "You're at X lines across Y files. Consider checkpointing."
39+
- [x] Hard warning at 500 effective lines or 10 files: strongly recommend splitting, suggest split point based on completed vs remaining tasks
40+
- [x] After user declines a hard warning: downgrade to a single reminder after next 100 additional lines (no warning fatigue)
41+
- [x] Offer: create checkpoint commit + push + start new stacked branch for remaining tasks
42+
- [x] Scope monitor disabled in swarm mode (add note for team lead to check periodically)
43+
44+
### Configuration
45+
46+
- [x] Add `reviewability` section to `compound-engineering.local.md` YAML frontmatter schema (documented in ce-review SKILL.md Step 1.5 and agent defaults):
47+
```yaml
48+
reviewability:
49+
line_threshold: 400 # effective lines to trigger review flag
50+
file_threshold: 7 # files changed to trigger review flag
51+
concern_threshold: 2 # distinct concerns to trigger review flag
52+
work_soft_threshold: 300 # soft warning during ce:work
53+
work_hard_threshold: 500 # hard warning during ce:work
54+
exclude_patterns: # files excluded from line counts
55+
- "*.lock"
56+
- "db/schema.rb"
57+
- "db/structure.sql"
58+
test_weight: 0.5 # multiplier for test file lines
59+
deletion_weight: 0.1 # multiplier for deleted lines
60+
```
61+
- [ ] Setup skill updated to include reviewability defaults when creating new settings files (follow-up)
62+
63+
### Counting Rules
64+
65+
- [x] **Effective lines** = (added lines x 1.0) + (deleted lines x 0.1) + (test lines x 0.5)
66+
- [x] Exclude files matching `exclude_patterns` from line counts entirely
67+
- [x] Include excluded files in the file list but not in thresholds
68+
- [x] Detect test files by path convention: `test/`, `spec/`, `__tests__/`, `*.test.*`, `*.spec.*`
69+
- [x] Detect concerns by clustering changed files into groups by: top-level directory, then by semantic similarity (model+migration = 1 concern, controller+route = 1 concern)
70+
71+
### Documentation
72+
73+
- [x] Update `plugins/compound-engineering/README.md` with new agent
74+
- [x] Update plugin.json description with new agent count
75+
- [x] Update marketplace.json description with new agent count
76+
- [ ] Run `/release-docs` after changes (deferred — docs site rebuild)
77+
78+
## Context
79+
80+
### Why This Matters
81+
82+
PR #536 was a well-intentioned refactor ("remove macOS app surface, make Proof web-only") that grew to 172 files and ~50K lines. It was closed because:
83+
84+
1. Removed web editor agent entry points (`keybindings.ts`, `editor/index.ts`, `context-menu.ts`) with no replacement
85+
2. Removed `@proof` comment/reply trigger plumbing that stopped existing agent workflows
86+
3. Regressed a hosted setup/help route by falling back to wrong skill
87+
88+
All three regressions were in **unrelated areas** -- classic symptom of a PR that mixes too many concerns. No reviewer could reasonably catch all three in a single pass through 172 files.
89+
90+
### Industry Research
91+
92+
- **SmartBear/Cisco**: Reviews of 200-400 LOC find 70-90% of defects. Beyond 400, detection drops sharply.
93+
- **Graphite (1.5M PRs)**: Beyond 3 files, merge time more than doubles.
94+
- **Google eng-practices**: 100 lines ideal, 1000 too large. Stacking is recommended.
95+
- **GitLab**: 500 changes is the hard threshold requiring justification.
96+
97+
### Design Decisions
98+
99+
1. **Non-blocking**: The reviewability check is P2 (informational), not P1 (blocks merge). Teams should adopt gradually.
100+
2. **No automated splitting in v1**: Agent outputs a structured split proposal. Automated branch creation/cherry-picking is a v2 enhancement.
101+
3. **Hardcoded pre-flight**: Not part of user-configurable agent list -- always runs, like a linter. Users configure thresholds, not whether it runs.
102+
4. **Measure branch divergence in ce:work**: Total diff from default branch (not just uncommitted changes), so incremental commits don't reset the counter.
103+
104+
## MVP
105+
106+
### pr-reviewability-analyst.md
107+
108+
```markdown
109+
---
110+
name: pr-reviewability-analyst
111+
description: Analyzes PR size, concern count, and file spread. Proposes stacked PR splits when thresholds are exceeded. Run automatically as pre-flight check during /ce:review.
112+
subagent_type: general-purpose
113+
---
114+
115+
# PR Reviewability Analyst
116+
117+
You are a PR reviewability analyst. Your job is to evaluate whether a pull request is reviewable as-is or should be split into smaller, independently shippable PRs.
118+
119+
## Input
120+
121+
You will receive:
122+
- PR metadata (title, description, files changed, additions, deletions)
123+
- The full diff or file list
124+
125+
## Analysis Steps
126+
127+
### 1. Calculate Effective Size
128+
129+
Measure the PR using weighted counting:
130+
131+
- **Added lines**: weight 1.0
132+
- **Deleted lines**: weight 0.1 (deletions are easy to review)
133+
- **Test file lines**: weight 0.5 (tests accompany implementation)
134+
- **Excluded files**: skip entirely (lock files, generated schemas)
135+
136+
Detect test files by path: `test/`, `spec/`, `__tests__/`, files matching `*.test.*` or `*.spec.*`
137+
138+
### 2. Count Distinct Concerns
139+
140+
Cluster changed files by concern:
141+
- Group by top-level directory first
142+
- Then merge groups that are semantically one concern:
143+
- `db/migrate/` + `app/models/` + model tests = "Schema/Model change"
144+
- `app/controllers/` + `config/routes.rb` + controller tests = "API/Route change"
145+
- `app/views/` + `app/javascript/` + view tests = "UI change"
146+
- `config/` + CI files = "Configuration change"
147+
- Count the resulting groups
148+
149+
A single concern touching multiple directories is 1 concern, not many.
150+
151+
### 3. Score Reviewability
152+
153+
Start at 100, subtract:
154+
- Lines over threshold: -2 per 50 lines over (max -30)
155+
- Files over threshold: -5 per file over (max -30)
156+
- Concerns over threshold: -15 per concern over (max -30)
157+
- No tests for new code: -10
158+
159+
Score interpretation:
160+
- 80-100: Excellent reviewability
161+
- 60-79: Acceptable, proceed with review
162+
- 40-59: Poor, recommend splitting
163+
- 0-39: Unreviewable, strongly recommend splitting
164+
165+
### 4. Generate Split Proposal (if score < 60)
166+
167+
For each identified concern cluster:
168+
1. List the files in that cluster
169+
2. Write a one-sentence PR description
170+
3. Estimate effective line count
171+
4. Identify dependencies on other clusters
172+
5. Order the stack (independent PRs first, dependent PRs later)
173+
174+
Format:
175+
```
176+
## Reviewability Score: [SCORE]/100 — [PASS/FAIL]
177+
178+
### Metrics
179+
- Effective lines: X (threshold: Y)
180+
- Files changed: X (threshold: Y)
181+
- Distinct concerns: X (threshold: Y)
182+
- Test coverage: [present/missing for new code]
183+
184+
### [If FAIL] Proposed Split
185+
186+
**Stack Order:**
187+
188+
PR 1/N: "[one-sentence description]" (base: main)
189+
Files: [list]
190+
~X effective lines, Y files
191+
192+
PR 2/N: "[one-sentence description]" (base: PR 1)
193+
Files: [list]
194+
~X effective lines, Y files
195+
Depends on: PR 1 (needs [specific thing])
196+
197+
...
198+
```
199+
200+
### 5. If PASS
201+
202+
Output only the score summary. No split proposal needed.
203+
204+
## Important
205+
206+
- A PR can be large in lines but still reviewable if it is a single concern (e.g., mechanical rename across 50 files)
207+
- Weight your judgment by concern count more than raw line count
208+
- Deletion-heavy PRs (removing dead code) are generally more reviewable than addition-heavy PRs
209+
- Generated code (migrations, lock files) should be noted but not counted
210+
```
211+
212+
### ce-review SKILL.md integration point
213+
214+
Insert after Step 1 (Determine Review Target & Setup), before the parallel agent dispatch:
215+
216+
```markdown
217+
### 1.5. Pre-Flight Reviewability Check (Always Runs)
218+
219+
Before dispatching review agents, run a quick reviewability analysis:
220+
221+
- Task compound-engineering:review:pr-reviewability-analyst(PR metadata + file list + diff stats)
222+
223+
**Read thresholds from `compound-engineering.local.md`** frontmatter `reviewability` section. If not configured, use defaults: 400 lines, 7 files, 2 concerns.
224+
225+
**If score >= 60**: Log "Reviewability: [SCORE]/100 — PASS" and proceed to agent dispatch.
226+
227+
**If score < 60**:
228+
- Log the full reviewability report (score, metrics, split proposal)
229+
- Add the split recommendation as a P2 finding during synthesis (Step 5)
230+
- **Continue running all review agents** — do not gate or block
231+
- In the summary report, add a "Reviewability" section before the findings
232+
```
233+
234+
### ce-work SKILL.md integration point
235+
236+
Insert into Phase 2 Task Execution Loop, after "Mark task as completed" and before "Evaluate for incremental commit":
237+
238+
```markdown
239+
**Scope Check** — After completing each task, measure total branch size:
240+
241+
```bash
242+
default_branch=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's@^refs/remotes/origin/@@')
243+
[ -z "$default_branch" ] && default_branch=$(git rev-parse --verify origin/main >/dev/null 2>&1 && echo "main" || echo "master")
244+
merge_base=$(git merge-base HEAD origin/$default_branch)
245+
246+
# Get stats excluding generated files
247+
stats=$(git diff $merge_base..HEAD --stat --stat-width=999 | tail -1)
248+
# Parse: "X files changed, Y insertions(+), Z deletions(-)"
249+
```
250+
251+
Read thresholds from `compound-engineering.local.md` frontmatter `reviewability` section. Defaults: soft=300, hard=500.
252+
253+
| Effective Lines | Files | Action |
254+
|----------------|-------|--------|
255+
| < soft threshold | < 5 | Continue silently |
256+
| >= soft threshold | >= 5 | Warn: "Scope check: ~X effective lines across Y files. Consider checkpointing before continuing." |
257+
| >= hard threshold | >= 10 | Strongly recommend: "This branch has grown to ~X lines across Y files. Recommend creating a checkpoint PR now." Offer: (1) Create checkpoint commit + push + new stacked branch, (2) Continue anyway |
258+
259+
After user declines a hard warning, suppress until 100+ additional effective lines accumulate.
260+
261+
In swarm mode: skip scope checks (team lead should monitor manually).
262+
```
263+
264+
## Sources
265+
266+
- PR #536: https://github.com/EveryInc/proof/pull/536
267+
- SmartBear/Cisco study on code review effectiveness
268+
- Graphite research on 1.5M PRs: file count strongly correlates with merge time
269+
- Google eng-practices: https://google.github.io/eng-practices/review/developer/small-cls.html
270+
- GitLab contribution guide: 500-change hard threshold
271+
- Martin Fowler: Feature Toggles, Branch by Abstraction
272+
- Existing plugin: `plugins/compound-engineering/skills/ce-review/SKILL.md`
273+
- Existing plugin: `plugins/compound-engineering/skills/ce-work/SKILL.md`

plugins/compound-engineering/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ Agents are organized into categories for easier discovery.
3131
| `kieran-typescript-reviewer` | TypeScript code review with strict conventions |
3232
| `pattern-recognition-specialist` | Analyze code for patterns and anti-patterns |
3333
| `performance-oracle` | Performance analysis and optimization |
34+
| `pr-reviewability-analyst` | Analyze PR size, concern count, and file spread; propose splits |
3435
| `schema-drift-detector` | Detect unrelated schema.rb changes in PRs |
3536
| `security-sentinel` | Security audits and vulnerability assessments |
3637

0 commit comments

Comments
 (0)