| status | in_progress |
|---|---|
| domain | ai-tooling |
| created | 2026-03-20 |
| last_updated | 2026-03-22 |
| owner | fruch |
Objective: Mine 7 years of SCT PR review comments (~6,500 PRs with reviews out of ~11,900 total), classify them into a data-driven taxonomy, produce a trend report with charts, and use the findings to design the code-review skill and identify what other skills are needed.
Deliverables:
- Data collection script (throttled, resumable, cached)
- Classification engine (regex + keyword, ~75% accuracy target)
- Trend report with per-quarter charts per issue category
- Updated code-review skill with data-backed priority ordering
- Repeatable pipeline runnable quarterly
Estimated Effort: Large Critical Path: Collect → Classify → Analyze → Report → Skill updates
Go over all SCT PR review comments across the full project history, classify them into common issues, design changes to the review skill to address the most common ones, document the process for repeatability, and suggest additional skills to improve the review process.
A pilot of ~45 PRs (Aug 2024 – Mar 2026, ~600+ comments) identified 13 categories:
| # | Category | Pilot Count | Trajectory |
|---|---|---|---|
| 1 | Test Quality (pytest, fixtures, parametrize) | 38 | PERSISTENT ⬆ |
| 2 | AI-Generated Code Quality | 28 | EMERGED Jan 2026 ⬆ |
| 3 | Missing Documentation | 19 | PERSISTENT → |
| 4 | Plan Review Issues | 18 | EMERGED Feb 2026 ⬇ |
| 5 | Code Structure / Simplification | 16 | PERSISTENT → |
| 6 | Unused Imports (automated by CodeQL) | 13 | AUTOMATED ✓ |
| 7 | Scope / PR Hygiene | 12 | PERSISTENT → |
| 8 | Error Handling | 9 | INTERMITTENT |
| 9 | Missing Unit Tests | 8 | PERSISTENT → |
| 10 | Consistency / Unification | 8 | PERSISTENT → |
| 11 | Configuration System | 8 | PERSISTENT → |
| 12 | Dead / Commented-Out Code | 7 | DECLINING ⬇ |
| 13 | Naming / Abstraction | 5 | PERSISTENT → |
| Year | Total PRs | Est. PRs w/ comments | Notes |
|---|---|---|---|
| 2019 | 757 | ~430 | First substantial year |
| 2020 | 1,245 | ~650 | Growth |
| 2021 | 1,022 | ~550 | Stable |
| 2022 | 1,034 | ~900 | Maturing |
| 2023 | 1,109 | ~950 | Pre-AI era |
| 2024 | 2,177 | ~940 | AI adoption begins |
| 2025 | 2,974 | ~1,624 | Heavy AI PRs |
| 2026 | 989 (Q1) | ~500 | AI review infra live |
| Total | ~11,900 | ~6,500 |
- GitHub API rate limit: 5,000 requests/hour (authenticated via
gh) - Each PR needs ~2 API calls (comments + reviews)
- Full 6,500-PR scan ≈ ~13,000 calls
- At gentle pace (1 req/sec): ~3.6 hours wall time
- At very gentle pace (1 req/2sec): ~7.2 hours wall time
- Results MUST be cached so re-runs cost zero for already-fetched PRs
Produce a quantitative, data-driven taxonomy of PR review issues across 7 years, with temporal trend analysis and actionable skill design recommendations.
scripts/review_analysis/collect.py— Throttled data collectorscripts/review_analysis/classify.py— Comment classifierscripts/review_analysis/analyze.py— Trend aggregatorscripts/review_analysis/report.py— Report + chart generatorscripts/review_analysis/taxonomy.yaml— Category definitions + regex patternsscripts/review_analysis/run.sh— One-command pipelinescripts/review_analysis/README.md— Process documentation- Report output (markdown + charts)
- Throttled API collection (max 1 request/second, configurable)
- Resume capability (re-running skips already-cached PRs)
- Local JSON cache per PR
- At least 13 categories from pilot taxonomy
- Per-quarter trend data for each category
- Trend charts (Mermaid or matplotlib SVG)
- Reviewer specialization matrix
- AI vs human author comparison
- No real-time dashboard (batch reports only)
- No database dependency (flat JSON files)
- No changes to GitHub workflows or CI
- No external service dependencies beyond
ghCLI
- Infrastructure exists: YES (pytest)
- Automated tests: YES (tests-after)
- Framework: pytest
- Unit tests for classifier regex patterns
- Integration test on 50 known-classified PRs from pilot
- Manual accuracy check on 30 random comments per quarter-bin
Wave 1 (Foundation — can all start immediately):
├── Task 1: Taxonomy YAML definition [quick]
├── Task 2: Data collection script skeleton [deep]
├── Task 3: Classification regex patterns [unspecified-high]
Wave 2 (After Wave 1 — core pipeline):
├── Task 4: Collection script full implementation (depends: 2) [deep]
├── Task 5: Classifier implementation (depends: 1, 3) [deep]
├── Task 6: Unit tests for classifier (depends: 3) [quick]
Wave 3 (After Wave 2 — analysis + reporting):
├── Task 7: Trend aggregation script (depends: 5) [unspecified-high]
├── Task 8: Report generator with charts (depends: 7) [unspecified-high]
├── Task 9: Pipeline wrapper script + README (depends: 4, 8) [quick]
Wave 4 (After data is collected — skill design):
├── Task 10: Run full collection (depends: 4) [NOTE: long-running, ~4-7 hours]
├── Task 11: Run classification + analysis (depends: 5, 7, 10) [quick]
├── Task 12: Generate report (depends: 8, 11) [quick]
├── Task 13: Skill design recommendations document (depends: 12) [writing]
Wave FINAL (Verification):
├── F1: Verify report accuracy against pilot data [unspecified-high]
├── F2: Verify pipeline re-run uses cache [quick]
├── F3: Review report completeness [deep]
-
1. Define taxonomy YAML with categories and regex patterns
What to do:
- Create
scripts/review_analysis/taxonomy.yamldefining all 13 categories - Each category: id, name, description, regex patterns (list), keywords (list)
- Include the 13 categories from pilot + a T0_UNCATEGORIZED catch-all
- Patterns derived from actual review comments observed in pilot
References:
.sisyphus/drafts/pr-review-taxonomy.md— Full pilot analysis with examples per category- Classification rubric table in that draft (detection keywords per category)
Acceptance Criteria:
- YAML file parses without error
- All 13 categories defined with ≥ 2 regex patterns each
- Patterns tested against 10 known examples from pilot
- Create
-
2. Build throttled, resumable data collection script
What to do:
- Create
scripts/review_analysis/collect.py - Uses
subprocessto callgh api(avoids Python GitHub library dependency) - Throttling: Configurable delay between API calls, default 1.5 seconds
- Resume: Check cache dir for existing
{pr_number}.json, skip if present - Pagination: Handle GitHub API pagination for PR listing (100 per page)
- Two-pass collection:
- Pass 1: List all PRs (paginate through all pages), save PR metadata
- Pass 2: For each PR, fetch comments + reviews, save to
{pr_number}.json
- Progress reporting: Print progress every 100 PRs
- Error handling: On API error, wait 60s and retry (max 3 retries)
- Rate limit awareness: Check
X-RateLimit-Remainingheader, pause if < 100
CLI interface:
python collect.py \ --repo scylladb/scylla-cluster-tests \ --since 2019-01-01 \ --cache-dir ./cache/raw \ --delay 1.5 \ --resumeData schema per cached file (
cache/raw/{pr_number}.json):{ "pr_number": 14118, "created_at": "2026-03-18T...", "merged_at": "2026-03-18T...", "state": "merged", "author": "roydahan", "title": "feat(tags): add JenkinsJob tag...", "labels": ["backport/2026.1"], "is_ai_author": false, "comments": [...], "reviews": [...] }AI author detection: Flag PRs where author is "Copilot", "claude[bot]", or has
ai-assistedlabel.Must NOT do:
- Do NOT make more than 1 request per
--delayseconds - Do NOT skip the cache check (must be resumable)
- Do NOT store GitHub tokens in the script
Acceptance Criteria:
-
--delay 1.5results in ≤ 1 request per 1.5 seconds (verified by timing 20 requests) - Re-running with
--resumeon a populated cache makes 0 API calls for cached PRs - Handles API 403 rate limit response by waiting and retrying
- Progress output shows
[423/6500] PR#12345 — cachedor[423/6500] PR#12345 — fetched (3 comments, 1 review)
- Create
-
3. Build comment classifier
What to do:
- Create
scripts/review_analysis/classify.py - Loads taxonomy from
taxonomy.yaml - For each comment in each cached PR JSON:
- Run all category regex patterns against comment body
- Assign matching categories (can be multi-label)
- If no match → T0_UNCATEGORIZED
- Bot detection: Auto-tag comments from known bots:
github-code-quality[bot]→ T9 (Unused/Dead Code)copilot-pull-request-reviewer[bot]→ classify normally
- Output: Augmented JSON with
categoriesfield per comment - Stats output: Print classification summary (count per category, uncategorized %)
CLI interface:
python classify.py \ --input ./cache/raw \ --output ./cache/classified \ --taxonomy taxonomy.yamlAcceptance Criteria:
- All cached PRs processed
- Each comment has
categories: [...]field (list of category IDs) - Uncategorized comments < 30% of total (excluding bot comments)
- Classification summary printed to stdout
- Create
-
4. Build trend aggregation script
What to do:
- Create
scripts/review_analysis/analyze.py - Reads classified JSONs
- Produces aggregations:
- Per-quarter counts by category (Q1-2019 through Q1-2026)
- Per-reviewer-per-category matrix
- Human vs AI author breakdown
- Rolling 4-quarter moving average per category
- Inflection point detection: Quarter where count first exceeded 2× prior 4-quarter average
- Output:
report_data.jsonwith all aggregations
CLI interface:
python analyze.py \ --input ./cache/classified \ --output ./cache/report_data.jsonAcceptance Criteria:
-
report_data.jsoncontains all 5 aggregation types - Quarter bins cover 2019-Q1 through current quarter
- At least 10 reviewers appear in reviewer matrix
- Inflection points identified for ≥ 3 categories
- Create
-
5. Build report generator with trend charts
What to do:
- Create
scripts/review_analysis/report.py - Reads
report_data.json - Generates markdown report with:
- Executive summary: Top 3 persistent, conquered, emerging issues
- Overall volume chart: Mermaid xychart of total comments per quarter
- Per-category trend charts: One Mermaid chart per Tier-1 category
- AI impact chart: Stacked bar of human vs AI PR issues
- Category deep dives: Description, count, trend, representative examples
- Reviewer specialization table
- Recommendations: Prioritized list for skill design
Chart format: Mermaid (renders natively in GitHub markdown). Example:
```mermaid xychart-beta title "Test Quality Issues per Quarter" x-axis [Q1-19, Q2-19, ..., Q1-26] y-axis "Occurrences" 0 --> 50 bar [3, 5, 4, 7, ...]Acceptance Criteria:
- Markdown report generated
- At least 3 Mermaid trend charts render correctly
- Executive summary present with top 3 per trajectory type
- Reviewer table has ≥ 5 reviewers with their top categories
- Recommendations section actionable
- Create
-
6. Unit tests for classifier patterns
What to do:
- Create
unit_tests/test_review_classifier.py - Parametrized tests: for each category, test 3+ known-matching comments and 2+ non-matching
- Use actual comment text from pilot as test fixtures
- Test multi-label assignment (comment matching 2 categories)
- Test bot detection
References:
.sisyphus/drafts/pr-review-taxonomy.md— Real comment examples per category
Acceptance Criteria:
- ≥ 39 test cases (3 per 13 categories)
- All tests pass
- Covers multi-label and bot detection edge cases
- Create
-
7. Pipeline wrapper and README
What to do:
- Create
scripts/review_analysis/run.sh— Single command to run full pipeline - Create
scripts/review_analysis/README.mddocumenting:- Purpose and process overview
- Prerequisites (
ghCLI authenticated) - How to run (
./run.sh) - How to re-run quarterly (just re-run, cache handles dedup)
- How to add new taxonomy categories
- How to interpret the report
- Estimated runtime (~4-7 hours for first run, ~30 min for updates)
Acceptance Criteria:
-
run.shexecutes all 4 phases sequentially - README covers all sections listed above
- Re-running uses cache correctly
- Create
-
8. Run full data collection (LONG-RUNNING) — IN PROGRESS in tmux:pr-review-collection
What to do:
- Execute
collect.pywith--delay 2.0for gentle pace - Target: all PRs from 2019-01-01 to present with review comments
- Expected: ~6,500 PRs, ~13,000 API calls, ~7 hours at 2s delay
- Can be run in background (
nohuportmux) - Monitor progress via log output
Must NOT do:
- Do NOT use delay < 1.5 seconds
- Do NOT run during peak working hours if concerned about rate limits
Acceptance Criteria:
- ≥ 5,000 PR JSONs cached
- No API 403 errors in log
- Collection completed without manual intervention
- Execute
-
9. Run classification + analysis + report generation
What to do:
- Run
classify.pyon collected data - Run
analyze.pyto produce aggregations - Run
report.pyto generate final report - Review report for accuracy against pilot findings
- Verify pilot-era data (2024-2026) matches pilot counts within ±20%
Acceptance Criteria:
- Report generated with all required sections
- Pilot-era category counts within ±20% of pilot findings
- At least 2 conquered issues identified
- Mermaid charts render correctly
- Run
-
10. Write skill design recommendations
What to do:
- Based on the report, write a recommendations document covering:
- Code-review skill — priority-ordered checklist based on 7-year data
- AI-code-audit skill — yes/no based on T3 occurrence count (threshold: 50+)
- Plan-review skill — yes/no based on T8 occurrence count (threshold: 30+)
- New categories discovered — any T0 clusters that became new categories
- Automated check opportunities — issues that could be caught by linters/bots
- Save to
.sisyphus/drafts/skill-design-recommendations.md
Acceptance Criteria:
- Recommendations backed by specific occurrence counts
- Priority ordering matches data
- Clear yes/no for each proposed skill with threshold justification
- Based on the report, write a recommendations document covering:
- Commit 1:
feature(review-analysis): add taxonomy definition and collection script- Files:
scripts/review_analysis/taxonomy.yaml,scripts/review_analysis/collect.py
- Files:
- Commit 2:
feature(review-analysis): add classifier and unit tests- Files:
scripts/review_analysis/classify.py,unit_tests/test_review_classifier.py
- Files:
- Commit 3:
feature(review-analysis): add analysis, report generator, and pipeline- Files:
scripts/review_analysis/analyze.py,scripts/review_analysis/report.py,scripts/review_analysis/run.sh,scripts/review_analysis/README.md
- Files:
- Commit 4:
docs(review-analysis): add generated report and skill recommendations- Files: report output, recommendations
# Run classifier unit tests
uv run python -m pytest unit_tests/test_review_classifier.py -v
# Verify cache has data
ls cache/raw/*.json | wc -l # Expected: ≥ 5000
# Verify report renders
# Open report markdown in GitHub or VS Code preview- ≥ 5,000 PRs collected and cached
- ≥ 70% of comments classified (not T0)
- Report has per-quarter trends for all 13 categories
- At least 3 trend charts render correctly
- Pipeline re-runnable via
run.sh - Process documented in README.md
- Skill recommendations backed by data