| name | copilot-review-analyst |
|---|---|
| description | Analyze GitHub Copilot code review effectiveness across Android Auth repositories. Collects all Copilot inline review comments via GitHub API, classifies them as helpful/not-helpful/unresolved through reply analysis, diff verification, and AI-assisted classification, then generates a report with per-repo and per-engineer statistics. Use this skill when asked to "analyze Copilot reviews", "measure Copilot review effectiveness", "generate Copilot review report", "how helpful are Copilot reviews", "run review analysis", or any request to measure/report on GitHub Copilot code review quality and adoption. |
Analyze GitHub Copilot code review effectiveness across the Android Auth repositories by collecting all inline review comments, classifying each one, and producing a comprehensive report.

Prerequisites:
- GitHub CLI (`gh`) authenticated with an EMU account (`*_microsoft`) that has access to all of the repos listed below (public and private). The script auto-detects the EMU account from `gh auth status`.
- The `analyze.ps1` script automatically switches to the EMU account at startup and restores the original account on completion, so no manual auth switching is needed.
- Output directory: `~/.copilot-review-analysis/` for final artifacts, `$env:TEMP\copilot-review-analysis\` for intermediate data.

Default repos (update in scripts if changed):

| Label | Slug | Auth |
|---|---|---|
| common | `AzureAD/microsoft-authentication-library-common-for-android` | EMU (also accessible via personal) |
| msal | `AzureAD/microsoft-authentication-library-for-android` | EMU (also accessible via personal) |
| broker | `identity-authnz-teams/ad-accounts-for-android` | EMU only |
| authenticator | `AzureAD/microsoft-authenticator-for-android` | EMU (also accessible via personal) |

The analysis runs in 5 sequential phases. Scripts and templates are bundled in this skill:
- Scripts: `scripts/` (3 core pipeline scripts)
- Assets: `assets/` (report templates: Markdown, HTML, Outlook HTML)
- References: `references/` (classification rules, report formatting guide)

Phase 1 script: `scripts/analyze.ps1`

Collect all Copilot inline review comments from human-authored PRs:

    # Default: last 60 days
    .\.github\skills\copilot-review-analyst\scripts\analyze.ps1

    # Custom date range:
    .\.github\skills\copilot-review-analyst\scripts\analyze.ps1 -StartDate "2026-01-23"

Parameters:
- `-StartDate`: Start date for PR search (default: 60 days ago). Format: `YYYY-MM-DD`
- `-OutputDir`: Output directory (default: `$env:TEMP\copilot-review-analysis`)

What it does:
- Fetch all PRs created after `-StartDate` via `gh pr list`
- Filter out bot-authored PRs (Copilot, copilot-swe-agent, dependabot, github-actions)
- For each PR, call `repos/{slug}/pulls/{prNum}/comments` to get inline comments
- Filter to Copilot comments (`user.login = "Copilot"`) that are top-level (not replies)
- Find human replies to each Copilot comment (matched via `in_reply_to_id`)
- Record whether each comment has a reply and capture the reply text (no classification at this stage)
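
The filtering and reply-matching steps above can be sketched on in-memory data. This is a minimal illustration under stated assumptions, not the logic of `analyze.ps1` itself: the sample objects are hypothetical and carry only the `id`, `in_reply_to_id`, `user.login`, and `body` fields, and `CommentId` is an assumed property name (`HasReply`, `CommentBody`, and `HumanReplyText` are the names used elsewhere in this document).

```powershell
# Hypothetical sample shaped like a pulls/{prNum}/comments response (only the fields used here).
$comments = @(
    [pscustomobject]@{ id = 101; in_reply_to_id = $null; user = [pscustomobject]@{ login = 'Copilot' }; body = 'Possible null dereference.' }
    [pscustomobject]@{ id = 102; in_reply_to_id = 101;   user = [pscustomobject]@{ login = 'alice' };   body = 'Good catch, fixed.' }
    [pscustomobject]@{ id = 103; in_reply_to_id = $null; user = [pscustomobject]@{ login = 'Copilot' }; body = 'Consider renaming this variable.' }
    [pscustomobject]@{ id = 104; in_reply_to_id = $null; user = [pscustomobject]@{ login = 'bob' };     body = 'Unrelated human comment.' }
)

# Top-level Copilot comments: authored by "Copilot" and not a reply (in_reply_to_id is null or 0).
$topLevel = $comments | Where-Object { $_.user.login -eq 'Copilot' -and -not $_.in_reply_to_id }

$results = foreach ($c in $topLevel) {
    # Human replies are comments that point back at this one via in_reply_to_id.
    $replies = @($comments | Where-Object { $_.in_reply_to_id -eq $c.id -and $_.user.login -ne 'Copilot' })
    [pscustomobject]@{
        CommentId      = $c.id
        CommentBody    = $c.body
        HasReply       = $replies.Count -gt 0
        HumanReplyText = ($replies.body -join "`n")
    }
}

$results | Format-Table CommentId, HasReply, HumanReplyText
```

No classification happens here; the sketch only records whether a reply exists and what it says.
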
Outputs:
- `$env:TEMP\copilot-review-analysis\raw_results.json`: all comments with `HasReply` flag and raw reply text
- `$env:TEMP\copilot-review-analysis\review_summaries.json`: PR-level summary comments (for reference)

Phase 2 script: `scripts/precise.ps1`

For every comment with no reply, verify whether the engineer silently acted on the feedback:

    .\.github\skills\copilot-review-analyst\scripts\precise.ps1

What it does:
- Load `raw_results.json`, filter to comments where `HasReply = false`
- For each comment, get the commit SHA it was left on and the PR head SHA
- Use `repos/{slug}/compare/{commitA}...{commitB}` to get the diff
- For suggestion blocks: extract code tokens, check if they appear as `+` lines in the diff
- For prose comments: check if diff hunk line ranges overlap the comment's line range (±5 line tolerance)
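
The ±5 line tolerance check for prose comments can be sketched as a plain interval-overlap test. This is an illustration only; `Test-LineOverlap` and its parameter names are hypothetical and not taken from `precise.ps1`.

```powershell
function Test-LineOverlap {
    param(
        [int]$CommentStart,
        [int]$CommentEnd,
        [int]$HunkStart,
        [int]$HunkEnd,
        [int]$Tolerance = 5
    )
    # Widen the comment's range by the tolerance, then apply the standard interval-overlap test.
    $low  = $CommentStart - $Tolerance
    $high = $CommentEnd + $Tolerance
    return ($HunkStart -le $high) -and ($HunkEnd -ge $low)
}

# Hunk starts 4 lines after the comment ends: within tolerance.
Test-LineOverlap -CommentStart 40 -CommentEnd 42 -HunkStart 46 -HunkEnd 50   # True

# Hunk starts 18 lines after the comment ends: outside tolerance.
Test-LineOverlap -CommentStart 40 -CommentEnd 42 -HunkStart 60 -HunkEnd 70   # False
```
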
Verdicts assigned:
- `suggestion-applied`: suggestion tokens match the diff and lines overlap
- `suggestion-likely-applied`: tokens match but lines don't overlap
- `exact-lines-modified`: prose comment's lines were modified
- `lines-modified-different-fix`: nearby lines modified, different code
- `file-changed-elsewhere`: file modified but at different lines
- `file-changed-no-line-info`: file modified but comment had no line number
- `file-not-changed`: file untouched after the comment
- `no-subsequent-commits`: PR merged without any commits after the review

Output: `$env:TEMP\copilot-review-analysis\precise.json`

This phase (Phase 3) is performed by the agent (you) in conversation. Classify every replied comment by reading the full Copilot comment and engineer reply in context.

See `references/classification-rules.md` for detailed guidance on what counts as helpful vs not-helpful.

Process:
- Load `raw_results.json` and filter to `HasReply -eq true`
- For each comment, read the `CommentBody` (Copilot's feedback) and `HumanReplyText` (engineer's reply)
- Classify as `helpful` or `not-helpful` based on the engineer's intent (read the full reply, don't keyword-match)
- Also review `file-changed-elsewhere` and `file-changed-no-line-info` verdicts from Phase 2 to identify re-audit flips

Outputs (write to `$env:TEMP\copilot-review-analysis\`):
- `reply-verdicts.json`: `{ "commentId": "helpful"|"not-helpful", ... }` for every replied comment
- `reaudit-flips.json`: `{ "reauditFlipKeys": ["repo/prNum/filePattern", ...] }` for no-reply comments with strong evidence

See `references/manual-audit-template.json` for the schema.
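
As a rough sketch, the two output files could be written from PowerShell as follows. It assumes only the two shapes listed above; the bundled template remains the authoritative schema. The comment IDs, the flip key, and the scratch directory are hypothetical values chosen for illustration.

```powershell
# Scratch directory for this example only; the skill itself writes to $env:TEMP\copilot-review-analysis.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) 'copilot-review-analysis-example'
New-Item -ItemType Directory -Path $outDir -Force | Out-Null

# Hypothetical comment IDs mapped to the verdict for each replied comment.
$replyVerdicts = [ordered]@{
    '2001' = 'helpful'
    '2002' = 'not-helpful'
}

# Hypothetical re-audit flip key in the repo/prNum/filePattern form.
$reauditFlips = @{ reauditFlipKeys = @('common/123/TokenCache') }

$replyVerdicts | ConvertTo-Json | Set-Content -Path (Join-Path $outDir 'reply-verdicts.json') -Encoding UTF8
$reauditFlips  | ConvertTo-Json | Set-Content -Path (Join-Path $outDir 'reaudit-flips.json')  -Encoding UTF8

# Read both files back to confirm they parse.
$checkVerdicts = Get-Content -Raw (Join-Path $outDir 'reply-verdicts.json') | ConvertFrom-Json
$checkFlips    = Get-Content -Raw (Join-Path $outDir 'reaudit-flips.json')  | ConvertFrom-Json
```
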

Script: `scripts/final-classification.ps1`

Merge all results into a single authoritative dataset:

    .\.github\skills\copilot-review-analyst\scripts\final-classification.ps1 `
        -AccountMapFile ".github\skills\copilot-review-analyst\references\account-map.json" `
        -ReplyVerdictsFile "$env:TEMP\copilot-review-analysis\reply-verdicts.json" `
        -ReauditFlipsFile "$env:TEMP\copilot-review-analysis\reaudit-flips.json"

Parameters:
- `-OutputDir`: Directory with `raw_results.json` and `precise.json` (default: `$env:TEMP\copilot-review-analysis`)
- `-AccountMapFile`: Path to JSON mapping GitHub logins to display names. See `references/account-map.json`.
- `-ReplyVerdictsFile`: Path to JSON with Phase 3 AI verdicts for replied comments. If omitted, replied comments are "unknown".
- `-ReauditFlipsFile`: Path to JSON with Phase 3 re-audit flips for no-reply comments. If omitted, `file-changed-elsewhere` defaults to "not-helpful".

What it does:
- Load `raw_results.json` and `precise.json`
- Load account mapping, reply verdicts (Phase 3), and re-audit flips (Phase 3)
- For replied comments: use the AI verdict from `reply-verdicts.json`
- For no-reply comments: use the Phase 2 diff verdict, with re-audit overrides
- Produce per-engineer and per-repo statistics
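
The precedence described above (AI verdict for replied comments, diff verdict with re-audit overrides otherwise) can be sketched as one function. This is not the code of `final-classification.ps1`. Two things are assumptions made for illustration: the set of Phase 2 verdicts that count as helpful, and that a re-audit flip turns a comment into `helpful`. The function name, the `Id`/`DiffVerdict`/`FlipKey` properties, and the sample keys are hypothetical.

```powershell
# Assumption: these Phase 2 verdicts are treated as evidence that the feedback was acted on.
$helpfulDiffVerdicts = @(
    'suggestion-applied', 'suggestion-likely-applied',
    'exact-lines-modified', 'lines-modified-different-fix'
)

function Resolve-FinalVerdict {
    param(
        $Comment,                         # hypothetical shape: Id, HasReply, DiffVerdict, FlipKey
        [hashtable]$ReplyVerdicts,        # comment ID -> 'helpful' | 'not-helpful'
        [string[]]$ReauditFlipKeys = @()  # keys flipped during re-audit (assumed to mean helpful)
    )
    if ($Comment.HasReply) {
        # Replied comments take the Phase 3 verdict, or "unknown" when none was supplied.
        $id = [string]$Comment.Id
        if ($ReplyVerdicts -and $ReplyVerdicts.ContainsKey($id)) { return $ReplyVerdicts[$id] }
        return 'unknown'
    }
    # No-reply comments: a re-audit flip overrides the Phase 2 diff verdict.
    if ($ReauditFlipKeys -contains $Comment.FlipKey) { return 'helpful' }
    if ($helpfulDiffVerdicts -contains $Comment.DiffVerdict) { return 'helpful' }
    return 'not-helpful'
}

$verdicts = @{ '101' = 'helpful' }
$flips    = @('common/42/AuthCache')
$samples  = @(
    [pscustomobject]@{ Id = 101; HasReply = $true;  DiffVerdict = $null;                    FlipKey = $null }
    [pscustomobject]@{ Id = 102; HasReply = $true;  DiffVerdict = $null;                    FlipKey = $null }
    [pscustomobject]@{ Id = 103; HasReply = $false; DiffVerdict = 'suggestion-applied';     FlipKey = 'msal/7/Foo' }
    [pscustomobject]@{ Id = 104; HasReply = $false; DiffVerdict = 'file-changed-elsewhere'; FlipKey = 'common/42/AuthCache' }
    [pscustomobject]@{ Id = 105; HasReply = $false; DiffVerdict = 'file-not-changed';       FlipKey = 'msal/9/Bar' }
)

$final = $samples | ForEach-Object {
    Resolve-FinalVerdict -Comment $_ -ReplyVerdicts $verdicts -ReauditFlipKeys $flips
}
$final   # helpful, unknown, helpful, helpful, not-helpful
```
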

Output: `$env:TEMP\copilot-review-analysis\final_classification.json`

Script: `scripts/append-history.ps1`

Append a snapshot of the current run to the persistent history file for trend tracking across runs:

    .\.github\skills\copilot-review-analyst\scripts\append-history.ps1 `
        -PeriodStart "2026-01-24" `
        -PeriodEnd "2026-03-25"

Parameters:
- `-PeriodStart`: First day of the analysis period (`YYYY-MM-DD`). This is the `-StartDate` that was passed to `analyze.ps1` in Phase 1. Required.
- `-PeriodEnd`: Last day of the analysis period (`YYYY-MM-DD`). Defaults to today.
- `-InputDir`: Directory with `final_classification.json` and `precise.json` (default: `$env:TEMP\copilot-review-analysis`)
- `-HistoryFile`: Path to persistent history JSON (default: `~/.copilot-review-analysis/history.json`)

Important: Periods are non-overlapping and variable-length. Each run covers a distinct period — e.g., the first run might cover 60 days, subsequent runs might cover 2 weeks each. The -PeriodStart for a new run should be the day after the previous run's -PeriodEnd. The script automatically deduplicates entries with the same period.
What it does:
- Load `final_classification.json` and `precise.json`
- Compute aggregate stats: response rate, three-way breakdown, per-repo, per-engineer
- Compute normalized volume: `comments/week` for fair comparison across different period lengths
- Load existing `history.json` (or create empty if first run)
- Append current snapshot; replace if same period already exists
- Save sorted by period (newest first)
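
The normalization and replace-if-same-period steps above might look like the following sketch. It is not the code of `append-history.ps1`. `Add-HistorySnapshot` is a hypothetical helper, the period length is assumed to be counted inclusively (first and last day both count), and only a few of the snapshot fields are shown.

```powershell
function Add-HistorySnapshot {
    param(
        [object[]]$History = @(),
        [datetime]$PeriodStart,
        [datetime]$PeriodEnd,
        [int]$Total
    )
    # Assumption: inclusive day count, so a 14-day window reports periodDays = 14.
    $periodDays = ($PeriodEnd - $PeriodStart).Days + 1
    $snapshot = [pscustomobject]@{
        periodStart     = $PeriodStart.ToString('yyyy-MM-dd')
        periodEnd       = $PeriodEnd.ToString('yyyy-MM-dd')
        periodDays      = $periodDays
        total           = $Total
        commentsPerWeek = [math]::Round([double]$Total / $periodDays * 7, 1)
    }
    # Drop any existing entry for the same period, append the new one, sort newest first.
    $kept = @($History | Where-Object {
        $_.periodStart -ne $snapshot.periodStart -or $_.periodEnd -ne $snapshot.periodEnd
    })
    return @($kept + $snapshot | Sort-Object periodStart -Descending)
}

# Hypothetical totals: 120 comments over a 14-day period normalizes to 60 comments/week.
$history = @()
$history = @(Add-HistorySnapshot -History $history -PeriodStart '2026-03-26' -PeriodEnd '2026-04-08' -Total 120)
$history | Format-Table periodStart, periodEnd, periodDays, total, commentsPerWeek
```

The call site wraps the result in `@(...)` because PowerShell unrolls a single-element array on return.
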

Output: `~/.copilot-review-analysis/history.json`

The history file is an array of snapshots. Each snapshot contains:
- Period metadata: `runDate`, `periodStart`, `periodEnd`, `periodDays`
- Volume: `total`, `commentsPerWeek`, `reviewedPRs`, `avgCommentsPerPR`
- Rates: `responseRate`, `helpful.pct`, `notHelpful.pct`, `unresolved.pct`, `repliedHelpfulRate`
- Per-repo breakdown: `repos.{broker,common,msal}.{comments, responseRate, helpfulPct, ...}`
- Per-engineer breakdown: `engineers.{name}.{comments, responseRate, helpfulPct}`

Generate both Markdown and Outlook-compatible HTML reports.
Style/structure references (in `assets/`; these contain data from the Jan-Mar 2026 analysis and serve as structural templates, NOT to be copied verbatim):
- `assets/Copilot-Code-Review-Effectiveness-Report.md`: Markdown reference (~270 lines, ~3300 words)
- `assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html`: Outlook HTML reference (~430 lines, ~42KB)
- `assets/Copilot-Code-Review-Effectiveness-Report.html`: Standard HTML reference

Important: The asset templates contain hardcoded numbers (557 comments, specific percentages, engineer names, etc.) from the first analysis. For each new run, generate fresh reports using the same section structure and formatting patterns but with statistics computed from final_classification.json.
You MUST read both the Markdown and Outlook HTML template files in full before generating any report. Do not generate from memory or from the section table in report-formatting.md alone — the templates contain critical patterns that are not captured in the section list:
- Narrative prose paragraphs between every table explaining the significance of the data (not just "here's a table")
- Callout boxes (blue for insights, yellow for warnings) that frame key findings
- Background section with team context and Copilot enablement framing
- "At a Glance" summary cards with a detailed adoption callout underneath
- Scope metric cards (Human PRs, PRs reviewed, Total comments, Avg per PR)
- Full Copilot comment text in examples (not truncated snippets)
- Explanatory paragraphs after examples describing why the comment was helpful/unhelpful
- Detailed Recommendations section with prose paragraphs (not bullet points)
- Bar chart visualizations (response rate bar, helpfulness verdict bar, per-repo bars) in HTML
After generating each report, compare its dimensions against the template:
- Markdown: Must be ≥250 lines and ≥3000 words. If significantly smaller, the report is missing narrative depth.
- Outlook HTML: Must be ≥400 lines and ≥35KB. If significantly smaller, the report is missing visual elements or prose.
If a report doesn't meet these thresholds, re-read the template and identify what's missing before saving.
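
A quick way to check a generated report against these thresholds is sketched below. `Test-ReportSize` is a hypothetical helper, not part of the bundled scripts. It counts words by splitting on whitespace and measures size as UTF-8 bytes, which are assumptions about how the thresholds are meant to be measured.

```powershell
function Test-ReportSize {
    param(
        [string]$Text,
        [int]$MinLines,
        [int]$MinWords = 0,
        [int]$MinBytes = 0
    )
    $lines = @($Text -split '\r?\n').Count
    $words = @($Text -split '\s+' | Where-Object { $_ }).Count
    $bytes = [System.Text.Encoding]::UTF8.GetByteCount($Text)
    [pscustomobject]@{
        Lines = $lines
        Words = $words
        Bytes = $bytes
        Pass  = ($lines -ge $MinLines) -and ($words -ge $MinWords) -and ($bytes -ge $MinBytes)
    }
}

# Tiny stand-in text; a real run would pass (Get-Content -Raw <report path>) instead.
$sample = "alpha beta`ngamma"
Test-ReportSize -Text $sample -MinLines 250 -MinWords 3000   # Markdown gate
Test-ReportSize -Text $sample -MinLines 400 -MinBytes 35KB   # Outlook HTML gate
```

Both calls report `Pass = False` for the stand-in text, which is the signal to re-read the template before saving.
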
Generate two versions of each report:
- Team-internal — uses real engineer names (from account map). For the team.
- Org-wide — anonymizes engineers as "Engineer A", "Engineer B", etc., sorted by helpfulness descending. For sharing outside the team.
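
The org-wide anonymization can be sketched as a sort followed by letter assignment. The engineer names and percentages below are hypothetical; real values would come from `final_classification.json`. The letter scheme shown only covers up to 26 engineers.

```powershell
# Hypothetical per-engineer stats.
$engineers = @(
    [pscustomobject]@{ Name = 'Alice'; HelpfulPct = 62.5 }
    [pscustomobject]@{ Name = 'Bob';   HelpfulPct = 80.0 }
    [pscustomobject]@{ Name = 'Chen';  HelpfulPct = 41.0 }
)

# Sort by helpfulness descending, then replace each name with "Engineer A", "Engineer B", ...
$sorted = @($engineers | Sort-Object HelpfulPct -Descending)
$anonymized = for ($i = 0; $i -lt $sorted.Count; $i++) {
    [pscustomobject]@{
        Alias      = 'Engineer ' + [char]([int][char]'A' + $i)
        HelpfulPct = $sorted[$i].HelpfulPct
    }
}

$anonymized | Format-Table Alias, HelpfulPct
```

The real name is dropped from the output objects entirely, so it cannot leak into the org-wide report by accident.
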
Process:
- Load `final_classification.json`
- Read full template files (`assets/Copilot-Code-Review-Effectiveness-Report.md` and `assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html`); do NOT skip this step
- Compute aggregate statistics (total, per-repo, per-engineer, three-way breakdown)
- Load `raw_results.json` to collect full comment text for 4-5 helpful and 4-5 unhelpful examples
- Load `~/.copilot-review-analysis/history.json` for trend data. If ≥2 entries exist, generate a Trend section comparing the current run with the previous run(s). See `references/report-formatting.md` for trend section formatting rules.
- Generate reports matching the template's section structure, narrative depth, and visual formatting
- Verify dimensions (word count, line count) against the quality gate thresholds
- Save to `~/.copilot-review-analysis/`:
  - `Copilot-Code-Review-Effectiveness-Report.md` (team, real names)
  - `Copilot-Code-Review-Effectiveness-Report-Anonymous.md` (org-wide)
  - `Copilot-Code-Review-Effectiveness-Report-Outlook.html` (team, real names)
  - `Copilot-Code-Review-Effectiveness-Report-Outlook-Anonymous.html` (org-wide)

See references/report-formatting.md for the report structure and Outlook HTML formatting rules.
Comments with no reply and no diff evidence are Unresolved, not assumed unhelpful. This is a critical distinction:
- Confirmed Helpful = positive evidence (explicit acknowledgment OR verified fix in diff)
- Confirmed Not Helpful = positive evidence (explicit dismissal with stated reason OR comment on stale code)
- Unresolved = insufficient evidence either way (engineer never engaged)
The final-classification.ps1 script classifies no-response/no-diff-evidence comments as "not-helpful" for conservative stats, but the report should present the three-way breakdown to be honest about uncertainty.
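
Once every comment carries one of the three labels, the breakdown for the report is a straightforward count and percentage. The sketch below uses eight hypothetical labels; how each comment receives its label is covered by the rules above and is not modeled here.

```powershell
# Hypothetical final labels for eight comments.
$labels = @(
    'helpful', 'helpful', 'helpful',
    'not-helpful', 'not-helpful',
    'unresolved', 'unresolved', 'unresolved'
)

$breakdown = foreach ($category in 'helpful', 'not-helpful', 'unresolved') {
    $count = @($labels | Where-Object { $_ -eq $category }).Count
    [pscustomobject]@{
        Category = $category
        Count    = $count
        Pct      = [math]::Round(100.0 * $count / $labels.Count, 1)
    }
}

$breakdown | Format-Table Category, Count, Pct
```

Keeping `unresolved` as its own row, instead of folding it into `not-helpful`, is what lets the report show how much of the total lacks evidence either way.
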
- Copilot inline review comments use `user.login = "Copilot"`
- Legacy bot: `copilot-pull-request-reviewer[bot]`
- Bot PR authors to exclude: `app/copilot-swe-agent`, `dependabot[bot]`, `github-actions[bot]`
- Only count top-level comments (`in_reply_to_id` is null/0), not Copilot's own replies

The scripts call the GitHub API heavily. Built-in mitigations:
- PR comment caching (fetched once per PR)
- Diff caching (fetched once per commit range)
- Sleep every 15 PRs (300ms) and every 25 diff checks (200ms)
- Use `--paginate` for repos with many comments

If hitting rate limits, increase sleep intervals or use a `GH_TOKEN` with higher rate limits.
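
The caching and periodic-sleep mitigations can be sketched together. This is an illustration, not the scripts' actual code. `Get-CachedComments`, the fake fetch block, and the `owner/repo` slug are hypothetical; in a real run the fetch block would wrap a `gh api --paginate` call against `repos/{slug}/pulls/{prNum}/comments`.

```powershell
$commentCache = @{}
$fetchCount   = @{ Calls = 0 }

function Get-CachedComments {
    param([string]$Slug, [int]$PrNumber, [scriptblock]$Fetch)
    # Fetch once per PR; later lookups for the same PR are served from the cache.
    $key = "$Slug/$PrNumber"
    if (-not $commentCache.ContainsKey($key)) {
        $commentCache[$key] = & $Fetch $Slug $PrNumber
    }
    return $commentCache[$key]
}

# Stand-in for the real API call so the example runs offline.
$fakeFetch = {
    param($slug, $pr)
    $fetchCount.Calls++
    "comments for $slug #$pr"
}

$processed = 0
foreach ($pr in 1, 2, 2, 3) {
    $null = Get-CachedComments -Slug 'owner/repo' -PrNumber $pr -Fetch $fakeFetch
    $processed++
    # Pause briefly every 15 PRs to stay under the rate limit.
    if ($processed % 15 -eq 0) { Start-Sleep -Milliseconds 300 }
}

"Fetched $($fetchCount.Calls) times for $processed lookups"
```

PR 2 is looked up twice but fetched once, so four lookups cost three fetches.
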