name	copilot-review-analyst
description	Analyze GitHub Copilot code review effectiveness across Android Auth repositories. Collects all Copilot inline review comments via GitHub API, classifies them as helpful/not-helpful/unresolved through reply analysis, diff verification, and AI-assisted classification, then generates a report with per-repo and per-engineer statistics. Use this skill when asked to "analyze Copilot reviews", "measure Copilot review effectiveness", "generate Copilot review report", "how helpful are Copilot reviews", "run review analysis", or any request to measure/report on GitHub Copilot code review quality and adoption.

Copilot Review Analyst

Analyze GitHub Copilot code review effectiveness across the Android Auth repositories by collecting all inline review comments, classifying each one, and producing a comprehensive report.

Prerequisites

GitHub CLI (gh) authenticated with an EMU account (*_microsoft) that has access to every repo listed under Repository Configuration (public and private). The script auto-detects the EMU account from gh auth status.
The analyze.ps1 script automatically switches to the EMU account at startup and restores the original account on completion, so no manual auth switching is needed.
Output directory: ~/.copilot-review-analysis/ for final artifacts, $env:TEMP\copilot-review-analysis\ for intermediate data

Repository Configuration

Default repos (update in scripts if changed):

Label	Slug	Auth
common	`AzureAD/microsoft-authentication-library-common-for-android`	EMU (also accessible via personal)
msal	`AzureAD/microsoft-authentication-library-for-android`	EMU (also accessible via personal)
broker	`identity-authnz-teams/ad-accounts-for-android`	EMU only
authenticator	`AzureAD/microsoft-authenticator-for-android`	EMU (also accessible via personal)

Analysis Pipeline

The analysis runs in 5 sequential phases, plus a history-append step (Phase 4.5) that runs between Phases 4 and 5. Scripts and templates are bundled in this skill:

Scripts: scripts/ (3 core pipeline scripts)
Assets: assets/ (report templates — Markdown, HTML, Outlook HTML)
References: references/ (classification rules, report formatting guide)

Phase 1: Data Collection

Script: scripts/analyze.ps1

Collect all Copilot inline review comments from human-authored PRs:

# Default: last 60 days
.\.github\skills\copilot-review-analyst\scripts\analyze.ps1

# Custom date range:
.\.github\skills\copilot-review-analyst\scripts\analyze.ps1 -StartDate "2026-01-23"

Parameters:

-StartDate — Start date for PR search (default: 60 days ago). Format: YYYY-MM-DD
-OutputDir — Output directory (default: $env:TEMP\copilot-review-analysis)

What it does:

Fetch all PRs created after -StartDate via gh pr list
Filter out bot-authored PRs (Copilot, copilot-swe-agent, dependabot, github-actions)
For each PR, call repos/{slug}/pulls/{prNum}/comments to get inline comments
Filter to Copilot comments (user.login = "Copilot") that are top-level (not replies)
Find human replies to each Copilot comment (matched via in_reply_to_id)
Record whether each comment has a reply and capture the reply text (no classification at this stage)
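
The filtering and reply-matching steps above can be sketched in Python for illustration. The bundled script is PowerShell, so this is not its actual code. It assumes each comment is a dict shaped like the GitHub pull request review comment payload (id, user.login, in_reply_to_id, body). The CommentId output key is a hypothetical name; HasReply, CommentBody, and HumanReplyText follow the field names used elsewhere in this document.

```python
from typing import Any

COPILOT_LOGIN = "Copilot"  # login named in this document


def pair_comments_with_replies(comments: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one record per top-level Copilot comment, with human replies attached."""
    # Index human replies by the id of the comment they answer.
    replies_by_parent: dict[int, list[dict[str, Any]]] = {}
    for c in comments:
        parent = c.get("in_reply_to_id")
        if parent and c["user"]["login"] != COPILOT_LOGIN:
            replies_by_parent.setdefault(parent, []).append(c)

    records = []
    for c in comments:
        is_top_level = not c.get("in_reply_to_id")  # null, 0, or missing
        if c["user"]["login"] == COPILOT_LOGIN and is_top_level:
            replies = replies_by_parent.get(c["id"], [])
            records.append({
                "CommentId": c["id"],  # hypothetical key name
                "CommentBody": c["body"],
                "HasReply": bool(replies),
                "HumanReplyText": "\n".join(r["body"] for r in replies),
            })
    return records
```

Copilot's own replies are skipped on both sides: they are neither counted as top-level comments nor treated as human replies.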

Outputs:

$env:TEMP\copilot-review-analysis\raw_results.json — all comments with HasReply flag and raw reply text
$env:TEMP\copilot-review-analysis\review_summaries.json — PR-level summary comments (for reference)

Phase 2: Diff-Level Verification

Script: scripts/precise.ps1

For every comment with no reply, verify whether the engineer silently acted on the feedback:

.\.github\skills\copilot-review-analyst\scripts\precise.ps1

What it does:

Load raw_results.json, filter to comments where HasReply = false
For each comment, get the commit SHA it was left on and the PR head SHA
Use repos/{slug}/compare/{commitA}...{commitB} to get the diff
For suggestion blocks: extract code tokens, check if they appear as + lines in the diff
For prose comments: check if diff hunk line ranges overlap the comment's line range (±5 line tolerance)
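
The two checks above can be sketched in Python as follows. This is a minimal illustration, not the logic of precise.ps1. It assumes the input is the unified-diff patch text for one file, and that the comment's line numbers refer to the old side of that diff, because the comment was left on the earlier commit. The helper names and the min_len token filter are hypothetical.

```python
import re

# Captures the old-side start and count from a header like "@@ -10,3 +10,4 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")


def old_side_ranges(patch: str) -> list[tuple[int, int]]:
    """Return (first_line, last_line) for each hunk on the old side of the diff."""
    ranges = []
    for line in patch.splitlines():
        m = HUNK_HEADER.match(line)
        if m:
            start = int(m.group(1))
            count = int(m.group(2)) if m.group(2) is not None else 1
            # A count of 0 is a pure insertion; treat it as touching one line.
            ranges.append((start, start + max(count, 1) - 1))
    return ranges


def overlaps_comment(patch: str, first_line: int, last_line: int, tolerance: int = 5) -> bool:
    """True if any hunk falls within the comment's lines, widened by the tolerance."""
    lo, hi = first_line - tolerance, last_line + tolerance
    return any(start <= hi and end >= lo for start, end in old_side_ranges(patch))


def suggestion_tokens_added(suggestion: str, patch: str, min_len: int = 4) -> bool:
    """True if every identifier-like token of the suggestion appears on a + line."""
    tokens = {t for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", suggestion) if len(t) >= min_len}
    added = "\n".join(
        l[1:] for l in patch.splitlines() if l.startswith("+") and not l.startswith("+++")
    )
    return bool(tokens) and all(t in added for t in tokens)
```

Hunk ranges include context lines, so the effective tolerance is slightly wider than the nominal 5 lines.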

Verdicts assigned:

suggestion-applied — suggestion tokens match diff + lines overlap
suggestion-likely-applied — tokens match but lines don't overlap
exact-lines-modified — prose comment's lines were modified
lines-modified-different-fix — nearby lines modified, different code
file-changed-elsewhere — file modified but at different lines
file-changed-no-line-info — file modified but comment had no line number
file-not-changed — file untouched after the comment
no-subsequent-commits — PR merged without any commits after the review

Output: $env:TEMP\copilot-review-analysis\precise.json

Phase 3: AI Reply Classification

This phase is performed by the agent (you) in conversation. Classify every replied comment by reading the full Copilot comment and engineer reply in context.

See references/classification-rules.md for detailed guidance on what counts as helpful vs not-helpful.

Process:

Load raw_results.json and filter to HasReply -eq true
For each comment, read the CommentBody (Copilot's feedback) and HumanReplyText (engineer's reply)
Classify as helpful or not-helpful based on the engineer's intent (read the full reply, don't keyword-match)
Also review file-changed-elsewhere and file-changed-no-line-info verdicts from Phase 2 to identify re-audit flips

Outputs (write to $env:TEMP\copilot-review-analysis\):

reply-verdicts.json — { "commentId": "helpful"|"not-helpful", ... } for every replied comment
reaudit-flips.json — { "reauditFlipKeys": ["repo/prNum/filePattern", ...] } for no-reply comments with strong evidence

See references/manual-audit-template.json for the schema.
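
As an illustration of the two shapes described above, the Python sketch below builds both objects and checks that every verdict is one of the two allowed values. The comment IDs and the flip key are invented placeholders, and validate_reply_verdicts is a hypothetical helper. references/manual-audit-template.json remains the authoritative schema.

```python
import json

ALLOWED_VERDICTS = {"helpful", "not-helpful"}

# Placeholder IDs for illustration only; real keys are GitHub comment IDs.
reply_verdicts = {
    "1000000001": "helpful",
    "1000000002": "not-helpful",
}

# Placeholder key in the "repo/prNum/filePattern" form described above.
reaudit_flips = {"reauditFlipKeys": ["common/1234/StorageHelper"]}


def validate_reply_verdicts(data: dict[str, str]) -> dict[str, str]:
    """Raise ValueError if any verdict is outside the two allowed values."""
    bad = {k: v for k, v in data.items() if v not in ALLOWED_VERDICTS}
    if bad:
        raise ValueError(f"unexpected verdicts: {bad}")
    return data


print(json.dumps(validate_reply_verdicts(reply_verdicts), indent=2))
print(json.dumps(reaudit_flips, indent=2))
```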

Phase 4: Final Classification

Script: scripts/final-classification.ps1

Merge all results into a single authoritative dataset:

.\.github\skills\copilot-review-analyst\scripts\final-classification.ps1 `
    -AccountMapFile ".github\skills\copilot-review-analyst\references\account-map.json" `
    -ReplyVerdictsFile "$env:TEMP\copilot-review-analysis\reply-verdicts.json" `
    -ReauditFlipsFile "$env:TEMP\copilot-review-analysis\reaudit-flips.json"

Parameters:

-OutputDir — Directory with raw_results.json and precise.json (default: $env:TEMP\copilot-review-analysis)
-AccountMapFile — Path to JSON mapping GitHub logins to display names. See references/account-map.json.
-ReplyVerdictsFile — Path to JSON with Phase 3 AI verdicts for replied comments. If omitted, replied comments are "unknown".
-ReauditFlipsFile — Path to JSON with Phase 3 re-audit flips for no-reply comments. If omitted, file-changed-elsewhere defaults to "not-helpful".

What it does:

Load raw_results.json and precise.json
Load account mapping, reply verdicts (Phase 3), and re-audit flips (Phase 3)
For replied comments: use the AI verdict from reply-verdicts.json
For no-reply comments: use the Phase 2 diff verdict, with re-audit overrides
Produce per-engineer and per-repo statistics
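
The merge rules above can be sketched in Python as below. This illustrates the decision order only and is not the code of final-classification.ps1. Several details are assumptions: which diff verdicts count as helpful, the Repo, PrNumber, Path, and CommentId field names, and matching filePattern as a substring of the file path.

```python
from typing import Any

# Assumption: these Phase 2 verdicts are treated as evidence of a fix.
HELPFUL_DIFF_VERDICTS = {
    "suggestion-applied",
    "suggestion-likely-applied",
    "exact-lines-modified",
}


def final_verdict(
    comment: dict[str, Any],
    reply_verdicts: dict[str, str],
    diff_verdicts: dict[str, str],
    flip_keys: list[str],
) -> str:
    """Combine the Phase 3 reply verdict, Phase 2 diff verdict, and re-audit flips."""
    cid = str(comment["CommentId"])
    if comment["HasReply"]:
        # Without a Phase 3 verdict, a replied comment stays "unknown".
        return reply_verdicts.get(cid, "unknown")

    # Re-audit flips override the diff verdict for no-reply comments.
    prefix = f'{comment["Repo"]}/{comment["PrNumber"]}/'
    for key in flip_keys:
        if key.startswith(prefix) and key[len(prefix):] in comment["Path"]:
            return "helpful"

    return "helpful" if diff_verdicts.get(cid) in HELPFUL_DIFF_VERDICTS else "not-helpful"
```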

Output: $env:TEMP\copilot-review-analysis\final_classification.json

Phase 4.5: Append History

Script: scripts/append-history.ps1

Append a snapshot of the current run to the persistent history file for trend tracking across runs:

.\.github\skills\copilot-review-analyst\scripts\append-history.ps1 `
    -PeriodStart "2026-01-24" `
    -PeriodEnd "2026-03-25"

Parameters:

-PeriodStart — First day of the analysis period (YYYY-MM-DD). This is the -StartDate that was passed to analyze.ps1 in Phase 1. Required.
-PeriodEnd — Last day of the analysis period (YYYY-MM-DD). Defaults to today.
-InputDir — Directory with final_classification.json and precise.json (default: $env:TEMP\copilot-review-analysis)
-HistoryFile — Path to persistent history JSON (default: ~/.copilot-review-analysis/history.json)

Important: Periods are non-overlapping and variable-length. Each run covers a distinct period — e.g., the first run might cover 60 days, subsequent runs might cover 2 weeks each. The -PeriodStart for a new run should be the day after the previous run's -PeriodEnd. The script automatically deduplicates entries with the same period.

What it does:

Load final_classification.json and precise.json
Compute aggregate stats: response rate, three-way breakdown, per-repo, per-engineer
Compute normalized volume: comments/week for fair comparison across different period lengths
Load existing history.json (or create empty if first run)
Append current snapshot; replace if same period already exists
Save sorted by period (newest first)
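
The normalization and replace-or-append steps can be sketched in Python as follows. This is illustrative only and is not the code of append-history.ps1. It assumes the period length counts both the first and the last day, and that a snapshot is identified by its periodStart and periodEnd pair. The totals used in the usage example below are made-up numbers.

```python
from datetime import date
from typing import Any


def comments_per_week(total: int, period_start: str, period_end: str) -> float:
    """Normalize a comment count to a weekly rate over an inclusive date range."""
    days = (date.fromisoformat(period_end) - date.fromisoformat(period_start)).days + 1
    return round(total / days * 7, 1)


def upsert_snapshot(history: list[dict[str, Any]], snapshot: dict[str, Any]) -> list[dict[str, Any]]:
    """Replace a snapshot covering the same period, else append; newest period first."""
    period = (snapshot["periodStart"], snapshot["periodEnd"])
    kept = [s for s in history if (s["periodStart"], s["periodEnd"]) != period]
    kept.append(snapshot)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(kept, key=lambda s: s["periodStart"], reverse=True)
```

For example, 90 comments over the 14 days from 2026-03-26 to 2026-04-08 normalize to 45.0 comments per week. That rate can be compared directly with one computed from a 60-day run.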

Output: ~/.copilot-review-analysis/history.json

The history file is an array of snapshots. Each snapshot contains:

Period metadata: runDate, periodStart, periodEnd, periodDays
Volume: total, commentsPerWeek, reviewedPRs, avgCommentsPerPR
Rates: responseRate, helpful.pct, notHelpful.pct, unresolved.pct, repliedHelpfulRate
Per-repo breakdown: repos.{broker,common,msal}.{comments, responseRate, helpfulPct, ...}
Per-engineer breakdown: engineers.{name}.{comments, responseRate, helpfulPct}

Phase 5: Report Generation

Generate both Markdown and Outlook-compatible HTML reports.

Style/structure references (in assets/ — these contain data from the Jan-Mar 2026 analysis and serve as structural templates, NOT to be copied verbatim):

assets/Copilot-Code-Review-Effectiveness-Report.md — Markdown reference (~270 lines, ~3300 words)
assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html — Outlook HTML reference (~430 lines, ~42KB)
assets/Copilot-Code-Review-Effectiveness-Report.html — Standard HTML reference

Important: The asset templates contain hardcoded numbers (557 comments, specific percentages, engineer names, etc.) from the first analysis. For each new run, generate fresh reports using the same section structure and formatting patterns but with statistics computed from final_classification.json.

MANDATORY: Read Templates Before Generating

You MUST read both the Markdown and Outlook HTML template files in full before generating any report. Do not generate from memory or from the section table in report-formatting.md alone — the templates contain critical patterns that are not captured in the section list:

Narrative prose paragraphs between every table explaining the significance of the data (not just "here's a table")
Callout boxes (blue for insights, yellow for warnings) that frame key findings
Background section with team context and Copilot enablement framing
"At a Glance" summary cards with a detailed adoption callout underneath
Scope metric cards (Human PRs, PRs reviewed, Total comments, Avg per PR)
Full Copilot comment text in examples (not truncated snippets)
Explanatory paragraphs after examples describing why the comment was helpful/unhelpful
Detailed Recommendations section with prose paragraphs (not bullet points)
Bar chart visualizations (response rate bar, helpfulness verdict bar, per-repo bars) in HTML

Quality Gate

After generating each report, compare its dimensions against the template:

Markdown: Must be ≥250 lines and ≥3000 words. If significantly smaller, the report is missing narrative depth.
Outlook HTML: Must be ≥400 lines and ≥35KB. If significantly smaller, the report is missing visual elements or prose.

If a report doesn't meet these thresholds, re-read the template and identify what's missing before saving.
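
A minimal Python sketch of the threshold check, assuming words are whitespace-separated tokens and 1 KB is 1024 bytes of UTF-8. The function names are hypothetical, and the defaults mirror the thresholds stated above.

```python
def meets_markdown_gate(text: str, min_lines: int = 250, min_words: int = 3000) -> bool:
    """Check the Markdown report against the line and word thresholds."""
    return len(text.splitlines()) >= min_lines and len(text.split()) >= min_words


def meets_outlook_gate(html: str, min_lines: int = 400, min_kb: float = 35) -> bool:
    """Check the Outlook HTML report against the line and size thresholds."""
    size_kb = len(html.encode("utf-8")) / 1024
    return len(html.splitlines()) >= min_lines and size_kb >= min_kb
```

A failing result only signals that something is missing. Finding the missing narrative or visual elements still requires comparing the report against the template.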

Generate two versions of each report:

Team-internal — uses real engineer names (from account map). For the team.
Org-wide — anonymizes engineers as "Engineer A", "Engineer B", etc., sorted by helpfulness descending. For sharing outside the team.
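
The anonymization rule for the org-wide version can be sketched in Python as below. It assumes the input maps each engineer's display name to a helpfulness percentage and that there are at most 26 engineers. The function name and the sample names in the usage note are hypothetical.

```python
import string


def anonymize(helpful_pct_by_engineer: dict[str, float]) -> dict[str, str]:
    """Map real names to "Engineer A", "Engineer B", ... by helpfulness, descending."""
    ranked = sorted(
        helpful_pct_by_engineer,
        key=lambda name: helpful_pct_by_engineer[name],
        reverse=True,
    )
    if len(ranked) > len(string.ascii_uppercase):
        raise ValueError("more than 26 engineers; extend the labeling scheme")
    return {name: f"Engineer {string.ascii_uppercase[i]}" for i, name in enumerate(ranked)}
```

Given {"dev-one": 40.0, "dev-two": 75.0}, the engineer with 75.0 becomes "Engineer A". Python's sort is stable, so engineers with equal percentages keep their input order.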

Process:

Load final_classification.json
Read full template files (assets/Copilot-Code-Review-Effectiveness-Report.md and assets/Copilot-Code-Review-Effectiveness-Report-Outlook.html) — do NOT skip this step
Compute aggregate statistics (total, per-repo, per-engineer, three-way breakdown)
Load raw_results.json to collect full comment text for 4-5 helpful and 4-5 unhelpful examples
Load ~/.copilot-review-analysis/history.json for trend data. If ≥2 entries exist, generate a Trend section comparing the current run with the previous run(s). See references/report-formatting.md for trend section formatting rules.
Generate reports matching the template's section structure, narrative depth, and visual formatting
Verify dimensions (word count, line count) against the quality gate thresholds
Save to ~/.copilot-review-analysis/:
- Copilot-Code-Review-Effectiveness-Report.md (team, real names)
- Copilot-Code-Review-Effectiveness-Report-Anonymous.md (org-wide)
- Copilot-Code-Review-Effectiveness-Report-Outlook.html (team, real names)
- Copilot-Code-Review-Effectiveness-Report-Outlook-Anonymous.html (org-wide)

See references/report-formatting.md for the report structure and Outlook HTML formatting rules.

Key Principle: "Unresolved" ≠ "Not Helpful"

Comments with no reply and no diff evidence are Unresolved, not assumed unhelpful. This is a critical distinction:

Confirmed Helpful = positive evidence (explicit acknowledgment OR verified fix in diff)
Confirmed Not Helpful = positive evidence (explicit dismissal with stated reason OR comment on stale code)
Unresolved = insufficient evidence either way (engineer never engaged)

The final-classification.ps1 script classifies no-response/no-diff-evidence comments as "not-helpful" for conservative stats, but the report should present the three-way breakdown to be honest about uncertainty.
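
One way to derive the three-way breakdown for the report is sketched below in Python. This is an assumption about how the evidence could be combined, not the script's logic. A reply verdict takes precedence, a verified diff fix counts as helpful, and everything else is unresolved. The stale-code case for Confirmed Not Helpful is not modeled here.

```python
from collections import Counter
from typing import Optional

LABELS = ("helpful", "not-helpful", "unresolved")


def three_way_label(has_reply: bool, reply_verdict: Optional[str], diff_confirms_fix: bool) -> str:
    """Label a comment from reply and diff evidence; no evidence means unresolved."""
    if has_reply and reply_verdict in ("helpful", "not-helpful"):
        return reply_verdict
    if diff_confirms_fix:
        return "helpful"
    return "unresolved"


def three_way_breakdown(labels: list[str]) -> dict[str, dict[str, float]]:
    """Count each label and express it as a percentage of all comments."""
    counts = Counter(labels)
    total = len(labels)
    return {
        label: {
            "count": counts[label],
            "pct": round(100 * counts[label] / total, 1) if total else 0.0,
        }
        for label in LABELS
    }
```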

Copilot Comment Identification

Copilot inline review comments use user.login = "Copilot"
Legacy bot: copilot-pull-request-reviewer[bot]
Bot PR authors to exclude: app/copilot-swe-agent, dependabot[bot], github-actions[bot]
Only count top-level comments (in_reply_to_id is null/0), not Copilot's own replies

Rate Limiting

The scripts call the GitHub API heavily. Built-in mitigations:

PR comment caching (fetched once per PR)
Diff caching (fetched once per commit range)
Sleep every 15 PRs (300ms) and every 25 diff checks (200ms)
Use --paginate for repos with many comments

If you hit rate limits, increase the sleep intervals or set GH_TOKEN to a token with a higher rate limit.
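
The caching and periodic-sleep pattern can be sketched in Python as below. This is a generic illustration, not the scripts' implementation. ThrottledCache and its parameters are hypothetical names, and the fetch callable stands in for whatever performs the API call. The defaults mirror the "every 15 PRs (300ms)" setting above.

```python
import time
from typing import Any, Callable, Hashable


class ThrottledCache:
    """Fetch each key at most once and pause after every N real fetches."""

    def __init__(
        self,
        fetch: Callable[[Hashable], Any],
        every: int = 15,
        pause_s: float = 0.3,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._every = every
        self._pause_s = pause_s
        self._sleep = sleep  # injectable so tests do not actually wait
        self._cache: dict[Hashable, Any] = {}
        self._fetched = 0

    def get(self, key: Hashable) -> Any:
        if key in self._cache:
            return self._cache[key]  # cache hit: no API call, no pause
        value = self._fetch(key)
        self._cache[key] = value
        self._fetched += 1
        if self._fetched % self._every == 0:
            self._sleep(self._pause_s)
        return value
```

Only cache misses count toward the pause interval, so repeated lookups for the same PR or commit range add no delay.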
