Status: Living document
Parent PRD: docs/PRD.md
This document defines two categories of tests for evaluating Stet's code review quality, model performance, and user experience:
- Automated tests — Scriptable tests that a coding LLM can implement without human judgment. Each entry includes inputs, outputs, schemas, CLI invocations, and implementation steps.
- Human tests — Experiments that require human judgment, labeling, or subjective assessment. Each entry includes experiment design, a step-by-step protocol, sample size guidance, and analysis steps.
- Output format: Use `--output=json` or `--json` for machine-parseable JSON on stdout. See cli-extension-contract.md.
- Streaming: Use `--stream` with `--json` for NDJSON events (`progress`, `finding`, `done`).
- Exit codes: 0 = success, 1 = usage/error, 2 = Ollama unreachable.
- JSON shape: `{"findings": [...]}`; each finding has `id`, `file`, `line`, `range`, `severity`, `category`, `confidence`, `message`, `suggestion`, `cursor_uri`.
- Model: `STET_MODEL` or `.review/config.toml` `model = "..."`. Default: `qwen3-coder:30b`. See cli/internal/config/config.go.
- RAG: `STET_RAG_SYMBOL_MAX_DEFINITIONS` (default 10; 0 = disable) or `--rag-symbol-max-definitions=0` on `stet start`/`stet run`.
- Temperature: `STET_TEMPERATURE` (default 0.2); use 0 for deterministic runs.
- Finding: cli/internal/findings/finding.go — `id`, `file`, `line`, `range`, `severity`, `category`, `confidence`, `message`, `suggestion`, `cursor_uri`.
- History Record: cli/internal/history/schema.go — `diff_ref`, `review_output`, `user_action.dismissed_ids`, `user_action.dismissals[]` with `finding_id` and `reason`.
- Dismissal reasons: `false_positive`, `already_correct`, `wrong_suggestion`, `out_of_scope`.
- `.review/session.json` — Session state (baseline, findings, dismissed_ids).
- `.review/history.jsonl` — One JSON object per line; appended on dismiss and finish.
- `stet start [ref]` — Start review; default ref is HEAD.
- `stet run` — Incremental re-review.
- `stet finish` — End session, remove worktree.
- `stet dismiss <id> [reason]` — Mark finding as dismissed with optional reason.
- `stet status --ids` — List active finding IDs.
- `stet list` — Same as `status --ids` for active findings.
Each automated test is specified so a coding LLM can implement it. Run all commands from the repository root.
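Most of the automated tests below share the same pattern: run a `stet` command, parse the JSON printed on stdout, and post-process the `findings` array. A minimal Python helper sketch, assuming the CLI contract summarized above (the function names here are illustrative, not part of Stet):

```python
# Illustrative helpers only; they assume the CLI contract summarized above
# (stet start <ref> --json prints {"findings": [...]} on stdout, exit 0 on success).
import json
import os
import subprocess

def run_stet_start(ref: str, model: str | None = None, extra_args: list[str] | None = None) -> list[dict]:
    """Run `stet start <ref> --json` and return the parsed findings array."""
    env = dict(os.environ)
    if model:
        env["STET_MODEL"] = model
    cmd = ["stet", "start", ref, "--json", *(extra_args or [])]
    result = subprocess.run(cmd, capture_output=True, text=True, env=env, check=True)
    return json.loads(result.stdout)["findings"]

def stet_finish() -> None:
    """End any active session; tolerate the case where no session exists."""
    subprocess.run(["stet", "finish"], capture_output=True, text=True)
```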
Purpose: Compare two models on the same diff range. Measures finding counts, overlap (exact and fuzzy), and unique findings per model. Used to decide which model to adopt (e.g., qwen3-coder:30b vs qwen2.5-coder:32b).
Prerequisites: Repo has committed changes; clean worktree; both models pulled in Ollama; stet in PATH.
Inputs:
- `repo_root`: Path to Git repo.
- `ref`: Baseline ref (e.g., `HEAD~5`).
- `model_a`, `model_b`: Ollama model names (e.g., `qwen2.5-coder:32b`, `qwen3-coder:30b`).
Algorithm:
- `cd repo_root`.
- If session exists (e.g., `stet status` exits 0), run `stet finish`. If `stet status` exits 1 with "No active session", skip.
- `export STET_MODEL=model_a`; run `stet start ref --json`; capture stdout; parse JSON; extract `findings`; save to `findings_a.json`. Redirect stderr to /dev/null or a log.
- `stet finish`.
- `export STET_MODEL=model_b`; run `stet start ref --json`; capture stdout; parse JSON; extract `findings`; save to `findings_b.json`.
- Compute: `count_a`, `count_b`; `overlap_exact` = findings with same `file` and `line` (or `range`) in both; `overlap_fuzzy` = findings with same `file` and similar `message` (e.g., normalized substring or cosine); `unique_a`, `unique_b` = findings only in A or B.
- Output comparison JSON.
Output format:
```json
{
  "model_a": "qwen2.5-coder:32b",
  "model_b": "qwen3-coder:30b",
  "ref": "HEAD~5",
  "count_a": 12,
  "count_b": 10,
  "overlap_exact": 5,
  "overlap_fuzzy": 6,
  "unique_a": 6,
  "unique_b": 4
}
```

Pass/fail: Report-only.
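A minimal sketch of the comparison step, assuming `findings_a.json` and `findings_b.json` each hold one model's findings array as saved above; the similarity measure and threshold for fuzzy matching are illustrative choices:

```python
# Sketch only: exact overlap by (file, line), fuzzy overlap by same file plus
# similar normalized message. Threshold and similarity measure are illustrative.
import json
from difflib import SequenceMatcher

def load(path):
    with open(path) as fh:
        return json.load(fh)

def norm(msg):
    return " ".join(msg.lower().split())

def compare(a, b, threshold=0.8):
    keys_a = {(f["file"], f["line"]) for f in a}
    keys_b = {(f["file"], f["line"]) for f in b}
    fuzzy = sum(
        any(x["file"] == y["file"] and
            SequenceMatcher(None, norm(x["message"]), norm(y["message"])).ratio() >= threshold
            for y in b)
        for x in a
    )
    return {
        "count_a": len(a), "count_b": len(b),
        "overlap_exact": len(keys_a & keys_b),
        "overlap_fuzzy": fuzzy,
        "unique_a": len(keys_a - keys_b),
        "unique_b": len(keys_b - keys_a),
    }

print(json.dumps(compare(load("findings_a.json"), load("findings_b.json")), indent=2))
```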
Purpose: Compute actionability rate and per-reason breakdown from .review/history.jsonl. Actionability = share of findings not dismissed. Used to track how often users find findings useful over time.
Prerequisites: .review/history.jsonl exists and has records from finished sessions.
Inputs:
- `state_dir`: Path to `.review/` (default: `repo_root/.review`).
- `history_path`: `state_dir/history.jsonl`.
Algorithm:
- Read `history_path` line by line.
- For each line: parse JSON as a history Record. Extract `review_output` (array of findings) and `user_action.dismissed_ids` (array of strings). Optionally extract `user_action.dismissals[]` for per-finding `reason`.
- For each Record: `total = len(review_output)`; `dismissed = len(user_action.dismissed_ids)`; `actionability = 1 - dismissed/total` when total > 0.
- Aggregate: mean actionability, total findings, total dismissed; per-reason counts from `dismissals[].reason` (false_positive, already_correct, wrong_suggestion, out_of_scope).
- Note: `history.jsonl` does not store the model name; to compare models, tag sessions externally (e.g., log the model in a separate file keyed by `diff_ref` or timestamp).
Output format:
```json
{
  "records_processed": 20,
  "total_findings": 150,
  "total_dismissed": 45,
  "actionability_rate": 0.7,
  "reason_breakdown": {
    "false_positive": 25,
    "already_correct": 10,
    "wrong_suggestion": 7,
    "out_of_scope": 3
  }
}
```

Pass/fail: Report-only.
The `stet stats quality` command implements this; use it for ongoing tracking.
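A sketch of the aggregation, assuming each line of `history.jsonl` matches the Record fields listed in the schema notes above:

```python
# Sketch only: mean per-record actionability plus a per-reason breakdown.
import json
from collections import Counter

def actionability_report(history_path=".review/history.jsonl"):
    records = 0
    rates = []
    total_findings = total_dismissed = 0
    reasons = Counter()
    with open(history_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            records += 1
            total = len(rec.get("review_output", []))
            action = rec.get("user_action", {})
            dismissed = len(action.get("dismissed_ids", []))
            if total > 0:
                rates.append(1 - dismissed / total)
            total_findings += total
            total_dismissed += dismissed
            for d in action.get("dismissals", []):
                reasons[d.get("reason", "unknown")] += 1
    return {
        "records_processed": records,
        "total_findings": total_findings,
        "total_dismissed": total_dismissed,
        "actionability_rate": round(sum(rates) / len(rates), 2) if rates else None,
        "reason_breakdown": dict(reasons),
    }

print(json.dumps(actionability_report(), indent=2))
```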
Purpose: Measure wall-clock time per run and optionally per-hunk. Used to compare model speed and estimate review duration.
Prerequisites: Repo with non-empty diff; Ollama running; model pulled.
Inputs:
- `repo_root`, `ref` (e.g., `HEAD~3`).
- `model`: Ollama model name.
Algorithm:
- `cd repo_root`; ensure clean session (`stet finish` if needed).
- Run `stet start ref --json` with `STET_MODEL=model`; capture stderr and measure wall-clock (e.g., `time` command or process start/end timestamps).
- Parse stderr for lines matching "Reviewing hunk N/M" to infer total hunks M and per-hunk timing if timestamps are available. Alternatively, use `--stream` and parse NDJSON; record timestamp of first `finding` and last `done` to compute duration.
- Compute: `total_seconds`, `hunks_reviewed` (from progress messages or finding count proxy), `seconds_per_hunk = total_seconds / hunks_reviewed` when hunks > 0.
Output format:
```json
{
  "model": "qwen3-coder:30b",
  "ref": "HEAD~3",
  "total_seconds": 120.5,
  "hunks_reviewed": 8,
  "seconds_per_hunk": 15.06
}
```

Pass/fail: Report-only.
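A timing sketch; the hunk-count extraction assumes the "Reviewing hunk N/M" progress lines described above appear on stderr, and falls back to the finding count otherwise:

```python
# Sketch only: wall-clock timing around a single review run.
import json, os, re, subprocess, time

env = dict(os.environ, STET_MODEL="qwen3-coder:30b")  # model under test
start = time.monotonic()
result = subprocess.run(["stet", "start", "HEAD~3", "--json"],
                        capture_output=True, text=True, env=env, check=True)
total_seconds = time.monotonic() - start

findings = json.loads(result.stdout)["findings"]
hunk_totals = re.findall(r"Reviewing hunk \d+/(\d+)", result.stderr)
hunks = int(hunk_totals[-1]) if hunk_totals else len(findings)  # finding count as a rough proxy
print(json.dumps({
    "model": env["STET_MODEL"],
    "ref": "HEAD~3",
    "total_seconds": round(total_seconds, 1),
    "hunks_reviewed": hunks,
    "seconds_per_hunk": round(total_seconds / hunks, 2) if hunks else None,
}, indent=2))
```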
Purpose: Measure consistency of findings across multiple runs of the same model on the same diff. Low consistency suggests high variance; useful before comparing models.
Prerequisites: Repo with committed diff; Ollama running; model pulled.
Inputs:
- `repo_root`, `ref`.
- `model`: Ollama model name.
- `runs`: Number of runs (e.g., 3).
- Use `STET_TEMPERATURE=0` for deterministic sampling.
Algorithm:
- For i in 1..runs: `stet finish` if needed; `STET_MODEL=model STET_TEMPERATURE=0 stet start ref --json`; parse stdout; save findings to `findings_i.json`; `stet finish`.
- Build sets of finding keys: e.g., `(file, line, message_normalized)` or `id` if stable.
- Compute Jaccard similarity between each pair of runs: |A ∩ B| / |A ∪ B|. Report mean and min Jaccard across pairs.
- Optionally: count findings that appear in all runs vs only some runs.
Output format:
```json
{
  "model": "qwen3-coder:30b",
  "runs": 3,
  "jaccard_mean": 0.85,
  "jaccard_min": 0.78,
  "findings_in_all_runs": 6,
  "findings_in_some_runs": 4
}
```

Pass/fail: Report-only.
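A Jaccard sketch over per-run key sets, assuming the per-run findings were saved to `findings_1.json` through `findings_3.json` as described above:

```python
# Sketch only: keys are (file, line, normalized message); Jaccard per run pair.
import json
from itertools import combinations

def run_keys(path):
    with open(path) as fh:
        findings = json.load(fh)
    return {(f["file"], f["line"], " ".join(f["message"].lower().split())) for f in findings}

runs = [run_keys(f"findings_{i}.json") for i in range(1, 4)]  # runs = 3
pairwise = [len(a & b) / len(a | b) if a | b else 1.0 for a, b in combinations(runs, 2)]
in_all = set.intersection(*runs)
in_any = set.union(*runs)
print(json.dumps({
    "runs": len(runs),
    "jaccard_mean": round(sum(pairwise) / len(pairwise), 2),
    "jaccard_min": round(min(pairwise), 2),
    "findings_in_all_runs": len(in_all),
    "findings_in_some_runs": len(in_any - in_all),
}, indent=2))
```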
Purpose: Compare the distribution of findings by category and severity across runs or models. Identifies if a model over-flags certain categories (e.g., maintainability) or under-flags others (e.g., security).
Prerequisites: Findings JSON from one or more runs (e.g., from stet start --json or saved output).
Inputs:
- One or more JSON files or objects with a `findings` array.
Algorithm:
- Parse each findings array.
- For each finding: increment counter for `finding.category` and `finding.severity`.
- Output counts per category and per severity; optionally normalized percentages.
Output format:
```json
{
  "source": "findings.json",
  "by_category": {
    "security": 2,
    "correctness": 5,
    "maintainability": 8,
    "best_practice": 3
  },
  "by_severity": {
    "error": 1,
    "warning": 6,
    "info": 10,
    "nitpick": 1
  }
}
```

Pass/fail: Report-only.
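A tally sketch, assuming `findings.json` holds either the full `{"findings": [...]}` object or a bare findings array:

```python
# Sketch only: counts per category and per severity.
import json
from collections import Counter

with open("findings.json") as fh:
    data = json.load(fh)
findings = data["findings"] if isinstance(data, dict) else data

print(json.dumps({
    "source": "findings.json",
    "by_category": dict(Counter(f.get("category", "unknown") for f in findings)),
    "by_severity": dict(Counter(f.get("severity", "unknown") for f in findings)),
}, indent=2))
```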
Purpose: Verify that the same hunk produces the same finding ID across runs. Important for dismissals and history to remain stable.
Prerequisites: Repo with committed diff; STET_TEMPERATURE=0 for deterministic output.
Inputs:
- `repo_root`, `ref`, `model`.
Algorithm:
- Run `stet start ref --json` twice with same model and temperature 0; `stet finish` between runs.
- Parse both outputs; build maps `file:line -> [(id, message), ...]` (or use `id` as key if one finding per location).
- For matching file:line (and optionally message): assert `id` is identical. Report any mismatches.
Output format:
```json
{
  "stable": true,
  "mismatches": [],
  "total_findings_run1": 5,
  "total_findings_run2": 5
}
```

Pass/fail: Fail if any `id` differs for the same file:line+message.
Purpose: Ensure --dry-run produces valid JSON conforming to the findings schema. Used in CI when Ollama is not available.
Prerequisites: Repo with at least one hunk in diff; stet in PATH. Ollama not required.
Inputs:
- `repo_root`, `ref` (e.g., `HEAD~1`).
Algorithm:
- `stet finish` if session exists.
- Run `stet start ref --dry-run --json`; capture stdout; expect exit 0.
- Parse JSON; assert top-level has `findings` array.
- For each finding: assert required fields present (`file`, `severity`, `category`, `confidence`, `message`). Assert `severity` in allowed set; `category` in allowed set; `confidence` in [0, 1].
- If diff has hunks: assert `findings` is non-empty (dry-run typically emits one finding per hunk).
Output format: Pass/fail plus optional schema validation report.
Pass/fail: Fail if JSON invalid, schema violated, or findings empty when hunks exist.
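A validation sketch; the allowed severity and category sets below mirror the values used in the examples in this document and should be confirmed against the Finding schema:

```python
# Sketch only: exits non-zero on any violation so a CI job fails. Assumes the
# chosen ref produces at least one hunk.
import json, subprocess, sys

SEVERITIES = {"error", "warning", "info", "nitpick"}          # per examples in this doc
CATEGORIES = {"security", "correctness", "maintainability", "best_practice"}

result = subprocess.run(["stet", "start", "HEAD~1", "--dry-run", "--json"],
                        capture_output=True, text=True)
if result.returncode != 0:
    sys.exit(f"stet exited {result.returncode}: {result.stderr}")

findings = json.loads(result.stdout)["findings"]
errors = []
for i, f in enumerate(findings):
    for field in ("file", "severity", "category", "confidence", "message"):
        if field not in f:
            errors.append(f"finding {i}: missing {field}")
    if f.get("severity") not in SEVERITIES:
        errors.append(f"finding {i}: unexpected severity {f.get('severity')!r}")
    if f.get("category") not in CATEGORIES:
        errors.append(f"finding {i}: unexpected category {f.get('category')!r}")
    if not 0 <= f.get("confidence", -1) <= 1:
        errors.append(f"finding {i}: confidence outside [0, 1]")
if not findings:
    errors.append("findings is empty (expected non-empty when the diff has hunks)")
if errors:
    sys.exit("\n".join(errors))
print("dry-run schema check passed")
```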
Purpose: Run the model N times and merge findings (union by file:line or by semantic similarity). Evaluates whether aggregation boosts recall (as in SWR-Bench). Optionally compare against ground-truth for precision/recall.
Prerequisites: Repo with diff; Ollama running; optional: ground-truth JSON with expected findings (file, line, message or id).
Inputs:
- `repo_root`, `ref`, `model`, `runs` (e.g., 3).
- Optional: `ground_truth.json` with `{"expected": [{"file": "...", "line": N, "message": "..."}]}`.
Algorithm:
- Run `stet start ref --json` N times with same model; collect all findings.
- Merge: union by `(file, line)` or by normalized message similarity to deduplicate.
- If ground truth provided: match each expected item to merged findings (file:line match or message similarity); compute TP, FP, FN; precision = TP/(TP+FP), recall = TP/(TP+FN), F1.
- Output merged count, and if ground truth: precision, recall, F1.
Output format:
```json
{
  "runs": 3,
  "merged_findings_count": 15,
  "single_run_avg_count": 10,
  "ground_truth_provided": true,
  "precision": 0.8,
  "recall": 0.75,
  "f1": 0.77
}
```

Pass/fail: Report-only; optional fail if F1 below threshold when ground truth provided.
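A merge-and-score sketch using exact (file, line) matching; message-similarity matching is a straightforward extension. The per-run file names are assumptions; `ground_truth.json` is the optional input described above:

```python
# Sketch only: union of N runs plus optional precision/recall/F1 against ground truth.
import json

def load(path):
    with open(path) as fh:
        return json.load(fh)

runs = [load(f"findings_{i}.json") for i in range(1, 4)]  # runs = 3
merged = {(f["file"], f["line"]) for run in runs for f in run}

report = {
    "runs": len(runs),
    "merged_findings_count": len(merged),
    "single_run_avg_count": round(sum(len(r) for r in runs) / len(runs), 1),
    "ground_truth_provided": False,
}
try:
    expected = {(e["file"], e["line"]) for e in load("ground_truth.json")["expected"]}
except FileNotFoundError:
    expected = None

if expected is not None:
    tp = len(expected & merged)
    fp = len(merged - expected)
    fn = len(expected - merged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    report.update(ground_truth_provided=True, precision=round(precision, 2),
                  recall=round(recall, 2), f1=round(f1, 2))

print(json.dumps(report, indent=2))
```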
Purpose: Infer the context level (diff, file, repo) required for each finding. Heuristic only; supports analysis of where models need more context.
Prerequisites: Findings JSON. Heuristic may be inaccurate; document limitations.
Inputs:
- Findings array; optionally the diff/hunk metadata (which files were in the diff).
Algorithm:
- For each finding: if `file` is not in the list of files changed in the diff, tag as `repo` (cross-file). Else if the finding references symbols or lines outside the changed hunk (requires parsing message or suggestion), tag as `file`. Else tag as `diff`.
- Simpler heuristic: all findings in changed files → `diff`; findings in other files → `repo`. Default `file` if unclear.
- Output counts per context level.
Output format:
```json
{
  "by_context_level": {
    "diff": 10,
    "file": 3,
    "repo": 2
  },
  "limitation": "Heuristic; actual context level may differ"
}
```

Pass/fail: Report-only.
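A sketch of the simpler heuristic; `changed_files.json` is a hypothetical sidecar listing the files in the diff (e.g., produced from `git diff --name-only`):

```python
# Sketch only: findings in changed files -> "diff", findings elsewhere -> "repo".
import json
from collections import Counter

with open("findings.json") as fh:
    findings = json.load(fh)["findings"]
with open("changed_files.json") as fh:   # hypothetical: files changed in the diff
    changed_files = set(json.load(fh))

by_level = Counter("diff" if f["file"] in changed_files else "repo" for f in findings)
print(json.dumps({
    "by_context_level": dict(by_level),
    "limitation": "Heuristic; actual context level may differ",
}, indent=2))
```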
Purpose: Compare runs with RAG (symbol definitions) enabled vs disabled. Measures impact of RAG on finding count and categories.
Prerequisites: Repo with code that has symbols (e.g., Go, TypeScript); Ollama running.
Inputs:
- `repo_root`, `ref`, `model`.
Algorithm:
- Run A: `stet start ref --json --rag-symbol-max-definitions=0` (or `STET_RAG_SYMBOL_MAX_DEFINITIONS=0`). Save findings to `findings_no_rag.json`; `stet finish`.
- Run B: `stet start ref --json` (default RAG, 10 definitions). Save findings to `findings_with_rag.json`.
- Compare: count difference; category distribution difference; optional overlap analysis.
Output format:
```json
{
  "without_rag_count": 8,
  "with_rag_count": 10,
  "category_diff": {
    "correctness": {"without": 2, "with": 4},
    "maintainability": {"without": 5, "with": 5}
  }
}
```

Pass/fail: Report-only.
Each human test includes experiment design, protocol, recording instructions, and analysis steps.
Purpose: Measure precision (share of findings that are actionable) per model, with model identity hidden to reduce bias.
Prerequisites: Findings from two or more models (or runs) on the same diffs; ability to shuffle and anonymize.
Experiment design:
- Select 50–100 findings per model from runs on the same `baseline..HEAD`.
- Shuffle all findings; remove model identifier; assign each a random ID (e.g., F001, F002).
- Single human (or multiple; compute inter-rater agreement if so) labels each finding.
Step-by-step protocol:
- Export findings from each model run to CSV: `id`, `file`, `line`, `message`, `suggestion`, `severity`, `category`.
- Combine CSVs; add column `anon_id`; remove model column; shuffle rows.
- For each finding, open the file at the line and read the code context.
- Label: `actionable` | `false_positive` | `wrong_suggestion` | `out_of_scope`. Add optional `notes`.
- Record labels in a spreadsheet with `anon_id` and label.
- After all labels collected, map `anon_id` back to model (using a separate key file).
What to record: anon_id, label, notes, time_spent_seconds (optional).
Sample size guidance: 50–100 findings per model for a meaningful precision estimate. Fewer if only comparing two models; more if confidence intervals are desired.
Analysis:
- Precision per model = (count labeled `actionable`) / (total labeled).
- Per-reason breakdown: count of `false_positive`, `wrong_suggestion`, `out_of_scope`.
- If multiple raters: compute agreement (e.g., Cohen's kappa) before aggregating.
Artifacts: CSV with anon_id, label, notes; summary report with precision per model.
Purpose: Create ground truth for precision, recall, and F1. Diffs with known issues; human labels which issues exist; compare Stet output to labels.
Prerequisites: Ability to create or select diffs with known defects (bugs, style issues, security issues).
Experiment design:
- Create 10–30 diffs (or select from real PRs) where you know the true positives: "This diff introduces bug X at file:line" or "This diff has style issue Y."
- For each diff, produce a ground-truth JSON: `[{"file": "...", "line": N, "issue_type": "bug"|"style"|..., "description": "..."}]`.
- Run Stet on each diff; human matches Stet findings to ground truth (TP/FP/FN).
Step-by-step protocol:
- Create or select diff; document expected issues in ground-truth JSON.
- Run `stet start ref --json` (or equivalent for that diff); save findings.
- For each Stet finding: is it a TP (matches a ground-truth issue), FP (does not match), or is it a new valid issue? If new valid issue, add to ground truth and treat as TP.
- For each ground-truth issue: was it found by Stet? If not, FN.
- Record TP, FP, FN per diff; aggregate.
What to record: Per diff: diff_id, TP, FP, FN; optionally per-finding match details.
Sample size guidance: 10–30 diffs; 2–10 issues per diff. Balance coverage of categories (security, correctness, style).
Analysis:
- Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2 * P * R / (P + R).
- Report per diff and aggregate; optionally per category.
Artifacts: Ground-truth JSON files; findings JSON per diff; summary with P, R, F1.
Purpose: Use Stet on the Stet repo; triage findings with stet dismiss; derive actionability from history. Real project, real usage.
Prerequisites: Stet repo; Ollama with model; familiarity with the codebase.
Experiment design:
- Run Stet on recent commits (e.g., `stet start HEAD~5`).
- Triage every finding: either fix the issue or `stet dismiss <id> <reason>`.
- Finish session; analyze history.
Step-by-step protocol:
- `stet start HEAD~5` (or chosen ref).
- For each finding: read code; decide: fix in code and commit, or `stet dismiss <id> <reason>` with one of `false_positive`, `already_correct`, `wrong_suggestion`, `out_of_scope`.
- Re-run `stet run` after fixes; repeat until all findings triaged.
- `stet finish`.
- Run automated test A2 on `.review/history.jsonl` to get actionability and reason breakdown.
- Optionally: add recurring patterns to the review-quality.md curated false-positive table.
What to record: Dismissal reasons; notes on any new false-positive patterns.
Sample size guidance: One full review session; aim for 20+ findings to get meaningful actionability.
Analysis: Actionability rate; per-reason counts; qualitative notes on patterns.
Artifacts: Updated history.jsonl; optional updates to review-quality.md.
Purpose: Compare Stet to another LLM-powered review tool (e.g., RoboRev, Graphite) on the same diffs. Measure overlap and unique value per tool.
Prerequisites: Stet and at least one other tool; same diffs run through both; ability to normalize outputs (file, line, message).
Experiment design:
- Select 5–15 diffs with non-trivial changes.
- Run Stet; export findings.
- Run other tool on same diffs; export findings.
- Normalize to common schema (file, line, message).
- A human labels each matched pair as overlap (both tools found a similar issue) and each unmatched finding as unique (only one tool found it); label unique findings as valid or invalid.
Step-by-step protocol:
- For each diff: run Stet, save findings; run other tool, save findings.
- Normalize outputs to (file, line, message) or equivalent.
- Match findings across tools: same file:line and similar message → overlap.
- For unique findings (only Stet or only other tool): human labels valid/invalid.
- Compute: overlap count; unique-Stet count (and how many valid); unique-other count (and how many valid).
What to record: Per finding: tool, file, line, message, overlap_with (other finding id or none), unique_valid (yes/no).
Sample size guidance: 5–15 diffs; 5–30 findings per tool per diff.
Analysis: Overlap rate; precision of unique findings per tool; qualitative comparison.
Artifacts: Normalized findings CSV; overlap matrix; summary report.
Purpose: Check if an LLM can reliably label findings (actionable vs not) in agreement with humans. If yes, LLM-as-judge can scale human evaluation.
Prerequisites: Subset of findings with human labels (e.g., from H1); access to an LLM API (e.g., Claude, GPT) for judging.
Experiment design:
- Take 50–100 findings with human labels from H1 (or similar).
- Send each finding (file, line, message, code snippet) to judge LLM with prompt: "Is this code review finding actionable? Respond: actionable | false_positive | wrong_suggestion | out_of_scope."
- Compare judge labels to human labels.
Step-by-step protocol:
- Export human-labeled findings with labels.
- For each finding: construct prompt with file, line, message, and code context (e.g., 5 lines before/after).
- Call judge LLM; parse response into one of the four labels.
- Record judge label and human label.
- Compute agreement: exact match rate; optionally Cohen's kappa.
What to record: finding_id, human_label, judge_label, match (boolean).
Sample size guidance: 50–100 findings. If agreement < 85%, do not use judge alone for evaluation.
Analysis: Agreement rate; confusion matrix (human vs judge); per-reason accuracy.
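A small sketch of the agreement computation over (human_label, judge_label) pairs; Cohen's kappa follows the standard multi-class formula, and a stats library works equally well:

```python
# Sketch only: exact-match rate, Cohen's kappa, and a flat confusion count.
from collections import Counter

LABELS = ["actionable", "false_positive", "wrong_suggestion", "out_of_scope"]

def agreement(pairs):
    n = len(pairs)
    observed = sum(h == j for h, j in pairs) / n
    human = Counter(h for h, _ in pairs)
    judge = Counter(j for _, j in pairs)
    expected = sum(human[l] * judge[l] for l in LABELS) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    confusion = Counter(pairs)
    return {
        "exact_match_rate": round(observed, 3),
        "cohens_kappa": round(kappa, 3),
        "confusion": {f"{h} -> {j}": c for (h, j), c in confusion.items()},
    }

# Example with three labeled findings (hypothetical data).
print(agreement([("actionable", "actionable"),
                 ("false_positive", "actionable"),
                 ("out_of_scope", "out_of_scope")]))
```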
Artifacts: Comparison CSV; agreement report.
Purpose: Identify new false-positive patterns from Stet runs and add them to the curated table in review-quality.md for prompt shadowing and optimizer.
Prerequisites: Stet runs that produced findings; access to review-quality.md.
Experiment design:
- Run Stet on one or more repos; collect findings that were dismissed as false_positive or wrong_suggestion.
- Cluster by message pattern or category; identify recurring patterns.
- Add new patterns to the curated table with category, message_pattern, reason, note.
Step-by-step protocol:
- Run Stet; triage findings; record dismissals with reasons.
- Filter to `false_positive` and `wrong_suggestion`.
- Group by similar message (e.g., substring or keyword).
- For each group: decide if it merits a curated entry. If yes, add to review-quality.md table: category, message_pattern, reason, note.
- Follow schema in review-quality.md (see "Known false positive patterns" and "Schema for false positive entries").
What to record: Pattern, reason, example finding, note.
Sample size guidance: Continue until no new patterns emerge from last N sessions (e.g., 5–10).
Analysis: Count of new patterns added; optional reduction in similar future false positives.
Artifacts: Updated review-quality.md.
Purpose: Score the quality of suggested fixes: correct/safe, partial, wrong, or harmful. Complements precision of the finding itself.
Prerequisites: Findings with suggestion field; human can evaluate code changes.
Experiment design:
- Sample 30–50 findings that have a non-empty suggestion.
- Human evaluates each suggestion in context: would applying it fix the issue correctly, partially, or make things worse?
Step-by-step protocol:
- Export findings with `suggestion`; filter to non-empty.
- For each: read file, line, message, suggestion.
- Label: `correct_safe` | `partial` | `wrong` | `harmful`. Optionally add note.
- Record in spreadsheet.
What to record: finding_id, suggestion_quality, notes.
Sample size guidance: 30–50 findings with suggestions.
Analysis: Distribution of quality; percentage correct_safe; percentage harmful (critical to minimize).
Artifacts: CSV; summary report.
Purpose: Check if Stet's severity (error, warning, info, nitpick) matches human expectation. High misclassification erodes trust.
Prerequisites: Sample of findings; human can judge appropriate severity.
Experiment design:
- Take 30–50 findings across severities.
- Human labels: for each, is the assigned severity correct, or should it be higher/lower?
Step-by-step protocol:
- Export findings with severity.
- For each: read finding and code context.
- Label: `correct` | `too_high` | `too_low`. Optionally suggest correct severity.
- Record.
What to record: finding_id, assigned_severity, human_verdict, suggested_severity (optional).
Sample size guidance: 30–50 findings; strive for a mix of severities.
Analysis: Misclassification rate (too_high + too_low); confusion matrix.
Artifacts: CSV; summary report.
Purpose: Compare models by real-world usage: satisfaction, time-to-triage, perceived usefulness over days of use.
Prerequisites: Two models to compare; developer(s) willing to use each for a period.
Experiment design:
- Use model A for N days (e.g., 5–7); use model B for N days. Counterbalance order (half use A first, half B first if multiple users).
- Track: time spent triaging per session; number of actionable fixes applied; qualitative preference.
Step-by-step protocol:
- Define period length (e.g., 5 days per model).
- Use Stet with model A exclusively for period 1; record after each session: findings count, dismissals, fixes applied, time spent (minutes).
- Switch to model B for period 2; same recording.
- Survey: which model did you prefer? Why? What was different?
- Aggregate metrics; compare.
What to record: Per session: model, findings_count, dismissals_count, fixes_applied, time_minutes. Final: preference, free-form feedback.
Sample size guidance: At least 2–3 sessions per model per user; multiple users improve confidence.
Analysis: Mean time per session; mean actionable rate; preference count; qualitative themes.
Artifacts: Session log; survey responses; summary report.
| ID | Name | Type | Description |
|---|---|---|---|
| A1 | Same-diff model swap | Automated | Compare two models on same diff; counts, overlap, unique |
| A2 | Actionability from history | Automated | Parse history.jsonl; actionability rate, reason breakdown |
| A3 | Latency and throughput | Automated | Wall-clock time, hunks/sec |
| A4 | Repeatability | Automated | Same model N runs; Jaccard similarity |
| A5 | Category/severity distribution | Automated | Counts by category and severity |
| A6 | Finding-ID stability | Automated | Assert IDs stable across runs |
| A7 | Dry-run regression | Automated | Schema validation; CI without Ollama |
| A8 | Multi-run aggregation | Automated | Merge N runs; optional ground-truth precision/recall |
| A9 | Context-level tagging | Automated | Heuristic diff/file/repo tagging |
| A10 | RAG ablation | Automated | Compare RAG on vs off |
| H1 | Blind triage | Human | Label findings; precision per model |
| H2 | Fixture benchmark | Human | Ground truth; precision, recall, F1 |
| H3 | Self-review dogfood | Human | Triage on Stet repo; actionability from history |
| H4 | Cross-tool comparison | Human | Stet vs other tool; overlap, unique value |
| H5 | LLM-as-judge calibration | Human | Compare judge LLM to human labels |
| H6 | Curated FP audit | Human | Add patterns to review-quality.md |
| H7 | Suggestion quality | Human | Score suggestion correctness |
| H8 | Severity calibration | Human | Check severity matches expectation |
| H9 | User preference A/B | Human | Compare models over days of use |