Status: Living document
Parent PRD: docs/PRD.md
This document defines two categories of tests for evaluating Stet's code review quality, model performance, and user experience:
- Automated tests — Scriptable tests that a coding LLM can implement without human judgment. Each entry includes inputs, outputs, schemas, CLI invocations, and implementation steps.
- Human tests — Experiments that require human judgment, labeling, or subjective assessment. Each entry includes experiment design, a step-by-step protocol, sample size guidance, and analysis steps.
- Output format: Use `--output=json` or `--json` for machine-parseable JSON on stdout. See cli-extension-contract.md.
- Streaming: Use `--stream` with `--json` for NDJSON events (`progress`, `finding`, `done`).
- Exit codes: 0 = success, 1 = usage/error, 2 = Ollama unreachable.
- JSON shape: `{"findings": [...]}`; each finding has `id`, `file`, `line`, `range`, `severity`, `category`, `confidence`, `message`, `suggestion`, `cursor_uri`.
- Model: `STET_MODEL` or `.review/config.toml` `model = "..."`. Default: `qwen3-coder:30b`. See cli/internal/config/config.go.
- RAG: `STET_RAG_SYMBOL_MAX_DEFINITIONS` (default 10; 0 = disable) or `--rag-symbol-max-definitions=0` on `stet start`/`stet run`.
- Temperature: `STET_TEMPERATURE` (default 0.2); use 0 for deterministic runs.
- Finding: cli/internal/findings/finding.go — `id`, `file`, `line`, `range`, `severity`, `category`, `confidence`, `message`, `suggestion`, `cursor_uri`.
- History Record: cli/internal/history/schema.go — `diff_ref`, `review_output`, `user_action.dismissed_ids`, `user_action.dismissals[]` with `finding_id` and `reason`.
- Dismissal reasons: `false_positive`, `already_correct`, `wrong_suggestion`, `out_of_scope`.
- `.review/session.json` — Session state (baseline, findings, dismissed_ids).
- `.review/history.jsonl` — One JSON object per line; appended on dismiss and finish.
- `stet start [ref]` — Start review; default ref is HEAD.
- `stet run` — Incremental re-review.
- `stet finish` — End session, remove worktree.
- `stet dismiss <id> [reason]` — Mark finding as dismissed with optional reason.
- `stet status --ids` — List active finding IDs.
- `stet list` — Same as `status --ids` for active findings.
Each automated test is specified so a coding LLM can implement it. Run all commands from the repository root.
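Most of the automated tests below share the same pattern: run a `stet` command, parse the JSON printed on stdout, and post-process the `findings` array. A minimal Python helper sketch, assuming the CLI contract summarized above (the function names here are illustrative, not part of Stet):

```python
# Illustrative helpers only; they assume the CLI contract summarized above
# (stet start <ref> --json prints {"findings": [...]} on stdout, exit 0 on success).
import json
import os
import subprocess

def run_stet_start(ref: str, model: str | None = None, extra_args: list[str] | None = None) -> list[dict]:
    """Run `stet start <ref> --json` and return the parsed findings array."""
    env = dict(os.environ)
    if model:
        env["STET_MODEL"] = model
    cmd = ["stet", "start", ref, "--json", *(extra_args or [])]
    result = subprocess.run(cmd, capture_output=True, text=True, env=env, check=True)
    return json.loads(result.stdout)["findings"]

def stet_finish() -> None:
    """End any active session; tolerate the case where no session exists."""
    subprocess.run(["stet", "finish"], capture_output=True, text=True)
```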
Purpose: Compare two models on the same diff range. Measures finding counts, overlap (exact and fuzzy), and unique findings per model. Used to decide which model to adopt (e.g., qwen3-coder:30b vs qwen2.5-coder:32b).
Prerequisites: Repo has committed changes; clean worktree; both models pulled in Ollama; stet in PATH.
Inputs:
- `repo_root`: Path to Git repo.
- `ref`: Baseline ref (e.g., `HEAD~5`).
- `model_a`, `model_b`: Ollama model names (e.g., `qwen2.5-coder:32b`, `qwen3-coder:30b`).
Algorithm:
- `cd repo_root`.
- If session exists (e.g., `stet status` exits 0), run `stet finish`. If `stet status` exits 1 with "No active session", skip.
- `export STET_MODEL=model_a`; run `stet start ref --json`; capture stdout; parse JSON; extract `findings`; save to `findings_a.json`. Redirect stderr to /dev/null or a log.
- `stet finish`.
- `export STET_MODEL=model_b`; run `stet start ref --json`; capture stdout; parse JSON; extract `findings`; save to `findings_b.json`.
- Compute: `count_a`, `count_b`; `overlap_exact` = findings with same `file` and `line` (or `range`) in both; `overlap_fuzzy` = findings with same `file` and similar `message` (e.g., normalized substring or cosine); `unique_a`, `unique_b` = findings only in A or B.
- Output comparison JSON.
Output format:
```json
{
  "model_a": "qwen2.5-coder:32b",
  "model_b": "qwen3-coder:30b",
  "ref": "HEAD~5",
  "count_a": 12,
  "count_b": 10,
  "overlap_exact": 5,
  "overlap_fuzzy": 6,
  "unique_a": 6,
  "unique_b": 4
}
```

Pass/fail: Report-only.
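A minimal sketch of the comparison step, assuming `findings_a.json` and `findings_b.json` each hold one model's findings array as saved above; the similarity measure and threshold for fuzzy matching are illustrative choices:

```python
# Sketch only: exact overlap by (file, line), fuzzy overlap by same file plus
# similar normalized message. Threshold and similarity measure are illustrative.
import json
from difflib import SequenceMatcher

def load(path):
    with open(path) as fh:
        return json.load(fh)

def norm(msg):
    return " ".join(msg.lower().split())

def compare(a, b, threshold=0.8):
    keys_a = {(f["file"], f["line"]) for f in a}
    keys_b = {(f["file"], f["line"]) for f in b}
    fuzzy = sum(
        any(x["file"] == y["file"] and
            SequenceMatcher(None, norm(x["message"]), norm(y["message"])).ratio() >= threshold
            for y in b)
        for x in a
    )
    return {
        "count_a": len(a), "count_b": len(b),
        "overlap_exact": len(keys_a & keys_b),
        "overlap_fuzzy": fuzzy,
        "unique_a": len(keys_a - keys_b),
        "unique_b": len(keys_b - keys_a),
    }

print(json.dumps(compare(load("findings_a.json"), load("findings_b.json")), indent=2))
```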
Purpose: Compute actionability rate and per-reason breakdown from .review/history.jsonl. Actionability = share of findings not dismissed. Used to track how often users find findings useful over time.
Prerequisites: .review/history.jsonl exists and has records from finished sessions.
Inputs:
- `state_dir`: Path to `.review/` (default: `repo_root/.review`).
- `history_path`: `state_dir/history.jsonl`.
Algorithm:
- Read `history_path` line by line.
- For each line: parse JSON as a history Record. Extract `review_output` (array of findings) and `user_action.dismissed_ids` (array of strings). Optionally extract `user_action.dismissals[]` for per-finding `reason`.
- For each Record: `total = len(review_output)`; `dismissed = len(user_action.dismissed_ids)`; `actionability = 1 - dismissed/total` when total > 0.
- Aggregate: mean actionability, total findings, total dismissed; per-reason counts from `dismissals[].reason` (false_positive, already_correct, wrong_suggestion, out_of_scope).
- Note: `history.jsonl` does not store the model name; to compare models, tag sessions externally (e.g., log the model in a separate file keyed by `diff_ref` or timestamp).
Output format:
```json
{
  "records_processed": 20,
  "total_findings": 150,
  "total_dismissed": 45,
  "actionability_rate": 0.7,
  "reason_breakdown": {
    "false_positive": 25,
    "already_correct": 10,
    "wrong_suggestion": 7,
    "out_of_scope": 3
  }
}
```

Pass/fail: Report-only.
The `stet stats quality` command implements this; use it for ongoing tracking.
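A sketch of the aggregation, assuming each line of `history.jsonl` matches the Record fields listed in the schema notes above:

```python
# Sketch only: mean per-record actionability plus a per-reason breakdown.
import json
from collections import Counter

def actionability_report(history_path=".review/history.jsonl"):
    records = 0
    rates = []
    total_findings = total_dismissed = 0
    reasons = Counter()
    with open(history_path) as fh:
        for line in fh:
            if not line.strip():
                continue
            rec = json.loads(line)
            records += 1
            total = len(rec.get("review_output", []))
            action = rec.get("user_action", {})
            dismissed = len(action.get("dismissed_ids", []))
            if total > 0:
                rates.append(1 - dismissed / total)
            total_findings += total
            total_dismissed += dismissed
            for d in action.get("dismissals", []):
                reasons[d.get("reason", "unknown")] += 1
    return {
        "records_processed": records,
        "total_findings": total_findings,
        "total_dismissed": total_dismissed,
        "actionability_rate": round(sum(rates) / len(rates), 2) if rates else None,
        "reason_breakdown": dict(reasons),
    }

print(json.dumps(actionability_report(), indent=2))
```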
Purpose: Measure wall-clock time per run and optionally per-hunk. Used to compare model speed and estimate review duration.
Prerequisites: Repo with non-empty diff; Ollama running; model pulled.
Inputs:
- `repo_root`, `ref` (e.g., `HEAD~3`).
- `model`: Ollama model name.
Algorithm:
- `cd repo_root`; ensure clean session (`stet finish` if needed).
- Run `stet start ref --json` with `STET_MODEL=model`; capture stderr and measure wall-clock (e.g., `time` command or process start/end timestamps).
- Parse stderr for lines matching "Reviewing hunk N/M" to infer total hunks M and per-hunk timing if timestamps are available. Alternatively, use `--stream` and parse NDJSON; record timestamp of first `finding` and last `done` to compute duration.
- Compute: `total_seconds`, `hunks_reviewed` (from progress messages or finding count proxy), `seconds_per_hunk = total_seconds / hunks_reviewed` when hunks > 0.
Output format:
```json
{
  "model": "qwen3-coder:30b",
  "ref": "HEAD~3",
  "total_seconds": 120.5,
  "hunks_reviewed": 8,
  "seconds_per_hunk": 15.06
}
```

Pass/fail: Report-only.
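A timing sketch; the hunk-count extraction assumes the "Reviewing hunk N/M" progress lines described above appear on stderr, and falls back to the finding count otherwise:

```python
# Sketch only: wall-clock timing around a single review run.
import json, os, re, subprocess, time

env = dict(os.environ, STET_MODEL="qwen3-coder:30b")  # model under test
start = time.monotonic()
result = subprocess.run(["stet", "start", "HEAD~3", "--json"],
                        capture_output=True, text=True, env=env, check=True)
total_seconds = time.monotonic() - start

findings = json.loads(result.stdout)["findings"]
hunk_totals = re.findall(r"Reviewing hunk \d+/(\d+)", result.stderr)
hunks = int(hunk_totals[-1]) if hunk_totals else len(findings)  # finding count as a rough proxy
print(json.dumps({
    "model": env["STET_MODEL"],
    "ref": "HEAD~3",
    "total_seconds": round(total_seconds, 1),
    "hunks_reviewed": hunks,
    "seconds_per_hunk": round(total_seconds / hunks, 2) if hunks else None,
}, indent=2))
```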
Purpose: Measure consistency of findings across multiple runs of the same model on the same diff. Low consistency suggests high variance; useful before comparing models.
Prerequisites: Repo with committed diff; Ollama running; model pulled.
Inputs:
- `repo_root`, `ref`.
- `model`: Ollama model name.
- `runs`: Number of runs (e.g., 3).
- Use `STET_TEMPERATURE=0` for deterministic sampling.
Algorithm:
- For i in 1..runs: `stet finish` if needed; `STET_MODEL=model STET_TEMPERATURE=0 stet start ref --json`; parse stdout; save findings to `findings_i.json`; `stet finish`.
- Build sets of finding keys: e.g., `(file, line, message_normalized)` or `id` if stable.
- Compute Jaccard similarity between each pair of runs: |A ∩ B| / |A ∪ B|. Report mean and min Jaccard across pairs.
- Optionally: count findings that appear in all runs vs only some runs.
Output format:
```json
{
  "model": "qwen3-coder:30b",
  "runs": 3,
  "jaccard_mean": 0.85,
  "jaccard_min": 0.78,
  "findings_in_all_runs": 6,
  "findings_in_some_runs": 4
}
```

Pass/fail: Report-only.
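A Jaccard sketch over per-run key sets, assuming the per-run findings were saved to `findings_1.json` through `findings_3.json` as described above:

```python
# Sketch only: keys are (file, line, normalized message); Jaccard per run pair.
import json
from itertools import combinations

def run_keys(path):
    with open(path) as fh:
        findings = json.load(fh)
    return {(f["file"], f["line"], " ".join(f["message"].lower().split())) for f in findings}

runs = [run_keys(f"findings_{i}.json") for i in range(1, 4)]  # runs = 3
pairwise = [len(a & b) / len(a | b) if a | b else 1.0 for a, b in combinations(runs, 2)]
in_all = set.intersection(*runs)
in_any = set.union(*runs)
print(json.dumps({
    "runs": len(runs),
    "jaccard_mean": round(sum(pairwise) / len(pairwise), 2),
    "jaccard_min": round(min(pairwise), 2),
    "findings_in_all_runs": len(in_all),
    "findings_in_some_runs": len(in_any - in_all),
}, indent=2))
```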
Purpose: Compare the distribution of findings by category and severity across runs or models. Identifies if a model over-flags certain categories (e.g., maintainability) or under-flags others (e.g., security).
Prerequisites: Findings JSON from one or more runs (e.g., from stet start --json or saved output).
Inputs:
- One or more JSON files or objects with a `findings` array.
Algorithm:
- Parse each findings array.
- For each finding: increment counter for `finding.category` and `finding.severity`.
- Output counts per category and per severity; optionally normalized percentages.
Output format:
```json
{
  "source": "findings.json",
  "by_category": {
    "security": 2,
    "correctness": 5,
    "maintainability": 8,
    "best_practice": 3
  },
  "by_severity": {
    "error": 1,
    "warning": 6,
    "info": 10,
    "nitpick": 1
  }
}
```

Pass/fail: Report-only.
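A tally sketch, assuming `findings.json` holds either the full `{"findings": [...]}` object or a bare findings array:

```python
# Sketch only: counts per category and per severity.
import json
from collections import Counter

with open("findings.json") as fh:
    data = json.load(fh)
findings = data["findings"] if isinstance(data, dict) else data

print(json.dumps({
    "source": "findings.json",
    "by_category": dict(Counter(f.get("category", "unknown") for f in findings)),
    "by_severity": dict(Counter(f.get("severity", "unknown") for f in findings)),
}, indent=2))
```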
Purpose: Verify that the same hunk produces the same finding ID across runs. Important for dismissals and history to remain stable.
Prerequisites: Repo with committed diff; STET_TEMPERATURE=0 for deterministic output.
Inputs:
- `repo_root`, `ref`, `model`.
Algorithm:
- Run `stet start ref --json` twice with same model and temperature 0; `stet finish` between runs.
- Parse both outputs; build maps `file:line -> [(id, message), ...]` (or use `id` as key if one finding per location).
- For matching file:line (and optionally message): assert `id` is identical. Report any mismatches.
Output format:
```json
{
  "stable": true,
  "mismatches": [],
  "total_findings_run1": 5,
  "total_findings_run2": 5
}
```

Pass/fail: Fail if any `id` differs for the same file:line+message.
Purpose: Ensure --dry-run produces valid JSON conforming to the findings schema. Used in CI when Ollama is not available.
Prerequisites: Repo with at least one hunk in diff; stet in PATH. Ollama not required.
Inputs:
- `repo_root`, `ref` (e.g., `HEAD~1`).
Algorithm:
- `stet finish` if session exists.
- Run `stet start ref --dry-run --json`; capture stdout; expect exit 0.
- Parse JSON; assert top-level has `findings` array.
- For each finding: assert required fields present (`file`, `severity`, `category`, `confidence`, `message`). Assert `severity` in allowed set; `category` in allowed set; `confidence` in [0, 1].
- If diff has hunks: assert `findings` is non-empty (dry-run typically emits one finding per hunk).
Output format: Pass/fail plus optional schema validation report.
Pass/fail: Fail if JSON invalid, schema violated, or findings empty when hunks exist.
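A validation sketch; the allowed severity and category sets below mirror the values used in the examples in this document and should be confirmed against the Finding schema:

```python
# Sketch only: exits non-zero on any violation so a CI job fails. Assumes the
# chosen ref produces at least one hunk.
import json, subprocess, sys

SEVERITIES = {"error", "warning", "info", "nitpick"}          # per examples in this doc
CATEGORIES = {"security", "correctness", "maintainability", "best_practice"}

result = subprocess.run(["stet", "start", "HEAD~1", "--dry-run", "--json"],
                        capture_output=True, text=True)
if result.returncode != 0:
    sys.exit(f"stet exited {result.returncode}: {result.stderr}")

findings = json.loads(result.stdout)["findings"]
errors = []
for i, f in enumerate(findings):
    for field in ("file", "severity", "category", "confidence", "message"):
        if field not in f:
            errors.append(f"finding {i}: missing {field}")
    if f.get("severity") not in SEVERITIES:
        errors.append(f"finding {i}: unexpected severity {f.get('severity')!r}")
    if f.get("category") not in CATEGORIES:
        errors.append(f"finding {i}: unexpected category {f.get('category')!r}")
    if not 0 <= f.get("confidence", -1) <= 1:
        errors.append(f"finding {i}: confidence outside [0, 1]")
if not findings:
    errors.append("findings is empty (expected non-empty when the diff has hunks)")
if errors:
    sys.exit("\n".join(errors))
print("dry-run schema check passed")
```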
Purpose: Run the model N times and merge findings (union by file:line or by semantic similarity). Evaluates whether aggregation boosts recall (as in SWR-Bench). Optionally compare against ground-truth for precision/recall.
Prerequisites: Repo with diff; Ollama running; optional: ground-truth JSON with expected findings (file, line, message or id).
Inputs:
- `repo_root`, `ref`, `model`, `runs` (e.g., 3).
- Optional: `ground_truth.json` with `{"expected": [{"file": "...", "line": N, "message": "..."}]}`.
Algorithm:
- Run `stet start ref --json` N times with same model; collect all findings.
- Merge: union by `(file, line)` or by normalized message similarity to deduplicate.
- If ground truth provided: match each expected item to merged findings (file:line match or message similarity); compute TP, FP, FN; precision = TP/(TP+FP), recall = TP/(TP+FN), F1.
- Output merged count, and if ground truth: precision, recall, F1.
Output format:
```json
{
  "runs": 3,
  "merged_findings_count": 15,
  "single_run_avg_count": 10,
  "ground_truth_provided": true,
  "precision": 0.8,
  "recall": 0.75,
  "f1": 0.77
}
```

Pass/fail: Report-only; optional fail if F1 below threshold when ground truth provided.
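A merge-and-score sketch using exact (file, line) matching; message-similarity matching is a straightforward extension. The per-run file names are assumptions; `ground_truth.json` is the optional input described above:

```python
# Sketch only: union of N runs plus optional precision/recall/F1 against ground truth.
import json

def load(path):
    with open(path) as fh:
        return json.load(fh)

runs = [load(f"findings_{i}.json") for i in range(1, 4)]  # runs = 3
merged = {(f["file"], f["line"]) for run in runs for f in run}

report = {
    "runs": len(runs),
    "merged_findings_count": len(merged),
    "single_run_avg_count": round(sum(len(r) for r in runs) / len(runs), 1),
    "ground_truth_provided": False,
}
try:
    expected = {(e["file"], e["line"]) for e in load("ground_truth.json")["expected"]}
except FileNotFoundError:
    expected = None

if expected is not None:
    tp = len(expected & merged)
    fp = len(merged - expected)
    fn = len(expected - merged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    report.update(ground_truth_provided=True, precision=round(precision, 2),
                  recall=round(recall, 2), f1=round(f1, 2))

print(json.dumps(report, indent=2))
```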
Purpose: Infer the context level (diff, file, repo) required for each finding. Heuristic only; supports analysis of where models need more context.
Prerequisites: Findings JSON. Heuristic may be inaccurate; document limitations.
Inputs:
- Findings array; optionally the diff/hunk metadata (which files were in the diff).
Algorithm:
- For each finding: if `file` is not in the list of files changed in the diff, tag as `repo` (cross-file). Else if the finding references symbols or lines outside the changed hunk (requires parsing message or suggestion), tag as `file`. Else tag as `diff`.
- Simpler heuristic: all findings in changed files → `diff`; findings in other files → `repo`. Default `file` if unclear.
- Output counts per context level.
Output format:
```json
{
  "by_context_level": {
    "diff": 10,
    "file": 3,
    "repo": 2
  },
  "limitation": "Heuristic; actual context level may differ"
}
```

Pass/fail: Report-only.
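A sketch of the simpler heuristic; `changed_files.json` is a hypothetical sidecar listing the files in the diff (e.g., produced from `git diff --name-only`):

```python
# Sketch only: findings in changed files -> "diff", findings elsewhere -> "repo".
import json
from collections import Counter

with open("findings.json") as fh:
    findings = json.load(fh)["findings"]
with open("changed_files.json") as fh:   # hypothetical: files changed in the diff
    changed_files = set(json.load(fh))

by_level = Counter("diff" if f["file"] in changed_files else "repo" for f in findings)
print(json.dumps({
    "by_context_level": dict(by_level),
    "limitation": "Heuristic; actual context level may differ",
}, indent=2))
```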
Purpose: Compare runs with RAG (symbol definitions) enabled vs disabled. Measures impact of RAG on finding count and categories.
Prerequisites: Repo with code that has symbols (e.g., Go, TypeScript); Ollama running.
Inputs:
- `repo_root`, `ref`, `model`.
Algorithm:
- Run A: `stet start ref --json --rag-symbol-max-definitions=0` (or `STET_RAG_SYMBOL_MAX_DEFINITIONS=0`). Save findings to `findings_no_rag.json`; `stet finish`.
- Run B: `stet start ref --json` (default RAG, 10 definitions). Save findings to `findings_with_rag.json`.
- Compare: count difference; category distribution difference; optional overlap analysis.
Output format:
```json
{
  "without_rag_count": 8,
  "with_rag_count": 10,
  "category_diff": {
    "correctness": {"without": 2, "with": 4},
    "maintainability": {"without": 5, "with": 5}
  }
}
```

Pass/fail: Report-only.
Each human test includes experiment design, protocol, recording instructions, and analysis steps.
Purpose: Measure precision (share of findings that are actionable) per model, with model identity hidden to reduce bias.
Prerequisites: Findings from two or more models (or runs) on the same diffs; ability to shuffle and anonymize.
Experiment design:
- Select 50–100 findings per model from runs on the same `baseline..HEAD`.
- Shuffle all findings; remove model identifier; assign each a random ID (e.g., F001, F002).
- Single human (or multiple; compute inter-rater agreement if so) labels each finding.
Step-by-step protocol:
- Export findings from each model run to CSV: `id`, `file`, `line`, `message`, `suggestion`, `severity`, `category`.
- Combine CSVs; add column `anon_id`; remove model column; shuffle rows.
- For each finding, open the file at the line and read the code context.
- Label: `actionable` | `false_positive` | `wrong_suggestion` | `out_of_scope`. Add optional `notes`.
- Record labels in a spreadsheet with `anon_id` and label.
- After all labels collected, map `anon_id` back to model (using a separate key file).
What to record: anon_id, label, notes, time_spent_seconds (optional).
Sample size guidance: 50–100 findings per model for a meaningful precision estimate. Fewer if only comparing two models; more if confidence intervals are desired.
Analysis:
- Precision per model = (count labeled `actionable`) / (total labeled).
- Per-reason breakdown: count of `false_positive`, `wrong_suggestion`, `out_of_scope`.
- If multiple raters: compute agreement (e.g., Cohen's kappa) before aggregating.
Artifacts: CSV with anon_id, label, notes; summary report with precision per model.
Purpose: Create ground truth for precision, recall, and F1. Diffs with known issues; human labels which issues exist; compare Stet output to labels.
Prerequisites: Ability to create or select diffs with known defects (bugs, style issues, security issues).
Experiment design:
- Create 10–30 diffs (or select from real PRs) where you know the true positives: "This diff introduces bug X at file:line" or "This diff has style issue Y."
- For each diff, produce a ground-truth JSON: `[{"file": "...", "line": N, "issue_type": "bug"|"style"|..., "description": "..."}]`.
- Run Stet on each diff; human matches Stet findings to ground truth (TP/FP/FN).
Step-by-step protocol:
- Create or select diff; document expected issues in ground-truth JSON.
- Run `stet start ref --json` (or equivalent for that diff); save findings.
- For each Stet finding: is it a TP (matches a ground-truth issue), FP (does not match), or is it a new valid issue? If new valid issue, add to ground truth and treat as TP.
- For each ground-truth issue: was it found by Stet? If not, FN.
- Record TP, FP, FN per diff; aggregate.
What to record: Per diff: diff_id, TP, FP, FN; optionally per-finding match details.
Sample size guidance: 10–30 diffs; 2–10 issues per diff. Balance coverage of categories (security, correctness, style).
Analysis:
- Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2 * P * R / (P + R).
- Report per diff and aggregate; optionally per category.
Artifacts: Ground-truth JSON files; findings JSON per diff; summary with P, R, F1.
Purpose: Use Stet on the Stet repo; triage findings with stet dismiss; derive actionability from history. Real project, real usage.
Prerequisites: Stet repo; Ollama with model; familiarity with the codebase.
Experiment design:
- Run Stet on recent commits (e.g., `stet start HEAD~5`).
- Triage every finding: either fix the issue or `stet dismiss <id> <reason>`.
- Finish session; analyze history.
Step-by-step protocol:
- `stet start HEAD~5` (or chosen ref).
- For each finding: read code; decide: fix in code and commit, or `stet dismiss <id> <reason>` with one of `false_positive`, `already_correct`, `wrong_suggestion`, `out_of_scope`.
- Re-run `stet run` after fixes; repeat until all findings triaged.
- `stet finish`.
- Run automated test A2 on `.review/history.jsonl` to get actionability and reason breakdown.
- Optionally: add recurring patterns to the review-quality.md curated false-positive table.
What to record: Dismissal reasons; notes on any new false-positive patterns.
Sample size guidance: One full review session; aim for 20+ findings to get meaningful actionability.
Analysis: Actionability rate; per-reason counts; qualitative notes on patterns.
Artifacts: Updated history.jsonl; optional updates to review-quality.md.
Purpose: Compare Stet to another LLM-powered review tool (e.g., RoboRev, Graphite) on the same diffs. Measure overlap and unique value per tool.
Prerequisites: Stet and at least one other tool; same diffs run through both; ability to normalize outputs (file, line, message).
Experiment design:
- Select 5–15 diffs with non-trivial changes.
- Run Stet; export findings.
- Run other tool on same diffs; export findings.
- Normalize to common schema (file, line, message).
- A human labels each matched pair as overlap (both tools found a similar issue) and each unmatched finding as unique (only one tool found it); label unique findings as valid or invalid.
Step-by-step protocol:
- For each diff: run Stet, save findings; run other tool, save findings.
- Normalize outputs to (file, line, message) or equivalent.
- Match findings across tools: same file:line and similar message → overlap.
- For unique findings (only Stet or only other tool): human labels valid/invalid.
- Compute: overlap count; unique-Stet count (and how many valid); unique-other count (and how many valid).
What to record: Per finding: tool, file, line, message, overlap_with (other finding id or none), unique_valid (yes/no).
Sample size guidance: 5–15 diffs; 5–30 findings per tool per diff.
Analysis: Overlap rate; precision of unique findings per tool; qualitative comparison.
Artifacts: Normalized findings CSV; overlap matrix; summary report.
Purpose: Check if an LLM can reliably label findings (actionable vs not) in agreement with humans. If yes, LLM-as-judge can scale human evaluation.
Prerequisites: Subset of findings with human labels (e.g., from H1); access to an LLM API (e.g., Claude, GPT) for judging.
Experiment design:
- Take 50–100 findings with human labels from H1 (or similar).
- Send each finding (file, line, message, code snippet) to judge LLM with prompt: "Is this code review finding actionable? Respond: actionable | false_positive | wrong_suggestion | out_of_scope."
- Compare judge labels to human labels.
Step-by-step protocol:
- Export human-labeled findings with labels.
- For each finding: construct prompt with file, line, message, and code context (e.g., 5 lines before/after).
- Call judge LLM; parse response into one of the four labels.
- Record judge label and human label.
- Compute agreement: exact match rate; optionally Cohen's kappa.
What to record: finding_id, human_label, judge_label, match (boolean).
Sample size guidance: 50–100 findings. If agreement < 85%, do not use judge alone for evaluation.
Analysis: Agreement rate; confusion matrix (human vs judge); per-reason accuracy.
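A small sketch of the agreement computation over (human_label, judge_label) pairs; Cohen's kappa follows the standard multi-class formula, and a stats library works equally well:

```python
# Sketch only: exact-match rate, Cohen's kappa, and a flat confusion count.
from collections import Counter

LABELS = ["actionable", "false_positive", "wrong_suggestion", "out_of_scope"]

def agreement(pairs):
    n = len(pairs)
    observed = sum(h == j for h, j in pairs) / n
    human = Counter(h for h, _ in pairs)
    judge = Counter(j for _, j in pairs)
    expected = sum(human[l] * judge[l] for l in LABELS) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    confusion = Counter(pairs)
    return {
        "exact_match_rate": round(observed, 3),
        "cohens_kappa": round(kappa, 3),
        "confusion": {f"{h} -> {j}": c for (h, j), c in confusion.items()},
    }

# Example with three labeled findings (hypothetical data).
print(agreement([("actionable", "actionable"),
                 ("false_positive", "actionable"),
                 ("out_of_scope", "out_of_scope")]))
```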
Artifacts: Comparison CSV; agreement report.
Purpose: Identify new false-positive patterns from Stet runs and add them to the curated table in review-quality.md for prompt shadowing and optimizer.
Prerequisites: Stet runs that produced findings; access to review-quality.md.
Experiment design:
- Run Stet on one or more repos; collect findings that were dismissed as false_positive or wrong_suggestion.
- Cluster by message pattern or category; identify recurring patterns.
- Add new patterns to the curated table with category, message_pattern, reason, note.
Step-by-step protocol:
- Run Stet; triage findings; record dismissals with reasons.
- Filter to `false_positive` and `wrong_suggestion`.
- Group by similar message (e.g., substring or keyword).
- For each group: decide if it merits a curated entry. If yes, add to review-quality.md table: category, message_pattern, reason, note.
- Follow schema in review-quality.md (see "Known false positive patterns" and "Schema for false positive entries").
What to record: Pattern, reason, example finding, note.
Sample size guidance: Continue until no new patterns emerge from last N sessions (e.g., 5–10).
Analysis: Count of new patterns added; optional reduction in similar future false positives.
Artifacts: Updated review-quality.md.
Purpose: Score the quality of suggested fixes: correct/safe, partial, wrong, or harmful. Complements precision of the finding itself.
Prerequisites: Findings with suggestion field; human can evaluate code changes.
Experiment design:
- Sample 30–50 findings that have a non-empty suggestion.
- Human evaluates each suggestion in context: would applying it fix the issue correctly, partially, or make things worse?
Step-by-step protocol:
- Export findings with `suggestion`; filter to non-empty.
- For each: read file, line, message, suggestion.
- Label: `correct_safe` | `partial` | `wrong` | `harmful`. Optionally add note.
- Record in spreadsheet.
What to record: finding_id, suggestion_quality, notes.
Sample size guidance: 30–50 findings with suggestions.
Analysis: Distribution of quality; percentage correct_safe; percentage harmful (critical to minimize).
Artifacts: CSV; summary report.
Purpose: Check if Stet's severity (error, warning, info, nitpick) matches human expectation. High misclassification erodes trust.
Prerequisites: Sample of findings; human can judge appropriate severity.
Experiment design:
- Take 30–50 findings across severities.
- Human labels: for each, is the assigned severity correct, or should it be higher/lower?
Step-by-step protocol:
- Export findings with severity.
- For each: read finding and code context.
- Label: `correct` | `too_high` | `too_low`. Optionally suggest correct severity.
- Record.
What to record: finding_id, assigned_severity, human_verdict, suggested_severity (optional).
Sample size guidance: 30–50 findings; strive for a mix of severities.
Analysis: Misclassification rate (too_high + too_low); confusion matrix.
Artifacts: CSV; summary report.
Purpose: Compare models by real-world usage: satisfaction, time-to-triage, perceived usefulness over days of use.
Prerequisites: Two models to compare; developer(s) willing to use each for a period.
Experiment design:
- Use model A for N days (e.g., 5–7); use model B for N days. Counterbalance order (half use A first, half B first if multiple users).
- Track: time spent triaging per session; number of actionable fixes applied; qualitative preference.
Step-by-step protocol:
- Define period length (e.g., 5 days per model).
- Use Stet with model A exclusively for period 1; record after each session: findings count, dismissals, fixes applied, time spent (minutes).
- Switch to model B for period 2; same recording.
- Survey: which model did you prefer? Why? What was different?
- Aggregate metrics; compare.
What to record: Per session: model, findings_count, dismissals_count, fixes_applied, time_minutes. Final: preference, free-form feedback.
Sample size guidance: At least 2–3 sessions per model per user; multiple users improve confidence.
Analysis: Mean time per session; mean actionable rate; preference count; qualitative themes.
Artifacts: Session log; survey responses; summary report.
| ID | Name | Type | Description |
|---|---|---|---|
| A1 | Same-diff model swap | Automated | Compare two models on same diff; counts, overlap, unique |
| A2 | Actionability from history | Automated | Parse history.jsonl; actionability rate, reason breakdown |
| A3 | Latency and throughput | Automated | Wall-clock time, hunks/sec |
| A4 | Repeatability | Automated | Same model N runs; Jaccard similarity |
| A5 | Category/severity distribution | Automated | Counts by category and severity |
| A6 | Finding-ID stability | Automated | Assert IDs stable across runs |
| A7 | Dry-run regression | Automated | Schema validation; CI without Ollama |
| A8 | Multi-run aggregation | Automated | Merge N runs; optional ground-truth precision/recall |
| A9 | Context-level tagging | Automated | Heuristic diff/file/repo tagging |
| A10 | RAG ablation | Automated | Compare RAG on vs off |
| H1 | Blind triage | Human | Label findings; precision per model |
| H2 | Fixture benchmark | Human | Ground truth; precision, recall, F1 |
| H3 | Self-review dogfood | Human | Triage on Stet repo; actionability from history |
| H4 | Cross-tool comparison | Human | Stet vs other tool; overlap, unique value |
| H5 | LLM-as-judge calibration | Human | Compare judge LLM to human labels |
| H6 | Curated FP audit | Human | Add patterns to review-quality.md |
| H7 | Suggestion quality | Human | Score suggestion correctness |
| H8 | Severity calibration | Human | Check severity matches expectation |
| H9 | User preference A/B | Human | Compare models over days of use |