| title | Eval Mode |
|---|
Use wavemill eval when you want to inspect or export the data that makes Wavemill self-improving. Mill mode records this automatically; the standalone command is for analysis and reporting.
- Gathers context from completed workflows (Linear issue, PR diff, review comments).
- Detects intervention events (manual fixes, review requests, tool failures, rework).
- Analyzes PR difficulty (lines changed, files touched, architectural complexity).
- Collects outcome metrics (CI status, test coverage, static analysis, review rounds).
- Invokes an LLM judge to score the workflow (0.0 - 1.0 scale).
- Applies weighted penalties for interventions based on severity.
- Generates structured evaluation reports with rationale and score bands.
- Persists eval records to disk for aggregate analysis.
# Auto-detect from most recent wavemill workflow
npx tsx tools/eval-workflow.ts
# Evaluate a specific issue and PR
npx tsx tools/eval-workflow.ts --issue HOK-699 --pr 42
# Override judge model
npx tsx tools/eval-workflow.ts --model claude-opus-4-6
# Specify solution model for tracking
npx tsx tools/eval-workflow.ts --issue HOK-699 --pr 42 --solution-model codex-1The tool resolves workflow context in priority order:
- Explicit arguments (
--issue,--pr) if provided - Workflow state file (
.wavemill/workflow-state.json) for most recent task with PR - Current branch PR using
gh pr view
It then fetches:
- Linear issue details (identifier, title, description)
- PR diff via
gh pr diff - Review comments from GitHub API
The tool analyzes git history and GitHub activity to detect intervention events:
| Intervention Type | Detection Method | Severity |
|---|---|---|
| Manual commits | Author ≠ agent user | High |
| Review comments | PR review activity | Medium |
| Force pushes | git reflog analysis |
High |
| CI failures | GitHub checks API | Medium |
| Tool failures | Agent log analysis | Low |
| Branch resets | git reflog inspection |
High |
Each intervention has a weighted penalty (configured in .wavemill-config.json).
PR difficulty is assessed across multiple dimensions:
Stratum classification:
- Trivial — < 50 LOC, 1-2 files, config/docs only
- Simple — < 200 LOC, 3-5 files, single-layer changes
- Moderate — < 500 LOC, 6-10 files, cross-layer changes
- Complex — 500+ LOC, 10+ files, architectural changes
Signals analyzed:
- Lines of code touched
- Number of files modified
- Cyclomatic complexity delta
- Architectural layer changes (UI, API, data)
- Test coverage changes
The tool gathers measurable outcomes:
CI/CD:
- All checks passed/failed
- Check names and conclusions
Tests:
- Tests added (yes/no)
- Pass rate if tests exist
- Coverage delta
Static Analysis:
- Typecheck pass/fail
- Lint error delta
- Security warnings
Review:
- Approvals count
- Change requests count
- Review rounds
- Human review required (yes/no)
Rework:
- Agent iterations
- Self-review cycles
- Tool failures
Delivery:
- PR created (yes/no)
- Merged (yes/no)
- Time to merge
Context is sent to the LLM judge (default: claude-sonnet-4-5-20250929):
Judge inputs:
- Task prompt (issue description)
- PR review output (diff + comments)
- Intervention summary with weighted penalties
- Outcome metrics
Judge outputs:
- Score (0.0 - 1.0)
- Score band (Excellent / Good / Acceptable / Poor / Failed)
- Rationale (why this score)
- Intervention flags (additional issues detected)
base_score = judge_raw_score
intervention_penalty = sum(intervention_weights)
final_score = max(0.0, base_score - intervention_penalty)
Eval records are appended to .wavemill/eval-records.jsonl with:
- Timestamp
- Issue ID and PR URL
- Score and band
- Agent type and model
- Intervention count and details
- Outcome metrics
- Difficulty signals
- Task and repo context
The LLM judge focuses on major issues only — not style or subjective preferences.
| Dimension | Examples |
|---|---|
| Correctness | Logic errors, off-by-one bugs, null handling, race conditions |
| Security | SQL injection, XSS, exposed secrets, missing auth checks |
| Requirements | Missing planned features, wrong approach, scope creep |
| Error Handling | Unhandled network failures, missing validation at boundaries |
| Architecture | Pattern violations, wrong abstraction layer, breaking changes |
| Testing | Missing tests, inadequate coverage, brittle test design |
Interventions reduce the score based on severity:
- High severity (manual fixes, force pushes) — 0.2-0.3 penalty each
- Medium severity (review comments, CI failures) — 0.1-0.15 penalty each
- Low severity (tool failures, retries) — 0.05 penalty each
| Score | Band | Description |
|---|---|---|
| 0.9 - 1.0 | Excellent | Production-ready, minimal review needed |
| 0.7 - 0.89 | Good | Solid work, minor improvements possible |
| 0.5 - 0.69 | Acceptable | Functional but needs polish |
| 0.3 - 0.49 | Poor | Significant issues, major rework needed |
| 0.0 - 0.29 | Failed | Does not meet basic requirements |
═══════════════════════════════════════════════════════════════
WORKFLOW EVALUATION
═══════════════════════════════════════════════════════════════
Issue: HOK-699
PR: https://github.com/org/repo/pull/42
Agent: claude
Model: claude-sonnet-4-5-20250929
Time: 127s
Score: 0.82 ████████░░ Good
Solid work, minor improvements possible
Rationale:
Implementation correctly handles the requirements with proper
error handling and test coverage. One minor issue with edge
case handling in the validation logic.
Interventions: 1
[MED] review_comment @ 14:23:45
Manual review comment about error message clarity
Outcomes:
Success: ✓
CI: passed (3 checks)
Tests: added (95% pass)
Review: 1 approvals, 0 change requests, 1 rounds
Rework: 2 iterations
Delivery: merged (4h)
═══════════════════════════════════════════════════════════════
When piped to a file or non-TTY, outputs complete JSON record:
{
"timestamp": "2026-02-27T21:45:00Z",
"issueId": "HOK-699",
"prUrl": "https://github.com/org/repo/pull/42",
"score": 0.82,
"scoreBand": "Good",
"rationale": "...",
"interventionRequired": true,
"interventionCount": 1,
"interventions": [...],
"outcomes": {...},
"difficultyBand": "moderate",
"taskContext": {...},
"repoContext": {...},
"modelId": "claude-sonnet-4-5-20250929",
"agentType": "claude",
"timeSeconds": 127
}Eval settings live in .wavemill-config.json:
{
"eval": {
"judge": {
"model": "claude-sonnet-4-5-20250929",
"provider": "claude-cli"
},
"interventions": {
"penalties": {
"manual_commit": 0.25,
"review_comment": 0.10,
"force_push": 0.30,
"ci_failure": 0.15,
"tool_failure": 0.05,
"branch_reset": 0.30
}
}
}
}| Setting | Default | Description |
|---|---|---|
eval.judge.model |
claude-sonnet-4-5-20250929 |
LLM model for evaluation |
eval.judge.provider |
claude-cli |
Provider (claude-cli or anthropic) |
eval.interventions.penalties.* |
See above | Penalty weights per intervention type |
npx tsx tools/eval-workflow.ts [options]
Options:
--issue ID Linear issue identifier (e.g., HOK-123)
--pr NUMBER GitHub PR number
--model ID Override eval model
--solution-model ID Model that produced the solution (for tracking)
--agent TYPE Agent type: claude or codex (default: claude)
--repo-dir DIR Repository directory (default: current)
--routing-decision JSON Routing decision metadata (JSON string)
--help, -h Show help message| Code | Meaning |
|---|---|
0 |
Evaluation completed successfully |
1 |
Error occurred during evaluation |
| File | Purpose |
|---|---|
tools/eval-workflow.ts |
CLI tool — orchestrates evaluation workflow |
shared/lib/eval.js |
Core judge invocation logic |
shared/lib/eval-schema.ts |
Score bands, validation schemas |
shared/lib/intervention-detector.ts |
Detects and weights intervention events |
shared/lib/difficulty-analyzer.ts |
Analyzes PR complexity and difficulty |
shared/lib/outcome-collectors.ts |
Collects CI, test, review, delivery metrics |
.wavemill/eval-records.jsonl |
Persisted evaluation records |
.wavemill-config.json |
Evaluation configuration and penalties |
Combine eval records from multiple repositories into a single JSONL file for cross-repo analysis and ML training using wavemill eval aggregate.
# Auto-discover all wavemill repos in /path/to/repos
npx tsx tools/aggregate-evals.ts --discover /path/to/repos
# Aggregate specific repos
npx tsx tools/aggregate-evals.ts --repos ~/proj1 ~/proj2 ~/proj3
# Use configured repos from .wavemill-config.json
npx tsx tools/aggregate-evals.ts- Auto-Discovery: Recursively finds all
.wavemill/evals/evals.jsonlfiles - Hash-Based Deduplication: Removes duplicate evaluations (same issue + PR + model)
- Cost Aggregation: Summarizes workflow costs across all repos
- Sorting: Orders records by timestamp for reproducibility
- Flexible Output: JSONL format, one record per line
Add to .wavemill-config.json:
{
"eval": {
"aggregation": {
"repos": ["/path/to/repo1", "/path/to/repo2"],
"outputPath": ".wavemill/evals/aggregated-evals.jsonl"
}
}
}- Cross-Repo Metrics — Analyze success rates, score distributions across multiple projects
- Cost Attribution — Track total spend by project, model, and time period
- ML Training — Aggregated data for DSPy evaluation model training
- Quality Trends — Identify patterns in task difficulty and agent performance
See eval-aggregate for detailed documentation.
- Mill Mode — autonomous parallel backlog processing (includes eval in post-completion hooks)
- Review Mode — LLM-powered code review (runs before eval)
- Feature Workflow — guided single-issue execution
- Expand Mode — batch expand issues into task packets
- Eval Aggregation — multi-repo eval log aggregation
- Troubleshooting — common issues and fixes