wavemill/docs/eval-mode.md at main · timogilvie/wavemill

title	Eval Mode

Use wavemill eval when you want to inspect or export the data that makes Wavemill self-improving. Mill mode records this automatically; the standalone command is for analysis and reporting.

What It Does

Gathers context from completed workflows (Linear issue, PR diff, review comments).
Detects intervention events (manual fixes, review requests, tool failures, rework).
Analyzes PR difficulty (lines changed, files touched, architectural complexity).
Collects outcome metrics (CI status, test coverage, static analysis, review rounds).
Invokes an LLM judge to score the workflow (0.0 - 1.0 scale).
Applies weighted penalties for interventions based on severity.
Generates structured evaluation reports with rationale and score bands.
Persists eval records to disk for aggregate analysis.

Run It

# Auto-detect from most recent wavemill workflow
npx tsx tools/eval-workflow.ts

# Evaluate a specific issue and PR
npx tsx tools/eval-workflow.ts --issue HOK-699 --pr 42

# Override judge model
npx tsx tools/eval-workflow.ts --model claude-opus-4-6

# Specify solution model for tracking
npx tsx tools/eval-workflow.ts --issue HOK-699 --pr 42 --solution-model codex-1

How It Works

1) Context Resolution

The tool resolves workflow context in priority order:

Explicit arguments (--issue, --pr) if provided
Workflow state file (.wavemill/workflow-state.json), using the most recent task that has a PR
Current branch PR using gh pr view

It then fetches:

Linear issue details (identifier, title, description)
PR diff via gh pr diff
Review comments from GitHub API
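The priority order above can be sketched as a simple fallback chain. This is an illustration only: the `CliArgs`, `StateTask`, and `WorkflowContext` shapes, the `updatedAt` field, and the `resolveContext` name are all hypothetical, and the layout of .wavemill/workflow-state.json is assumed rather than taken from the tool.

```typescript
// Hypothetical shapes; the real tool's types are not shown in this document.
interface CliArgs { issue?: string; pr?: number }
interface StateTask { issue: string; pr?: number; updatedAt: string } // assumed state-file entry
interface WorkflowContext {
  issue?: string;
  pr: number;
  source: "args" | "state-file" | "branch";
}

function resolveContext(
  args: CliArgs,
  stateTasks: StateTask[],      // assumed to be parsed from .wavemill/workflow-state.json
  branchPr: number | undefined, // e.g. the PR number reported by `gh pr view`
): WorkflowContext | undefined {
  // 1. Explicit arguments win (this sketch keys on --pr being present).
  if (args.pr !== undefined) {
    return { issue: args.issue, pr: args.pr, source: "args" };
  }
  // 2. Otherwise take the most recent state-file task that has a PR.
  //    ISO-8601 UTC timestamps in one format sort correctly as plain strings.
  const [latest] = stateTasks
    .filter((t): t is StateTask & { pr: number } => t.pr !== undefined)
    .sort((a, b) => (a.updatedAt < b.updatedAt ? 1 : a.updatedAt > b.updatedAt ? -1 : 0));
  if (latest) {
    return { issue: latest.issue, pr: latest.pr, source: "state-file" };
  }
  // 3. Finally fall back to the PR attached to the current branch.
  if (branchPr !== undefined) {
    return { pr: branchPr, source: "branch" };
  }
  return undefined; // nothing to evaluate
}
```

Returning the `source` alongside the PR makes it easy to log which of the three paths was taken.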

2) Intervention Detection

The tool analyzes git history and GitHub activity to detect intervention events:

Intervention Type	Detection Method	Severity
Manual commits	Author ≠ agent user	High
Review comments	PR review activity	Medium
Force pushes	`git reflog` analysis	High
CI failures	GitHub checks API	Medium
Tool failures	Agent log analysis	Low
Branch resets	`git reflog` inspection	High

Each intervention type carries a weighted penalty, configured under eval.interventions.penalties in .wavemill-config.json.
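As an illustration of the first row of the table, a manual-commit check could compare each commit author with the agent's user. The `Commit` and `Intervention` shapes and the `detectManualCommits` name below are hypothetical; the actual logic lives in `shared/lib/intervention-detector.ts` and may differ.

```typescript
// Hypothetical commit shape, e.g. built from `git log` output.
interface Commit { sha: string; author: string; message: string }

interface Intervention {
  type: "manual_commit";
  severity: "high"; // severity taken from the table above
  detail: string;
}

// Flags every commit whose author is not the agent user.
function detectManualCommits(commits: Commit[], agentUser: string): Intervention[] {
  return commits
    .filter((c) => c.author !== agentUser)
    .map((c): Intervention => ({
      type: "manual_commit",
      severity: "high",
      detail: `${c.sha.slice(0, 7)} by ${c.author}: ${c.message}`,
    }));
}
```

For example, with `agentUser` set to `"wavemill-bot"` (an invented name), a commit authored by `"alice"` would produce one high-severity `manual_commit` intervention.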

3) Difficulty Analysis

PR difficulty is assessed across multiple dimensions:

Stratum classification:

Trivial — < 50 LOC, 1-2 files, config/docs only
Simple — < 200 LOC, 3-5 files, single-layer changes
Moderate — < 500 LOC, 6-10 files, cross-layer changes
Complex — 500+ LOC, 10+ files, architectural changes

Signals analyzed:

Lines of code touched
Number of files modified
Cyclomatic complexity delta
Architectural layer changes (UI, API, data)
Test coverage changes
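A minimal sketch of the stratum classification, using only the two numeric signals (LOC and file count). It assumes the harder of the two signals decides the stratum and that exactly 10 files still counts as Moderate; the document does not say how the real analyzer combines signals, and the qualitative ones (layers, complexity delta, coverage) are ignored here.

```typescript
const STRATA = ["trivial", "simple", "moderate", "complex"] as const;
type Stratum = (typeof STRATA)[number];

// Rank 0..3 from lines of code, using the LOC thresholds listed above.
function rankByLoc(loc: number): number {
  if (loc < 50) return 0;
  if (loc < 200) return 1;
  if (loc < 500) return 2;
  return 3;
}

// Rank 0..3 from files touched (assumption: 10 files is still "moderate").
function rankByFiles(files: number): number {
  if (files <= 2) return 0;
  if (files <= 5) return 1;
  if (files <= 10) return 2;
  return 3;
}

// Assumption: the harder of the two signals wins.
function classifyStratum(loc: number, files: number): Stratum {
  return STRATA[Math.max(rankByLoc(loc), rankByFiles(files))];
}
```

Under these assumptions, a 30-line change spread over 7 files is classified as `"moderate"` because the file count outranks the line count.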

4) Outcome Collection

The tool gathers measurable outcomes:

CI/CD:

All checks passed/failed
Check names and conclusions

Tests:

Tests added (yes/no)
Pass rate if tests exist
Coverage delta

Static Analysis:

Typecheck pass/fail
Lint error delta
Security warnings

Review:

Approvals count
Change requests count
Review rounds
Human review required (yes/no)

Rework:

Agent iterations
Self-review cycles
Tool failures

Delivery:

PR created (yes/no)
Merged (yes/no)
Time to merge

5) LLM Judge Invocation

Context is sent to the LLM judge (default: claude-sonnet-4-5-20250929):

Judge inputs:

Task prompt (issue description)
PR review output (diff + comments)
Intervention summary with weighted penalties
Outcome metrics

Judge outputs:

Score (0.0 - 1.0)
Score band (Excellent / Good / Acceptable / Poor / Failed)
Rationale (why this score)
Intervention flags (additional issues detected)
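Before the judge's output is used, it is worth validating. The sketch below assumes the judge replies with a JSON object; the field names `score`, `scoreBand`, and `rationale` mirror the JSON record shown later in this document, while `interventionFlags` and the `parseJudgeOutput` function are hypothetical. The real schemas in `shared/lib/eval-schema.ts` may look different.

```typescript
const BANDS = ["Excellent", "Good", "Acceptable", "Poor", "Failed"] as const;
type ScoreBand = (typeof BANDS)[number];

interface JudgeOutput {
  score: number;               // 0.0 - 1.0
  scoreBand: ScoreBand;
  rationale: string;
  interventionFlags: string[]; // hypothetical field name
}

// Parses and validates a raw judge reply; throws on malformed output.
function parseJudgeOutput(raw: string): JudgeOutput {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("judge output must be a JSON object");
  }
  const o = data as Record<string, unknown>;
  if (typeof o.score !== "number" || !(o.score >= 0 && o.score <= 1)) {
    throw new Error("score must be a number between 0.0 and 1.0");
  }
  if (typeof o.scoreBand !== "string" || !BANDS.includes(o.scoreBand as ScoreBand)) {
    throw new Error("unknown score band");
  }
  if (typeof o.rationale !== "string") {
    throw new Error("rationale must be a string");
  }
  // Missing or malformed flags are treated as "no additional flags".
  const flags = Array.isArray(o.interventionFlags)
    ? o.interventionFlags.filter((f): f is string => typeof f === "string")
    : [];
  return {
    score: o.score,
    scoreBand: o.scoreBand as ScoreBand,
    rationale: o.rationale,
    interventionFlags: flags,
  };
}
```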

6) Scoring Formula

base_score = judge_raw_score
intervention_penalty = sum(intervention_weights)
final_score = max(0.0, base_score - intervention_penalty)
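The formula translates directly into a small function. The name `finalScore` and the example numbers are illustrative only.

```typescript
// final_score = max(0.0, base_score - sum(intervention_weights))
function finalScore(judgeRawScore: number, interventionWeights: number[]): number {
  const interventionPenalty = interventionWeights.reduce((sum, w) => sum + w, 0);
  return Math.max(0.0, judgeRawScore - interventionPenalty);
}

// One review comment (0.10) against a raw score of 0.92:
console.log(finalScore(0.92, [0.10]).toFixed(2)); // prints "0.82"

// Penalties larger than the raw score clamp to zero instead of going negative:
console.log(finalScore(0.4, [0.25, 0.30])); // prints 0
```

The clamp matters for heavily assisted workflows: a manual commit plus a force push alone (0.25 + 0.30 with the example weights in this document) remove more than half of the available score.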

7) Persistence

Eval records are appended to .wavemill/eval-records.jsonl with:

Timestamp
Issue ID and PR URL
Score and band
Agent type and model
Intervention count and details
Outcome metrics
Difficulty signals
Task and repo context
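JSONL stores one JSON object per line, so appending a record never requires rewriting the file. The sketch below covers only serialization and parsing; `EvalRecordLite` is a trimmed-down, hypothetical subset of the record fields, not the tool's actual type.

```typescript
// Hypothetical subset of an eval record, for illustration only.
interface EvalRecordLite {
  timestamp: string;
  issueId: string;
  prUrl: string;
  score: number;
  scoreBand: string;
}

// JSON.stringify escapes embedded newlines, so each record stays on one line.
function toJsonlLine(record: EvalRecordLite): string {
  return JSON.stringify(record) + "\n";
}

// Parses JSONL content, skipping blank lines (e.g. the trailing newline).
function parseJsonl(content: string): EvalRecordLite[] {
  return content
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as EvalRecordLite);
}
```

In Node.js, the string returned by `toJsonlLine` could be written with `fs.appendFileSync(path, line)`, which appends to the file and creates it if it does not exist.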

Evaluation Criteria

The LLM judge focuses on major issues only — not style or subjective preferences.

Core Quality Dimensions

Dimension	Examples
Correctness	Logic errors, off-by-one bugs, null handling, race conditions
Security	SQL injection, XSS, exposed secrets, missing auth checks
Requirements	Missing planned features, wrong approach, scope creep
Error Handling	Unhandled network failures, missing validation at boundaries
Architecture	Pattern violations, wrong abstraction layer, breaking changes
Testing	Missing tests, inadequate coverage, brittle test design

Intervention Impact

Interventions reduce the score based on severity:

High severity (manual fixes, force pushes) — 0.2-0.3 penalty each
Medium severity (review comments, CI failures) — 0.1-0.15 penalty each
Low severity (tool failures, retries) — 0.05 penalty each

Score Bands

Score	Band	Description
0.9 - 1.0	Excellent	Production-ready, minimal review needed
0.7 - 0.89	Good	Solid work, minor improvements possible
0.5 - 0.69	Acceptable	Functional but needs polish
0.3 - 0.49	Poor	Significant issues, major rework needed
0.0 - 0.29	Failed	Does not meet basic requirements
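The band table can be read as a set of lower-bound thresholds. Note that the table leaves small gaps between bands (for example between 0.89 and 0.9); this sketch assumes each lower bound is inclusive, so a score of 0.895 falls into Good. The real mapping in `shared/lib/eval-schema.ts` may handle boundaries differently.

```typescript
type ScoreBandName = "Excellent" | "Good" | "Acceptable" | "Poor" | "Failed";

// Maps a final score in [0, 1] to its band using inclusive lower bounds.
function scoreBand(score: number): ScoreBandName {
  if (score >= 0.9) return "Excellent";
  if (score >= 0.7) return "Good";
  if (score >= 0.5) return "Acceptable";
  if (score >= 0.3) return "Poor";
  return "Failed";
}

console.log(scoreBand(0.82)); // prints "Good"
```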

Output Format

Terminal Output (Human-Readable)

═══════════════════════════════════════════════════════════════
  WORKFLOW EVALUATION
═══════════════════════════════════════════════════════════════

  Issue:  HOK-699
  PR:     https://github.com/org/repo/pull/42
  Agent:  claude
  Model:  claude-sonnet-4-5-20250929
  Time:   127s

  Score:  0.82  ████████░░  Good
          Solid work, minor improvements possible

  Rationale:
  Implementation correctly handles the requirements with proper
  error handling and test coverage. One minor issue with edge
  case handling in the validation logic.

  Interventions: 1
    [MED] review_comment @ 14:23:45
      Manual review comment about error message clarity

  Outcomes:
    Success:   ✓
    CI:        passed (3 checks)
    Tests:     added (95% pass)
    Review:    1 approvals, 0 change requests, 1 rounds
    Rework:    2 iterations
    Delivery:  merged (4h)

═══════════════════════════════════════════════════════════════

JSON Output (Machine-Readable)

When output is piped to a file or another non-TTY destination, the tool emits the complete JSON record:

{
  "timestamp": "2026-02-27T21:45:00Z",
  "issueId": "HOK-699",
  "prUrl": "https://github.com/org/repo/pull/42",
  "score": 0.82,
  "scoreBand": "Good",
  "rationale": "...",
  "interventionRequired": true,
  "interventionCount": 1,
  "interventions": [...],
  "outcomes": {...},
  "difficultyBand": "moderate",
  "taskContext": {...},
  "repoContext": {...},
  "modelId": "claude-sonnet-4-5-20250929",
  "agentType": "claude",
  "timeSeconds": 127
}

Configuration

Eval settings live in .wavemill-config.json:

{
  "eval": {
    "judge": {
      "model": "claude-sonnet-4-5-20250929",
      "provider": "claude-cli"
    },
    "interventions": {
      "penalties": {
        "manual_commit": 0.25,
        "review_comment": 0.10,
        "force_push": 0.30,
        "ci_failure": 0.15,
        "tool_failure": 0.05,
        "branch_reset": 0.30
      }
    }
  }
}

Setting	Default	Description
`eval.judge.model`	`claude-sonnet-4-5-20250929`	LLM model for evaluation
`eval.judge.provider`	`claude-cli`	Provider (`claude-cli` or `anthropic`)
`eval.interventions.penalties.*`	See above	Penalty weights per intervention type
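A partial override of the penalty weights might be resolved as sketched below. The default values are copied from the example config above; whether Wavemill actually merges overrides key by key is an assumption, and `resolvePenalties` is a hypothetical helper name.

```typescript
// Defaults copied from the example config in this document.
const DEFAULT_PENALTIES: Record<string, number> = {
  manual_commit: 0.25,
  review_comment: 0.10,
  force_push: 0.30,
  ci_failure: 0.15,
  tool_failure: 0.05,
  branch_reset: 0.30,
};

// Only the part of .wavemill-config.json this sketch cares about.
interface WavemillConfig {
  eval?: { interventions?: { penalties?: Record<string, number> } };
}

// Assumption: user-supplied weights override defaults one key at a time.
function resolvePenalties(config: WavemillConfig): Record<string, number> {
  return { ...DEFAULT_PENALTIES, ...(config.eval?.interventions?.penalties ?? {}) };
}
```

With this merge strategy, a config that sets only `review_comment` to 0.2 would keep every other weight at its default.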

Command-Line Options

npx tsx tools/eval-workflow.ts [options]

Options:
  --issue ID              Linear issue identifier (e.g., HOK-123)
  --pr NUMBER             GitHub PR number
  --model ID              Override eval model
  --solution-model ID     Model that produced the solution (for tracking)
  --agent TYPE            Agent type: claude or codex (default: claude)
  --repo-dir DIR          Repository directory (default: current)
  --routing-decision JSON Routing decision metadata (JSON string)
  --help, -h              Show help message

Exit Codes

Code	Meaning
`0`	Evaluation completed successfully
`1`	Error occurred during evaluation

Key Files

File	Purpose
`tools/eval-workflow.ts`	CLI tool — orchestrates evaluation workflow
`shared/lib/eval.js`	Core judge invocation logic
`shared/lib/eval-schema.ts`	Score bands, validation schemas
`shared/lib/intervention-detector.ts`	Detects and weights intervention events
`shared/lib/difficulty-analyzer.ts`	Analyzes PR complexity and difficulty
`shared/lib/outcome-collectors.ts`	Collects CI, test, review, delivery metrics
`.wavemill/eval-records.jsonl`	Persisted evaluation records
`.wavemill-config.json`	Evaluation configuration and penalties

Multi-Repo Aggregation

Overview

Use wavemill eval aggregate to combine eval records from multiple repositories into a single JSONL file for cross-repo analysis and ML training.

Quick Start

# Auto-discover all wavemill repos in /path/to/repos
npx tsx tools/aggregate-evals.ts --discover /path/to/repos

# Aggregate specific repos
npx tsx tools/aggregate-evals.ts --repos ~/proj1 ~/proj2 ~/proj3

# Use configured repos from .wavemill-config.json
npx tsx tools/aggregate-evals.ts

Features

Auto-Discovery: Recursively finds all .wavemill/evals/evals.jsonl files
Hash-Based Deduplication: Removes duplicate evaluations (same issue + PR + model)
Cost Aggregation: Summarizes workflow costs across all repos
Sorting: Orders records by timestamp for reproducibility
Flexible Output: JSONL format, one record per line
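The deduplication and sorting features could be combined roughly as follows. This sketch assumes a SHA-256 hash over issue, PR, and model as the dedup key, and that the earliest record wins when duplicates exist; the document does not specify the hash function or the tie-break rule, and `AggRecord`, `dedupKey`, and `aggregate` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical subset of the record fields relevant to aggregation.
interface AggRecord {
  timestamp: string; // ISO-8601 UTC, so plain string comparison orders correctly
  issueId: string;
  prUrl: string;
  modelId: string;
}

// Same issue + PR + model => same key (assumed hash function: SHA-256).
function dedupKey(r: AggRecord): string {
  return createHash("sha256")
    .update(`${r.issueId}\n${r.prUrl}\n${r.modelId}`)
    .digest("hex");
}

// Sorts by timestamp, then keeps the first (earliest) record for each key.
function aggregate(records: AggRecord[]): AggRecord[] {
  const sorted = [...records].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
  const seen = new Set<string>();
  const unique: AggRecord[] = [];
  for (const record of sorted) {
    const key = dedupKey(record);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(record);
    }
  }
  return unique;
}
```

Sorting before deduplicating makes the result independent of the order in which repositories are scanned, which is what makes the output reproducible.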

Configuration

Add to .wavemill-config.json:

{
  "eval": {
    "aggregation": {
      "repos": ["/path/to/repo1", "/path/to/repo2"],
      "outputPath": ".wavemill/evals/aggregated-evals.jsonl"
    }
  }
}

Use Cases

Cross-Repo Metrics: Analyze success rates and score distributions across multiple projects
Cost Attribution: Track total spend by project, model, and time period
ML Training: Use the aggregated data to train DSPy evaluation models
Quality Trends: Identify patterns in task difficulty and agent performance

See Also

See eval-aggregate for detailed documentation.
