Agent Eval: Test coverage for public API surface (runner, scorecard, scenario loading) #673

@kovtcharov

Description

Problem

The eval framework has 95 pytest tests, but it's unclear whether they cover the public API surface that third parties will use. The tests were written during development and may focus on internals rather than the documented interfaces.

What Needs Testing (Third-Party API Surface)

Scenario Loading

  • YAML scenario parsing — all field types validated
  • Invalid scenario YAML — clear error messages
  • Custom scenario directory discovery (--scenario-dir)
  • Scenario filtering by category (--category) and tag (--tag)
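The invalid-YAML bullet is worth pinning down with a concrete shape. Since GAIA's actual loader API isn't shown in this issue, here is a minimal sketch of what such a test could assert, using a hypothetical `validate_scenario` helper over an already-parsed dict (the real loader, field names, and error types are assumptions):

```python
# Hypothetical stand-in for the framework's scenario validator; the field
# names and error type are illustrative, not GAIA's actual API.
REQUIRED_FIELDS = {"name": str, "category": str, "turns": list}

def validate_scenario(data: dict) -> dict:
    """Check required fields and types, raising a clear, actionable error."""
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"scenario missing required field: {field!r}")
        if not isinstance(data[field], expected):
            raise ValueError(
                f"scenario field {field!r} must be {expected.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return data

# A valid scenario passes through unchanged.
ok = validate_scenario({"name": "greet", "category": "smoke", "turns": []})
print(ok["name"])  # greet

# An invalid scenario fails with a message naming the missing field.
try:
    validate_scenario({"name": "greet", "category": "smoke"})
except ValueError as e:
    print(e)  # scenario missing required field: 'turns'
```

The point of the test is the error text: a third-party author should learn which field is wrong without reading GAIA source.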

Runner

  • Single scenario execution end-to-end
  • Batch execution with pass/fail aggregation
  • Timeout handling — scenario exceeds time limit
  • Resume from checkpoint (--resume)
  • Cost tracking accuracy
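Batch aggregation and timeout handling can be tested against small synthetic scenarios. A sketch, assuming a hypothetical `run_batch` interface (scenario callables plus a per-scenario time budget); GAIA's real runner signature is not shown in this issue:

```python
import time

# Hypothetical runner: run each scenario, record pass/fail/timeout, and
# aggregate counts. run_batch and time_limit are illustrative names only.
def run_batch(scenarios, time_limit=1.0):
    results = {}
    for name, fn in scenarios.items():
        start = time.monotonic()
        try:
            passed = fn()
        except Exception:
            passed = False
        elapsed = time.monotonic() - start
        if elapsed > time_limit:
            results[name] = "timeout"  # scenario exceeded its time budget
        else:
            results[name] = "pass" if passed else "fail"
    summary = {s: list(results.values()).count(s)
               for s in ("pass", "fail", "timeout")}
    return results, summary

results, summary = run_batch({
    "fast_ok": lambda: True,
    "fast_bad": lambda: False,
    "slow": lambda: time.sleep(0.05) or True,  # sleeps past the limit
}, time_limit=0.01)
print(results)  # {'fast_ok': 'pass', 'fast_bad': 'fail', 'slow': 'timeout'}
print(summary)  # {'pass': 1, 'fail': 1, 'timeout': 1}
```

A real test would also assert that a timed-out scenario does not block the rest of the batch.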

Scorecard

  • Score calculation matches documented formula
  • Multi-turn aggregation is correct
  • Comparison output (--compare) detects regressions
  • Baseline save/load roundtrip
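The aggregation and baseline bullets can share one roundtrip-style test. A sketch, assuming an unweighted mean as the multi-turn formula and a flat JSON baseline file; both are placeholders for whatever the documented formula and format actually are:

```python
import json
import os
import tempfile

# Assumed aggregation formula: unweighted mean of per-turn scores. The test
# suite should pin this to whatever the docs actually specify.
def multi_turn_score(turn_scores):
    return sum(turn_scores) / len(turn_scores)

def save_baseline(path, scores):
    with open(path, "w") as f:
        json.dump(scores, f)

def load_baseline(path):
    with open(path) as f:
        return json.load(f)

score = multi_turn_score([1.0, 0.5, 0.75])
print(score)  # 0.75

# Baseline save/load should roundtrip exactly for JSON-safe values.
path = os.path.join(tempfile.mkdtemp(), "baseline.json")
save_baseline(path, {"greet": score})
assert load_baseline(path) == {"greet": score}

# A --compare style check: any scenario scoring below baseline is a regression.
baseline = load_baseline(path)
new = {"greet": 0.6}
regressions = [k for k in baseline if new.get(k, 0) < baseline[k]]
print(regressions)  # ['greet']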

Corpus

  • Manifest parsing and validation
  • Missing document handling (SKIPPED_NO_DOCUMENT)
  • Custom corpus directory
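The missing-document case deserves a dedicated test since `SKIPPED_NO_DOCUMENT` is part of the contract. A sketch, assuming a manifest of name/path entries (the manifest shape is a guess; only the status name comes from this issue):

```python
import os
import tempfile

SKIPPED_NO_DOCUMENT = "SKIPPED_NO_DOCUMENT"

# Hypothetical resolver: map each manifest entry to its file path, or mark
# it skipped when the document is absent rather than raising.
def resolve_corpus(manifest, corpus_dir):
    resolved = {}
    for entry in manifest:
        path = os.path.join(corpus_dir, entry["path"])
        resolved[entry["name"]] = (
            path if os.path.exists(path) else SKIPPED_NO_DOCUMENT
        )
    return resolved

corpus_dir = tempfile.mkdtemp()
open(os.path.join(corpus_dir, "a.txt"), "w").close()

manifest = [{"name": "doc_a", "path": "a.txt"},
            {"name": "doc_b", "path": "b.txt"}]
resolved = resolve_corpus(manifest, corpus_dir)
print(resolved["doc_b"])  # SKIPPED_NO_DOCUMENT
```

The key assertion is that a missing document degrades to a skip status instead of aborting the whole run.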

CLI

  • All `gaia eval agent` flags work as documented
  • Error messages are actionable for third-party users
  • Output formats are parseable (JSON, Markdown)
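"Parseable" is easy to make precise: whatever the CLI emits under a JSON flag should round-trip through `json.loads`, and the Markdown table should have a stable header row. A sketch with an illustrative report shape (not GAIA's actual schema):

```python
import json

# Illustrative report payload; the real CLI's schema is an assumption here.
report = {"scenarios": 2, "passed": 1, "failed": 1}

# JSON output must survive a round-trip unchanged.
json_out = json.dumps(report, indent=2)
assert json.loads(json_out) == report

# Markdown output: a simple table whose header row third parties can key on.
md_out = "| metric | value |\n|---|---|\n" + "\n".join(
    f"| {k} | {v} |" for k, v in report.items()
)
print(md_out.splitlines()[0])  # | metric | value |
```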

Acceptance Criteria

  • Public API surface has dedicated test coverage
  • Tests serve as documentation for expected behavior
  • Third-party scenarios can be tested without GAIA internals

Metadata

Assignees

No one assigned

    Labels

    domain:quality (Tests, CI/CD, security, performance, evals), eval (Evaluation framework changes), tests (Test changes), track:platform (Foundation that both consumer-app and oem-pc tracks consume)

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests