Agent Eval: Test coverage for public API surface (runner, scorecard, scenario loading) #673

@kovtcharov

Description

Problem

The eval framework has 95 pytest tests, but it's unclear whether they cover the public API surface that third parties will use. The tests were written during development and may focus on internals rather than the documented interfaces.

What Needs Testing (Third-Party API Surface)

Scenario Loading

  • YAML scenario parsing — all field types validated
  • Invalid scenario YAML — clear error messages
  • Custom scenario directory discovery (--scenario-dir)
  • Scenario filtering by category (--category) and tag (--tag)
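The invalid-YAML bullet is worth pinning down with a concrete shape. Since GAIA's actual loader API isn't shown in this issue, here is a minimal sketch of what such a test could assert, using a hypothetical `validate_scenario` helper over an already-parsed dict (the real loader, field names, and error types are assumptions):

```python
# Hypothetical stand-in for the framework's scenario validator; the field
# names and error type are illustrative, not GAIA's actual API.
REQUIRED_FIELDS = {"name": str, "category": str, "turns": list}

def validate_scenario(data: dict) -> dict:
    """Check required fields and types, raising a clear, actionable error."""
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"scenario missing required field: {field!r}")
        if not isinstance(data[field], expected):
            raise ValueError(
                f"scenario field {field!r} must be {expected.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return data

# A valid scenario passes through unchanged.
ok = validate_scenario({"name": "greet", "category": "smoke", "turns": []})
print(ok["name"])  # greet

# An invalid scenario fails with a message naming the missing field.
try:
    validate_scenario({"name": "greet", "category": "smoke"})
except ValueError as e:
    print(e)  # scenario missing required field: 'turns'
```

The point of the test is the error text: a third-party author should learn which field is wrong without reading GAIA source.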

Runner

  • Single scenario execution end-to-end
  • Batch execution with pass/fail aggregation
  • Timeout handling — scenario exceeds time limit
  • Resume from checkpoint (--resume)
  • Cost tracking accuracy
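Batch aggregation and timeout handling can be tested against small synthetic scenarios. A sketch, assuming a hypothetical `run_batch` interface (scenario callables plus a per-scenario time budget); GAIA's real runner signature is not shown in this issue:

```python
import time

# Hypothetical runner: run each scenario, record pass/fail/timeout, and
# aggregate counts. run_batch and time_limit are illustrative names only.
def run_batch(scenarios, time_limit=1.0):
    results = {}
    for name, fn in scenarios.items():
        start = time.monotonic()
        try:
            passed = fn()
        except Exception:
            passed = False
        elapsed = time.monotonic() - start
        if elapsed > time_limit:
            results[name] = "timeout"  # scenario exceeded its time budget
        else:
            results[name] = "pass" if passed else "fail"
    summary = {s: list(results.values()).count(s)
               for s in ("pass", "fail", "timeout")}
    return results, summary

results, summary = run_batch({
    "fast_ok": lambda: True,
    "fast_bad": lambda: False,
    "slow": lambda: time.sleep(0.05) or True,  # sleeps past the limit
}, time_limit=0.01)
print(results)  # {'fast_ok': 'pass', 'fast_bad': 'fail', 'slow': 'timeout'}
print(summary)  # {'pass': 1, 'fail': 1, 'timeout': 1}
```

A real test would also assert that a timed-out scenario does not block the rest of the batch.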

Scorecard

  • Score calculation matches documented formula
  • Multi-turn aggregation is correct
  • Comparison output (--compare) detects regressions
  • Baseline save/load roundtrip
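The aggregation and baseline bullets can share one roundtrip-style test. A sketch, assuming an unweighted mean as the multi-turn formula and a flat JSON baseline file; both are placeholders for whatever the documented formula and format actually are:

```python
import json
import os
import tempfile

# Assumed aggregation formula: unweighted mean of per-turn scores. The test
# suite should pin this to whatever the docs actually specify.
def multi_turn_score(turn_scores):
    return sum(turn_scores) / len(turn_scores)

def save_baseline(path, scores):
    with open(path, "w") as f:
        json.dump(scores, f)

def load_baseline(path):
    with open(path) as f:
        return json.load(f)

score = multi_turn_score([1.0, 0.5, 0.75])
print(score)  # 0.75

# Baseline save/load should roundtrip exactly for JSON-safe values.
path = os.path.join(tempfile.mkdtemp(), "baseline.json")
save_baseline(path, {"greet": score})
assert load_baseline(path) == {"greet": score}

# A --compare style check: any scenario scoring below baseline is a regression.
baseline = load_baseline(path)
new = {"greet": 0.6}
regressions = [k for k in baseline if new.get(k, 0) < baseline[k]]
print(regressions)  # ['greet']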

Corpus

  • Manifest parsing and validation
  • Missing document handling (SKIPPED_NO_DOCUMENT)
  • Custom corpus directory
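The missing-document case deserves a dedicated test since `SKIPPED_NO_DOCUMENT` is part of the contract. A sketch, assuming a manifest of name/path entries (the manifest shape is a guess; only the status name comes from this issue):

```python
import os
import tempfile

SKIPPED_NO_DOCUMENT = "SKIPPED_NO_DOCUMENT"

# Hypothetical resolver: map each manifest entry to its file path, or mark
# it skipped when the document is absent rather than raising.
def resolve_corpus(manifest, corpus_dir):
    resolved = {}
    for entry in manifest:
        path = os.path.join(corpus_dir, entry["path"])
        resolved[entry["name"]] = (
            path if os.path.exists(path) else SKIPPED_NO_DOCUMENT
        )
    return resolved

corpus_dir = tempfile.mkdtemp()
open(os.path.join(corpus_dir, "a.txt"), "w").close()

manifest = [{"name": "doc_a", "path": "a.txt"},
            {"name": "doc_b", "path": "b.txt"}]
resolved = resolve_corpus(manifest, corpus_dir)
print(resolved["doc_b"])  # SKIPPED_NO_DOCUMENT
```

The key assertion is that a missing document degrades to a skip status instead of aborting the whole run.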

CLI

  • All `gaia eval agent` flags work as documented
  • Error messages are actionable for third-party users
  • Output formats are parseable (JSON, Markdown)
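"Parseable" is easy to make precise: whatever the CLI emits under a JSON flag should round-trip through `json.loads`, and the Markdown table should have a stable header row. A sketch with an illustrative report shape (not GAIA's actual schema):

```python
import json

# Illustrative report payload; the real CLI's schema is an assumption here.
report = {"scenarios": 2, "passed": 1, "failed": 1}

# JSON output must survive a round-trip unchanged.
json_out = json.dumps(report, indent=2)
assert json.loads(json_out) == report

# Markdown output: a simple table whose header row third parties can key on.
md_out = "| metric | value |\n|---|---|\n" + "\n".join(
    f"| {k} | {v} |" for k, v in report.items()
)
print(md_out.splitlines()[0])  # | metric | value |
```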

Acceptance Criteria

  • Public API surface has dedicated test coverage
  • Tests serve as documentation for expected behavior
  • Third-party scenarios can be tested without GAIA internals

Metadata

Assignees

No one assigned

    Labels

    domain:quality (Tests, CI/CD, security, performance, evals), eval (Evaluation framework changes), tests (Test changes), track:platform (Foundation that both consumer-app and oem-pc tracks consume)

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests