Agent Eval: Extensibility — plugin API for custom scenarios, scorers, and document types #671

@kovtcharov

Problem

The eval framework works well for GAIA's built-in scenarios, but third parties need clean extension points so they can add their own use cases without modifying framework internals.

What Third Parties Need

  1. Custom scenario directory — Drop YAML files in a user directory (e.g., ~/.gaia/eval/scenarios/) and they're automatically discovered alongside built-in scenarios
  2. Custom corpus directory — Add documents to ~/.gaia/eval/corpus/ with a local manifest
  3. Custom scoring dimensions — Add domain-specific scoring beyond the 7 built-in dimensions (e.g., "medical_accuracy" for healthcare, "code_correctness" for developer tools)
  4. Custom personas — Define new personas beyond the 5 built-in ones
  5. Scenario tags/filters — Tag scenarios with metadata (use_case, difficulty, provider) for selective runs
  6. Result export formats — JSON (exists), CSV, JUnit XML (for CI), HTML report
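
One possible shape for item 3 (custom scoring dimensions) is a small registry that third-party code populates with a decorator. This is only a sketch: the `Scorer` signature, `register_scorer`, and `score_all` are hypothetical names for illustration and are not part of GAIA's current API.

```python
from typing import Callable, Dict

# Hypothetical scorer signature (an assumption, not GAIA's API):
# (agent_response, expected_answer) -> score between 0.0 and 1.0.
Scorer = Callable[[str, str], float]

# Registry of custom scoring dimensions, keyed by dimension name.
_SCORERS: Dict[str, Scorer] = {}


def register_scorer(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that registers a scoring function under a dimension name."""

    def decorator(fn: Scorer) -> Scorer:
        if name in _SCORERS:
            raise ValueError(f"scorer {name!r} is already registered")
        _SCORERS[name] = fn
        return fn

    return decorator


@register_scorer("code_correctness")
def code_correctness(response: str, expected: str) -> float:
    """Toy scorer: 1.0 if the expected snippet appears in the response."""
    return 1.0 if expected.strip() in response else 0.0


def score_all(response: str, expected: str) -> Dict[str, float]:
    """Run every registered custom dimension against one response."""
    return {name: fn(response, expected) for name, fn in _SCORERS.items()}
```

A decorator-based registry keeps the extension point in user code: a third party would import `register_scorer`, define a function, and never edit framework internals. A real scorer would likely call an LLM judge or a domain-specific checker rather than a substring match.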

Architecture

Built-in scenarios: eval/scenarios/*.yaml         (shipped with GAIA)
User scenarios:     ~/.gaia/eval/scenarios/*.yaml  (auto-discovered)

Built-in corpus:    eval/corpus/                   (shipped with GAIA)
User corpus:        ~/.gaia/eval/corpus/           (auto-discovered)

Built-in prompts:   eval/prompts/                  (shipped with GAIA)
User prompts:       ~/.gaia/eval/prompts/          (overrides built-in)
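
The layout above implies two different merge rules: scenarios from both locations are collected side by side, while a user prompt replaces a built-in prompt. A minimal sketch of both rules follows, assuming that prompts are matched by file name and that only top-level files are scanned. The function names `discover_scenarios` and `resolve_prompts` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def discover_scenarios(dirs: Sequence[Path]) -> List[Path]:
    """Collect *.yaml files from every directory; missing directories are skipped."""
    found: List[Path] = []
    for d in dirs:
        d = Path(d).expanduser()
        if d.is_dir():
            found.extend(sorted(d.glob("*.yaml")))
    return found


def resolve_prompts(builtin_dir: Path, user_dir: Path) -> Dict[str, Path]:
    """Map prompt file name -> path, letting user files override built-in ones.

    Assumption: an override is keyed on an identical file name.
    """
    resolved: Dict[str, Path] = {}
    for d in (builtin_dir, user_dir):  # later directories win
        d = Path(d).expanduser()
        if d.is_dir():
            for p in sorted(d.iterdir()):
                if p.is_file():
                    resolved[p.name] = p
    return resolved
```

Skipping a missing directory instead of raising means a fresh install with no `~/.gaia/eval/` tree would behave exactly like today.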

CLI Extensions

gaia eval agent --scenario-dir ~/my-project/eval/scenarios/
gaia eval agent --corpus-dir ~/my-project/eval/corpus/
gaia eval agent --output-format junit  # For CI integration
gaia eval agent --tag healthcare       # Run only tagged scenarios
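
The proposed flags could be wired up roughly as below with `argparse`. Several details are assumptions rather than part of the proposal: the format names other than `junit`, `json` as the default, `--tag` being repeatable, a scenario matching when it carries any of the requested tags, and scenarios being plain dicts with a `tags` list.

```python
import argparse
from typing import Dict, List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Parser sketch for the proposed `gaia eval agent` flags."""
    parser = argparse.ArgumentParser(prog="gaia eval agent")
    parser.add_argument("--scenario-dir", help="extra directory of scenario YAML files")
    parser.add_argument("--corpus-dir", help="extra directory of corpus documents")
    parser.add_argument(
        "--output-format",
        choices=["json", "csv", "junit", "html"],  # names other than junit are assumed
        default="json",
    )
    # Assumption: --tag may be given more than once.
    parser.add_argument("--tag", action="append", default=[], help="run only tagged scenarios")
    return parser


def filter_by_tags(scenarios: List[Dict], tags: Sequence[str]) -> List[Dict]:
    """Keep scenarios that carry at least one requested tag; no tags means keep all."""
    if not tags:
        return list(scenarios)
    wanted = set(tags)
    return [s for s in scenarios if wanted & set(s.get("tags", []))]
```

For example, `build_parser().parse_args(["--tag", "healthcare"])` yields `tag == ["healthcare"]`, which can be passed straight to `filter_by_tags`.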

Acceptance Criteria

  • User scenarios in ~/.gaia/eval/scenarios/ are auto-discovered
  • User corpus documents in ~/.gaia/eval/corpus/ work with a local manifest
  • --scenario-dir and --corpus-dir CLI flags work
  • --output-format junit produces JUnit XML output for CI integration
  • Scenarios can be tagged, and --tag runs only the scenarios carrying that tag
  • A third party can add 10 custom scenarios without touching GAIA source code
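
For the JUnit XML criterion, the export could look roughly like the sketch below, which uses only the standard library. `ScenarioResult` and `to_junit_xml` are hypothetical names, and the mapping of one scenario to one `<testcase>` is an assumption. The output uses the common `testsuite`/`testcase`/`failure` layout that most CI systems accept.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario outcome; not GAIA's actual result type."""
    name: str
    passed: bool
    message: str = ""


def to_junit_xml(results: List[ScenarioResult], suite_name: str = "gaia-eval") -> str:
    """Serialize results as a single JUnit-style <testsuite> element."""
    failures = sum(1 for r in results if not r.passed)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", name=r.name, classname=suite_name)
        if not r.passed:
            # A failed scenario becomes a <failure> child carrying the reason.
            failure = ET.SubElement(case, "failure", message=r.message)
            failure.text = r.message
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a file that the CI job publishes as a test report would make failed scenarios show up as failed tests.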

Labels

  • domain:quality (Tests, CI/CD, security, performance, evals)
  • enhancement (New feature or request)
  • eval (Evaluation framework changes)
  • track:platform (Foundation that both consumer-app and oem-pc tracks consume)
