feat(eval): retrieval-quality eval harness for code_index #868

@itomek

Description

Context

The reviewer on #721 asked for testing evidence on large repos and for retrieval-quality measurements. That PR ships unit tests covering parser correctness and SDK behaviour, but no end-to-end retrieval-quality eval.

Goal

A gaia eval code-index subcommand that runs a curated query set against a reference repository (e.g. the gaia repo itself), measures recall@k and MRR, and outputs a report consumable by gaia report.
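For clarity on what the harness would measure, here is a minimal sketch of the two metrics. The helper names are illustrative, not part of the existing framework in src/gaia/eval/:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the known-correct answers that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant)

def mean_reciprocal_rank(results):
    """results: list of (retrieved, relevant) pairs, one per query.
    Each query contributes 1/rank of its first correct hit, or 0 if none."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        rel = set(relevant)
        for rank, item in enumerate(retrieved, start=1):
            if item in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```

Recall@k captures coverage of the answer set at a fixed cutoff; MRR captures how high the first correct result ranks, which matters for agent workflows that only look at the top hit.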

Acceptance

  • New eval subcommand wired into src/gaia/eval/ alongside existing fix-code / agent evals.
  • Curated query set with known-correct answers committed under tests/fixtures/.
  • Baseline numbers committed to docs/ so regressions are visible.
  • Report renders via gaia report.
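One possible shape for the curated query set under tests/fixtures/ (the field names here are an illustrative sketch, not an agreed schema; the expected path is taken from the references below):

```json
[
  {
    "id": "sdk-entry-point",
    "query": "where is the code_index SDK implemented",
    "expected": ["src/gaia/code_index/sdk.py"],
    "k": 5
  }
]
```

Committing expected paths alongside each query keeps the fixture self-describing, and a per-query k lets individual queries override the harness default cutoff.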

References

  • src/gaia/eval/ — existing eval framework
  • docs/reference/eval.mdx — framework docs
  • src/gaia/code_index/sdk.py — SDK under test

Deferred from #721.

Metadata


Labels

  • code-agent - Code agent changes
  • domain:quality - Tests, CI/CD, security, performance, evals
  • enhancement - New feature or request
  • eval - Evaluation framework changes
  • track:platform - Foundation that both consumer-app and oem-pc tracks consume
