Clarify supported rubric input path for rubric_based_* metrics #132

@erauner12


Hi! I’m trying to understand the intended public flow for rubric-based metrics such as rubric_based_final_response_quality_v1 and rubric_based_tool_use_quality_v1.

I realize these appear to sit on top of experimental ADK evaluator APIs. When running the final-response rubric evaluator through an internal/repo-owned helper path, I see the expected ADK experimental warnings, for example:

[EXPERIMENTAL] RubricBasedFinalResponseQualityV1Evaluator
[EXPERIMENTAL] RubricBasedEvaluator
[EXPERIMENTAL] LlmAsJudge

In that controlled path, I can construct the metric with build_eval_metric(..., rubrics=[...]) and get rubric-based scoring from rubric_based_final_response_quality_v1. For example, a small calibration run with four reviewed cases produced the expected pass/fail outcomes, and an advisory positive trace scored successfully with score: 1.0.

So my question is less “is this broken?” and more: what is the intended public surface for this capability?

From current main, /api/metrics exposes these metrics and marks them as requiring rubrics. I also see rubrics documented on eval-set cases/invocations, and the internal builder accepts rubrics. What I could not find is the supported API/CLI/MCP/config path for supplying those rubrics when running the metrics.

This looks like it may simply be a gap in the public surface rather than a disagreement in direction: the metric metadata, eval-set docs, and internal RubricsBasedCriterion construction are already present, while the runner/API/config path does not yet appear to pass rubrics through. If that is the right read, I would be interested in helping fill the gap, but wanted to ask for the preferred design before opening a PR.

Questions:

  • Are rubric-based metrics intended to consume rubrics from eval-set case/invocation fields?
  • Is a request/config-level rubric field planned for API/CLI/MCP runs?
  • Would you prefer config-level rubrics, eval-set rubrics, or both?
  • Are these metrics intentionally marked working=false until that public surface is decided?
  • Should users treat build_eval_metric(..., rubrics=...) as internal only for now?
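To make the "both" option in the questions above concrete, here is a minimal sketch of one possible precedence rule. It is only an illustration: `Rubric` and `resolve_rubrics` are hypothetical names and are not part of agentevals or ADK, and the "case-level overrides config-level on matching IDs" rule is an assumption, not a proposal the maintainers have endorsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric shape: a stable ID plus the criterion text."""
    rubric_id: str
    text: str


def resolve_rubrics(config_rubrics, case_rubrics):
    """Merge run-level and eval-set case-level rubrics.

    Assumed rule: a case-level rubric replaces a config-level rubric
    with the same ID; all other rubrics from both sources are kept.
    """
    merged = {r.rubric_id: r for r in config_rubrics}
    merged.update({r.rubric_id: r for r in case_rubrics})
    return list(merged.values())
```

With a rule like this, a run-level config could carry default rubrics while individual eval-set cases refine or extend them.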

Relevant code/docs I checked:

  • src/agentevals/api/routes.py
  • src/agentevals/builtin_metrics.py
  • src/agentevals/config.py
  • src/agentevals/eval_config_loader.py
  • src/agentevals/cli.py
  • src/agentevals/mcp_server.py
  • docs/eval-set-format.md

The flow I’m hoping to support eventually is:

  • define rubric criteria with IDs/text
  • run rubric_based_final_response_quality_v1 with a configured judge model
  • get overall and ideally per-rubric scoring
  • do this through a supported API/CLI/MCP/config path rather than reaching into internals
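As a rough illustration of the output shape this flow implies, the sketch below scores one response against a list of rubrics and returns both an overall and a per-rubric score. Everything in it is hypothetical: `Rubric`, `score_response`, and `toy_judge` are illustrative names, not agentevals or ADK APIs, and the keyword-matching judge is a stand-in for the configured LLM judge model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric shape: a stable ID plus the criterion text."""
    rubric_id: str
    text: str


def score_response(
    response: str,
    rubrics: List[Rubric],
    judge: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Score a response per rubric and average into an overall score."""
    per_rubric = {
        r.rubric_id: 1.0 if judge(response, r.text) else 0.0 for r in rubrics
    }
    # Assumption: the overall score is the unweighted mean of rubric scores.
    overall = sum(per_rubric.values()) / len(per_rubric) if per_rubric else 0.0
    return {"overall": overall, "per_rubric": per_rubric}


def toy_judge(response: str, rubric_text: str) -> bool:
    """Stand-in for an LLM judge: pass if every rubric word is in the response."""
    lowered = response.lower()
    return all(word in lowered for word in rubric_text.lower().split())
```

A supported path would presumably swap `toy_judge` for the real judge model and read the rubrics from whichever API/CLI/MCP/config field is chosen.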

I asked a similar question in Discord and wanted to open a tracking issue for the intended public API/config direction.

Labels: enhancement
