[P0] evaluation: Pydantic models — EvalConfig, EvalCase, EvalBaseline, DriftReport #897

@Cataldir

Business context

All evaluation components (dataset loader, evaluators, drift detector, CI gate) need shared type-safe models. This is the foundational schema that all other evaluation engine tasks depend on.

User outcome: Consistent, validated data structures across the entire evaluation pipeline.
Business metric: Zero schema-mismatch bugs during integration of evaluation components.

Scope

  • In scope: holiday_peak_lib.evaluation.models module with EvalConfig, EvalCase, EvalBaseline, DriftReport, EvaluationDriftSignal Pydantic models
  • Out of scope: Dataset loading logic; evaluation execution logic; Foundry SDK interaction

Current behavior evidence

  • lib/src/holiday_peak_lib/evaluation/eval_runner.py — existing EvaluationRunResult dataclass (frozen) that must remain backward-compatible
  • lib/src/holiday_peak_lib/self_healing/models.py — FailureSignal interface that EvaluationDriftSignal must be compatible with
  • lib/src/holiday_peak_lib/evaluation/__init__.py — current exports that must be preserved
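
One way to keep the FailureSignal compatibility requirement testable is a field-level check in the unit tests. The sketch below is only illustrative: the issue does not list FailureSignal's fields, so `FailureSignalStandIn`, its `source`, `severity`, and `message` fields, and the `missing_fields` helper are all hypothetical and would be replaced by the real interface from self_healing/models.py.

```python
from dataclasses import dataclass, fields as dataclass_fields

from pydantic import BaseModel, ConfigDict, Field


@dataclass(frozen=True)
class FailureSignalStandIn:
    """Hypothetical stand-in; NOT the real FailureSignal definition."""

    source: str
    severity: str
    message: str


class EvaluationDriftSignal(BaseModel):
    """Sketch of a drift signal carrying the stand-in's fields plus extras."""

    model_config = ConfigDict(frozen=True)

    source: str
    severity: str
    message: str
    # Assumed extra payload specific to evaluation drift.
    drift_metrics: dict[str, float] = Field(default_factory=dict)


def missing_fields(model_cls: type[BaseModel], interface_cls: type) -> set[str]:
    """Return interface field names that the Pydantic model does not declare."""
    required = {f.name for f in dataclass_fields(interface_cls)}
    return required - set(model_cls.model_fields)
```

An empty result from `missing_fields(EvaluationDriftSignal, FailureSignalStandIn)` means every field of the interface is present on the model. It does not check types or behavior, so it complements rather than replaces tests against the actual self-healing consumers.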

Acceptance criteria

  • Models module at lib/src/holiday_peak_lib/evaluation/models.py
  • All models use Pydantic v2 with model_config = ConfigDict(frozen=True)
  • EvalConfig includes: agent_name: str, evaluators: list[str], dataset_path: str, baseline_path: str | None, thresholds: dict[str, float], model_targets: list[str]
  • EvalCase includes: query: str, expected_behavior: str, expected_model_tier: str | None, context: str | None, ground_truth: str | None
  • EvalBaseline stores historical metric snapshots with timestamps
  • DriftReport includes: breached_thresholds: list[str], severity: Literal["warning", "critical"], drift_metrics: dict[str, float]
  • EvaluationDriftSignal compatible with self-healing FailureSignal interface
  • Backward-compatible with existing EvaluationRunResult dataclass
  • 100% unit test coverage in lib/tests/test_evaluation_models.py
  • evaluation/__init__.py exports updated

Risks and dependencies

  • Risk: Later schema changes require migration. Mitigation: freeze models early and iterate via versioned, additive fields
  • Dependency: E-01 (ADR-038 accepted — provides schema specification)
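
The "versioned additive fields" mitigation can be illustrated with a short sketch, assuming Pydantic v2. The idea is that new fields always ship with defaults, so payloads written against the older schema still validate. `EvalCaseSketch`, `schema_version`, and `tags` are hypothetical names used only for illustration and are not part of the ADR-038 specification.

```python
from pydantic import BaseModel, ConfigDict


class EvalCaseSketch(BaseModel):
    """Illustrative model showing an additive, defaulted schema change."""

    model_config = ConfigDict(frozen=True)

    # Fields present since the first version of the schema.
    query: str
    expected_behavior: str

    # Hypothetical additive fields: both carry defaults, so older
    # payloads that omit them remain valid.
    schema_version: int = 1
    tags: tuple[str, ...] = ()


# A payload produced before the additive fields existed.
old_payload = {"query": "Where is my order?", "expected_behavior": "Looks up order status"}
old_case = EvalCaseSketch.model_validate(old_payload)

# A payload produced after the change.
new_payload = {**old_payload, "schema_version": 2, "tags": ["orders"]}
new_case = EvalCaseSketch.model_validate(new_payload)
```

Removing or renaming a field, or adding a required one, would break this property and is the kind of change that needs an explicit migration.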

ADR impact

  • ADR-038 (schema section)
  • No impact on other ADRs

BPMN process

%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#FFB3BA',
  'primaryTextColor':'#000',
  'primaryBorderColor':'#FF8B94',
  'lineColor':'#BAE1FF',
  'secondaryColor':'#BAE1FF',
  'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
  A[Review ADR-038 Schema Spec] --> B[Define Pydantic Models]
  B --> C[Verify FailureSignal Compat]
  C --> D[Write Unit Tests]
  D --> E[Update __init__ Exports]
  E --> F[PR Validation]
  F --> G[Merge and Monitor]
Metadata

    Labels

    area:evaluation (Evaluation engine components), area:lib (Shared library work), enhancement (New feature or request), evaluation (Agent evaluation engine), python (Python code quality)
