Business context
All evaluation components (dataset loader, evaluators, drift detector, CI gate) need shared type-safe models. This is the foundational schema that all other evaluation engine tasks depend on.
User outcome: Consistent, validated data structures across the entire evaluation pipeline.
Business metric: Zero schema-mismatch bugs during integration of evaluation components.
Scope
In scope: holiday_peak_lib.evaluation.models module with EvalConfig, EvalCase, EvalBaseline, DriftReport, EvaluationDriftSignal Pydantic models
Out of scope: Dataset loading logic; evaluation execution logic; Foundry SDK interaction
Current behavior evidence
lib/src/holiday_peak_lib/evaluation/eval_runner.py — existing EvaluationRunResult dataclass (frozen) that must remain backward-compatible
lib/src/holiday_peak_lib/self_healing/models.py — FailureSignal interface that EvaluationDriftSignal must be compatible with
lib/src/holiday_peak_lib/evaluation/__init__.py — current exports that must be preserved
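One way to guard the backward-compatibility constraint on the frozen dataclass is a test that pins its published field names and confirms it stays immutable. The sketch below uses a stand-in class: the fields `run_id` and `passed`, the `EXPECTED_FIELDS` snapshot, and both helper functions are hypothetical, since the real `EvaluationRunResult` definition is not shown in this issue.

```python
from dataclasses import FrozenInstanceError, dataclass, fields


# Stand-in for the existing dataclass. The field names are hypothetical;
# substitute the real ones from eval_runner.py.
@dataclass(frozen=True)
class EvaluationRunResult:
    run_id: str
    passed: bool


# Snapshot of the field names that existing callers rely on (hypothetical).
EXPECTED_FIELDS = ("run_id", "passed")


def check_backward_compatible(cls: type) -> bool:
    """Return True if the published fields are still present, in the same order."""
    current = tuple(f.name for f in fields(cls))
    # New fields may be appended, but the original prefix must be unchanged.
    return current[: len(EXPECTED_FIELDS)] == EXPECTED_FIELDS


def is_frozen(instance: object, attr: str) -> bool:
    """Return True if assigning to `attr` raises FrozenInstanceError."""
    try:
        setattr(instance, attr, getattr(instance, attr))
    except FrozenInstanceError:
        return True
    return False
```

A check like this would fail if a field were renamed, removed, or reordered, or if `frozen=True` were dropped. It still allows new fields to be appended.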
Acceptance criteria
Models module at lib/src/holiday_peak_lib/evaluation/models.py
All models use Pydantic v2 with model_config = ConfigDict(frozen=True)
EvalConfig includes: agent_name: str, evaluators: list[str], dataset_path: str, baseline_path: str | None, thresholds: dict[str, float], model_targets: list[str]
EvalCase includes: query: str, expected_behavior: str, expected_model_tier: str | None, context: str | None, ground_truth: str | None
EvalBaseline stores historical metric snapshots with timestamps
DriftReport includes: breached_thresholds: list[str], severity: Literal["warning", "critical"], drift_metrics: dict[str, float]
EvaluationDriftSignal compatible with self-healing FailureSignal interface
Existing EvaluationRunResult dataclass remains backward-compatible
Unit tests in lib/tests/test_evaluation_models.py
evaluation/__init__.py exports updated
Risks and dependencies
ADR impact
BPMN process
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#FFB3BA', 'primaryTextColor':'#000', 'primaryBorderColor':'#FF8B94', 'lineColor':'#BAE1FF', 'secondaryColor':'#BAE1FF', 'tertiaryColor':'#FFFFFF' }}}%%
flowchart LR
    A[Review ADR-038 Schema Spec] --> B[Define Pydantic Models]
    B --> C[Verify FailureSignal Compat]
    C --> D[Write Unit Tests]
    D --> E[Update __init__ Exports]
    E --> F[PR Validation]
    F --> G[Merge and Monitor]