[P0] evaluation: Pydantic models for EvalConfig, EvalCase, EvalBaseline, DriftReport (#897)

## Business context

All evaluation components (dataset loader, evaluators, drift detector, CI gate) need shared type-safe models. This is the foundational schema that all other evaluation engine tasks depend on.

**User outcome:** Consistent, validated data structures across the entire evaluation pipeline.
**Business metric:** Zero schema-mismatch bugs during integration of evaluation components.

## Scope

- In scope: `holiday_peak_lib.evaluation.models` module with `EvalConfig`, `EvalCase`, `EvalBaseline`, `DriftReport`, `EvaluationDriftSignal` Pydantic models
- Out of scope: Dataset loading logic; evaluation execution logic; Foundry SDK interaction

## Current behavior evidence

- `lib/src/holiday_peak_lib/evaluation/eval_runner.py` — existing `EvaluationRunResult` dataclass (frozen) that must remain backward-compatible
- `lib/src/holiday_peak_lib/self_healing/models.py` — `FailureSignal` interface that `EvaluationDriftSignal` must be compatible with
- `lib/src/holiday_peak_lib/evaluation/__init__.py` — current exports that must be preserved
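
One way to keep a frozen dataclass and a new Pydantic model interoperable is to validate the dataclass contents into the model without touching the dataclass itself. The sketch below is illustrative only: `LegacyRunResult` and `RunResultView` are hypothetical stand-ins, and their fields are assumptions, because the real `EvaluationRunResult` fields are not listed in this issue.

```python
from dataclasses import asdict, dataclass

from pydantic import BaseModel, ConfigDict


@dataclass(frozen=True)
class LegacyRunResult:
    """Hypothetical stand-in for an existing frozen dataclass (fields assumed)."""

    agent_name: str
    metrics: dict[str, float]


class RunResultView(BaseModel):
    """Hypothetical Pydantic mirror of the stand-in dataclass."""

    model_config = ConfigDict(frozen=True)

    agent_name: str
    metrics: dict[str, float]


# The dataclass is left unchanged; the model validates a dict copy of it.
legacy = LegacyRunResult(agent_name="catalog-agent", metrics={"relevance": 0.91})
view = RunResultView.model_validate(asdict(legacy))
```

Because the conversion goes through `asdict`, existing callers of the dataclass see no change, and a type mismatch surfaces as a `ValidationError` at the boundary.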

## Acceptance criteria

- [ ] Models module at `lib/src/holiday_peak_lib/evaluation/models.py`
- [ ] All models use Pydantic v2 with `model_config = ConfigDict(frozen=True)`
- [ ] `EvalConfig` includes: `agent_name: str`, `evaluators: list[str]`, `dataset_path: str`, `baseline_path: str | None`, `thresholds: dict[str, float]`, `model_targets: list[str]`
- [ ] `EvalCase` includes: `query: str`, `expected_behavior: str`, `expected_model_tier: str | None`, `context: str | None`, `ground_truth: str | None`
- [ ] `EvalBaseline` stores historical metric snapshots with timestamps
- [ ] `DriftReport` includes: `breached_thresholds: list[str]`, `severity: Literal["warning", "critical"]`, `drift_metrics: dict[str, float]`
- [ ] `EvaluationDriftSignal` compatible with self-healing `FailureSignal` interface
- [ ] Backward-compatible with existing `EvaluationRunResult` dataclass
- [ ] 100% unit test coverage in `lib/tests/test_evaluation_models.py`
- [ ] `evaluation/__init__.py` exports updated

## Risks and dependencies

- Risk: Later schema changes would require data migration. Mitigation: freeze the models early and evolve them only through versioned, additive fields.
- Dependency: E-01 (ADR-038 accepted — provides schema specification)

## ADR impact

- ADR-038 (schema section)
- No impact on other ADRs

## BPMN process

```mermaid
%%{init: {'theme':'base', 'themeVariables': {
  'primaryColor':'#FFB3BA',
  'primaryTextColor':'#000',
  'primaryBorderColor':'#FF8B94',
  'lineColor':'#BAE1FF',
  'secondaryColor':'#BAE1FF',
  'tertiaryColor':'#FFFFFF'
}}}%%
flowchart LR
  A[Review ADR-038 Schema Spec] --> B[Define Pydantic Models]
  B --> C[Verify FailureSignal Compat]
  C --> D[Write Unit Tests]
  D --> E[Update __init__ Exports]
  E --> F[PR Validation]
  F --> G[Merge and Monitor]
```

