feat(meta-eval): judge sentinel — drift, decay, and silent-upgrade alarms#239

Merged: drewstone merged 1 commit into main from feat/judge-sentinel on Jun 10, 2026

@drewstone

Contributor

What

Eval trustworthiness as a continuously measured, alarmed trend. Judges are models; models change underneath us; calibration decays. This composes the existing dormant instruments into one snapshot → series → alarm loop on the ./meta-eval subpath — no instrument re-implemented, no version bump.

Surface (src/meta-eval/sentinel.ts, exported via ./meta-eval)

  • SentinelSnapshot { at, judgeId, judgeModel, metrics: { irr?, calibrationKappa?, sentinelPassRate? } } — caller-supplied ISO timestamps, module is clock-free. Built by adapters from real instrument outputs:
    • snapshotFromCalibration ← calibrateJudge / calibrateJudgeContinuous (prefers the un-rounded continuous κ_w when present)
    • snapshotFromAgreement ← corpusInterRaterAgreement (statistics.ts) or continuousAgreement (IRR = ICC(2,1))
    • snapshotFromSentinelSet ← a rerun golden set; tolerance-gated pass rate, zero joins = loud error not a 0%
  • SentinelStore: inMemorySentinelStore() + fileSentinelStore(path) (JSONL; a corrupt or shape-invalid line fails loud with file:line; unknown metric keys are rejected so a typo can't silently never trend)
  • judgeSentinelReport(history, { asOf, thresholds }) — per-judge×metric trends through the real analyzeSeries state machine (stabilized / drifting-up / drifting-down / noisy / insufficient-data, verbatim). Alarms:
    • convergence state drifting-down
    • irr below minIrr (default 0.6) — applies even to thin series
    • calibrationKappa / sentinelPassRate drop vs series baseline beyond maxKappaDrop 0.15 / maxSentinelDrop 0.1 (catches the cliff the trend machine calls noisy)
    • silent-upgrade trap (first-class): judgeModel changed with no post-change golden-grounded snapshot. irr alone does NOT clear it — judges agreeing with each other after a model swap proves nothing; only re-measuring against gold does. Empty-metrics snapshots are documented model-change markers.
    • newest snapshot older than staleAfterDays (default 30) vs asOf
    • insufficientHistory names judgeId:metric blind spots instead of hiding them
  • evalHealthStamp(report) → { healthy, alarms }: the tiny object a campaign attaches to its verdicts (post-campaign hook / nightly). An alarmed sentinel means downstream verdicts carry an integrity warning until recalibration clears it.
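
For readers who want the alarm rules in concrete form, here is a minimal sketch of three of them (the irr floor, the drop vs baseline, and the silent-upgrade trap) for a single judge's history. It is not the code in src/meta-eval/sentinel.ts: the `Snapshot` and `Thresholds` shapes, the function name `sketchAlarms`, and the alarm strings are hypothetical, and "series baseline" is assumed here to mean the first value in the series. Only the default thresholds (0.6, 0.15, 0.1) come from the description above.

```typescript
// Hypothetical shapes for illustration; the real types live in src/meta-eval/sentinel.ts.
type MetricKey = "irr" | "calibrationKappa" | "sentinelPassRate";

interface Snapshot {
  at: string; // caller-supplied ISO timestamp
  judgeId: string;
  judgeModel: string;
  metrics: Partial<Record<MetricKey, number>>;
}

interface Thresholds {
  minIrr: number;
  maxKappaDrop: number;
  maxSentinelDrop: number;
}

// Defaults quoted in the PR description.
const DEFAULTS: Thresholds = { minIrr: 0.6, maxKappaDrop: 0.15, maxSentinelDrop: 0.1 };

// Only metrics measured against gold can clear a model change; irr cannot.
const GOLD_GROUNDED: MetricKey[] = ["calibrationKappa", "sentinelPassRate"];

function sketchAlarms(history: Snapshot[], t: Thresholds = DEFAULTS): string[] {
  const alarms: string[] = [];
  const sorted = [...history].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  const series = (key: MetricKey): number[] =>
    sorted.map((s) => s.metrics[key]).filter((v): v is number => v !== undefined);

  // Floor check: applies even when the series has a single point.
  const irr = series("irr");
  const lastIrr = irr[irr.length - 1];
  if (lastIrr !== undefined && lastIrr < t.minIrr) alarms.push("irr-below-floor");

  // Drop check: latest value vs the assumed baseline (first value in the series).
  const drops: Array<[MetricKey, number]> = [
    ["calibrationKappa", t.maxKappaDrop],
    ["sentinelPassRate", t.maxSentinelDrop],
  ];
  for (const [key, maxDrop] of drops) {
    const values = series(key);
    const baseline = values[0];
    const latest = values[values.length - 1];
    if (values.length >= 2 && baseline !== undefined && latest !== undefined
        && baseline - latest > maxDrop) {
      alarms.push(`${key}-drop`);
    }
  }

  // Silent-upgrade trap: locate the most recent judgeModel change, then require
  // at least one gold-grounded metric at or after it.
  let changeIdx = -1;
  for (let i = 1; i < sorted.length; i++) {
    const prev = sorted[i - 1];
    const cur = sorted[i];
    if (prev && cur && prev.judgeModel !== cur.judgeModel) changeIdx = i;
  }
  if (changeIdx >= 0) {
    const regrounded = sorted
      .slice(changeIdx)
      .some((s) => GOLD_GROUNDED.some((k) => s.metrics[k] !== undefined));
    if (!regrounded) alarms.push("silent-upgrade");
  }
  return alarms;
}

// A model swap followed only by an empty marker and an irr reading stays alarmed.
const swapped: Snapshot[] = [
  { at: "2026-05-01T00:00:00Z", judgeId: "j1", judgeModel: "model-a", metrics: { calibrationKappa: 0.8, irr: 0.7 } },
  { at: "2026-05-08T00:00:00Z", judgeId: "j1", judgeModel: "model-b", metrics: {} },
  { at: "2026-05-15T00:00:00Z", judgeId: "j1", judgeModel: "model-b", metrics: { irr: 0.75 } },
];
console.log(sketchAlarms(swapped)); // [ 'silent-upgrade' ]
```

Appending a post-change snapshot that carries calibrationKappa or sentinelPassRate clears the silent-upgrade alarm in this sketch, which mirrors the rule that only re-measuring against gold counts.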

Reuse over rebuild

  • Trend engine: analyzeSeries (src/series-convergence) — real state machine, real vocabulary, no fork
  • Calibration: CalibrationResult / ContinuousCalibrationResult from src/judge-calibration
  • Agreement: CorpusAgreementReport (src/statistics) + ContinuousAgreement
  • Persistence discipline mirrors fileExperimentStore (typed loud errors, ENOENT → empty)
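
The "fails loud with file:line" discipline can be illustrated with a small parser over JSONL text. This is a sketch under assumptions, not the fileSentinelStore implementation: `parseSentinelJsonl`, `SnapshotLine`, and the error wording are hypothetical, and the file-reading layer (where a missing file, ENOENT, maps to an empty history) is deliberately left out.

```typescript
// Hypothetical line shape; metric keys are the three named in the PR description.
interface SnapshotLine {
  at: string;
  judgeId: string;
  judgeModel: string;
  metrics: Record<string, number>;
}

const KNOWN_METRICS = ["irr", "calibrationKappa", "sentinelPassRate"];

function parseSentinelJsonl(text: string, file: string): SnapshotLine[] {
  const out: SnapshotLine[] = [];
  text.split("\n").forEach((raw, i) => {
    if (raw.trim() === "") return; // tolerate blank lines and a trailing newline
    const where = `${file}:${i + 1}`; // 1-based line number for the error message

    let parsed: unknown;
    try {
      parsed = JSON.parse(raw);
    } catch {
      throw new Error(`${where}: corrupt JSON line`);
    }
    if (typeof parsed !== "object" || parsed === null) {
      throw new Error(`${where}: expected a JSON object`);
    }

    const rec = parsed as Record<string, unknown>;
    for (const field of ["at", "judgeId", "judgeModel"]) {
      if (typeof rec[field] !== "string") {
        throw new Error(`${where}: missing string field "${field}"`);
      }
    }

    const metrics = rec["metrics"];
    if (typeof metrics !== "object" || metrics === null || Array.isArray(metrics)) {
      throw new Error(`${where}: "metrics" must be an object`);
    }
    const metricMap = metrics as Record<string, unknown>;
    for (const key of Object.keys(metricMap)) {
      // Reject unknown keys so a typo such as "iir" cannot silently never trend.
      if (KNOWN_METRICS.indexOf(key) === -1) {
        throw new Error(`${where}: unknown metric key "${key}"`);
      }
      const value = metricMap[key];
      if (typeof value !== "number" || !isFinite(value)) {
        throw new Error(`${where}: metric "${key}" must be a finite number`);
      }
    }
    out.push(parsed as SnapshotLine);
  });
  return out;
}
```

An empty `metrics` object passes validation in this sketch, since empty-metrics snapshots are documented as model-change markers.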

One spec note: the prompt pointed at calibration.ts calibrationFromPairs for calibrationKappa — that instrument outputs ECE, not κ. The real κ-vs-gold lives in judge-calibration.ts (calibrateJudge*), so the adapter grounds there; calibrationFromPairs' ECE report stays available for its own purpose.
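
To make the ECE-vs-κ distinction concrete, the sketch below computes both on a toy example using textbook definitions: unweighted Cohen's κ (the PR uses a weighted κ_w; unweighted is chosen here only for brevity) and a binned expected calibration error. The function names are hypothetical and neither function is the repo's calibrateJudge or calibrationFromPairs. The toy judge agrees with gold exactly at chance level (κ = 0) while reporting perfectly calibrated 50% confidence (ECE = 0), so a good ECE says nothing about agreement with gold.

```typescript
// Unweighted Cohen's kappa between judge labels and gold labels.
function cohenKappa(judge: number[], gold: number[]): number {
  if (judge.length !== gold.length || judge.length === 0) {
    throw new Error("need equal-length, non-empty label arrays");
  }
  const n = judge.length;
  let agree = 0;
  judge.forEach((label, i) => {
    if (label === gold[i]) agree += 1;
  });
  const po = agree / n; // observed agreement

  // Chance agreement: sum over labels of P(judge = label) * P(gold = label).
  let pe = 0;
  const labels = judge.concat(gold).filter((x, i, all) => all.indexOf(x) === i);
  for (const label of labels) {
    const pj = judge.filter((x) => x === label).length / n;
    const pg = gold.filter((x) => x === label).length / n;
    pe += pj * pg;
  }
  if (pe === 1) return 1; // both raters used one identical label; treated as full agreement here
  return (po - pe) / (1 - pe);
}

// Binned expected calibration error: weighted mean of |accuracy - mean confidence| per bin.
function expectedCalibrationError(confidence: number[], correct: boolean[], bins = 10): number {
  if (confidence.length !== correct.length || confidence.length === 0) {
    throw new Error("need equal-length, non-empty arrays");
  }
  const count: number[] = [];
  const sumConf: number[] = [];
  const hits: number[] = [];
  for (let b = 0; b < bins; b++) {
    count.push(0);
    sumConf.push(0);
    hits.push(0);
  }
  confidence.forEach((c, i) => {
    const b = Math.min(bins - 1, Math.floor(c * bins));
    count[b] = (count[b] ?? 0) + 1;
    sumConf[b] = (sumConf[b] ?? 0) + c;
    hits[b] = (hits[b] ?? 0) + (correct[i] ? 1 : 0);
  });
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const nb = count[b] ?? 0;
    if (nb === 0) continue;
    const acc = (hits[b] ?? 0) / nb;
    const conf = (sumConf[b] ?? 0) / nb;
    ece += (nb / confidence.length) * Math.abs(acc - conf);
  }
  return ece;
}

const judgeLabels = [1, 1, 0, 0];
const goldLabels = [1, 0, 0, 1];
const isCorrect = judgeLabels.map((label, i) => label === goldLabels[i]);
console.log(cohenKappa(judgeLabels, goldLabels)); // 0
console.log(expectedCalibrationError([0.5, 0.5, 0.5, 0.5], isCorrect)); // 0
```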

Tests (27, all deterministic — synthetic histories, fixed timestamps, no LLM calls)

decay → drifting-down alarm; κ-cliff drop alarm; model-change-without-recalibration alarms (+ does-not-alarm when recalibration follows, + irr-does-not-clear-it); healthy stays healthy; staleness vs asOf both ways; insufficientHistory named; file store roundtrip across instances + corrupt-line and shape-invalid-line loud failures; series-convergence integration asserts row state === analyzeSeries(values).state (the real exported machine) including options forwarding; adapters exercised against the real instruments (calibrateJudge, continuousAgreement, corpusInterRaterAgreement with bootstrap: 0).

Gates: pnpm typecheck ✓ · pnpm test 208 files / 2009 passed ✓ · pnpm build ✓ (dist subpath exports verified). No version bump; no edits to src/judges.ts / src/judge-ensemble.ts (in-flight spine PR); no root src/index.ts changes (meta-eval is subpath-only by design).

…arms

Composes the existing instruments (calibrateJudge/calibrateJudgeContinuous,
corpusInterRaterAgreement/continuousAgreement, analyzeSeries) into one
snapshot → series → alarm loop for judge trustworthiness:

- SentinelSnapshot + adapters from real instrument outputs
  (snapshotFromCalibration, snapshotFromAgreement, snapshotFromSentinelSet)
- SentinelStore with inMemorySentinelStore + fileSentinelStore (JSONL,
  corrupt or shape-invalid line fails loud with file:line)
- judgeSentinelReport: per-judge×metric trends through the real
  series-convergence state machine, with alarms for drifting-down,
  irr-below-floor, kappa/pass-rate drops vs baseline, staleness vs asOf,
  and the silent-upgrade trap (judge model changed with no post-change
  golden-grounded snapshot — irr alone does not clear it)
- evalHealthStamp: the tiny { healthy, alarms } object a campaign
  attaches to its verdicts post-campaign or nightly

Module is clock-free (all timestamps caller-supplied ISO) and exported
via the ./meta-eval subpath only.
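
Because the module is clock-free, staleness is a pure function of two caller-supplied timestamps, which is what makes the tests deterministic. A minimal sketch of that idea follows, together with a stamp of the `{ healthy, alarms }` shape. The names `isStale` and `stampFromAlarms` are hypothetical, and two details are assumptions rather than facts from this PR: the comparison is strict ("older than" means more than staleAfterDays), and `healthy` simply means no alarms.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Staleness relative to a caller-supplied asOf; no Date.now() anywhere.
function isStale(newestAt: string, asOf: string, staleAfterDays = 30): boolean {
  const newest = Date.parse(newestAt);
  const reference = Date.parse(asOf);
  if (isNaN(newest) || isNaN(reference)) {
    throw new Error("timestamps must be parseable ISO strings");
  }
  return reference - newest > staleAfterDays * MS_PER_DAY;
}

// Assumed semantics: a report is healthy exactly when it carries no alarms.
function stampFromAlarms(alarms: string[]): { healthy: boolean; alarms: string[] } {
  return { healthy: alarms.length === 0, alarms: [...alarms] };
}

// 40 days between newest snapshot and asOf exceeds the 30-day default.
console.log(isStale("2026-05-01T00:00:00Z", "2026-06-10T00:00:00Z")); // true
// 9 days does not.
console.log(isStale("2026-06-01T00:00:00Z", "2026-06-10T00:00:00Z")); // false
console.log(stampFromAlarms(["stale"])); // { healthy: false, alarms: [ 'stale' ] }
```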

@tangletools left a comment


✅ Auto-approved PR — c7a0247b

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:43:29Z

@drewstone drewstone merged commit 0292cfd into main Jun 10, 2026
1 check passed
