feat(meta-eval): judge sentinel — drift, decay, and silent-upgrade alarms by drewstone · Pull Request #239 · tangle-network/agent-eval

drewstone · 2026-06-10T10:43:22Z

What

Eval trustworthiness as a continuously measured, alarmed trend. Judges are models; models change underneath us; calibration decays. This composes the existing dormant instruments into one snapshot → series → alarm loop on the ./meta-eval subpath — no instrument re-implemented, no version bump.

Surface (`src/meta-eval/sentinel.ts`, exported via `./meta-eval`)

SentinelSnapshot { at, judgeId, judgeModel, metrics: { irr?, calibrationKappa?, sentinelPassRate? } } — caller-supplied ISO timestamps, module is clock-free. Built by adapters from real instrument outputs:
- snapshotFromCalibration ← calibrateJudge / calibrateJudgeContinuous (prefers the un-rounded continuous κ_w when present)
- snapshotFromAgreement ← corpusInterRaterAgreement (statistics.ts) or continuousAgreement (IRR = ICC(2,1))
- snapshotFromSentinelSet ← a rerun golden set; tolerance-gated pass rate; zero joined items is a loud error, not a 0% pass rate
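A minimal sketch of the tolerance-gated pass rate described for the golden-set adapter. The `GoldenItem` and `RerunItem` shapes and the `sentinelPassRate` function name are assumptions made for illustration; the adapter's real input types and signature are not shown in this PR text.

```typescript
// Hypothetical shapes; the real adapter's inputs are not shown in the PR.
interface GoldenItem { id: string; goldScore: number }
interface RerunItem { id: string; judgeScore: number }

function sentinelPassRate(
  golden: GoldenItem[],
  rerun: RerunItem[],
  tolerance: number,
): number {
  const goldById = new Map<string, number>();
  for (const g of golden) goldById.set(g.id, g.goldScore);

  let joined = 0;
  let passed = 0;
  for (const r of rerun) {
    const gold = goldById.get(r.id);
    if (gold === undefined) continue; // rerun item is not in the golden set
    joined += 1;
    // An item passes when the judge lands within `tolerance` of the gold score.
    if (Math.abs(r.judgeScore - gold) <= tolerance) passed += 1;
  }

  // Zero joins means the rerun never touched the golden set. That is a wiring
  // bug, not a 0% pass rate, so fail loudly instead of returning 0.
  if (joined === 0) {
    throw new Error("sentinel set: no rerun item matched a golden id");
  }
  return passed / joined;
}

const golden: GoldenItem[] = [
  { id: "a", goldScore: 0.8 },
  { id: "b", goldScore: 0.2 },
];
const rerun: RerunItem[] = [
  { id: "a", judgeScore: 0.75 }, // within 0.1 of gold: pass
  { id: "b", judgeScore: 0.6 },  // 0.4 away from gold: fail
];
const rate = sentinelPassRate(golden, rerun, 0.1);
```

Throwing on zero joins keeps an id mismatch between the rerun and the golden set from being recorded as a collapse in judge quality.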
SentinelStore — inMemorySentinelStore() + fileSentinelStore(path) (JSONL; corrupt or shape-invalid line fails loud with file:line; unknown metric keys rejected so a typo can't silently never trend)
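The loud-failure rules for the JSONL store can be sketched as a pure parser. This is an illustration under assumptions, not the store's code: `parseSnapshotLines` and the `Snapshot` interface are hypothetical names, and the exact error wording in the real module is unknown.

```typescript
// Hypothetical snapshot shape mirroring the fields named in the PR text.
interface Snapshot {
  at: string;
  judgeId: string;
  judgeModel: string;
  metrics: { irr?: number; calibrationKappa?: number; sentinelPassRate?: number };
}

const KNOWN_METRICS = ["irr", "calibrationKappa", "sentinelPassRate"];

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Parses JSONL text; every failure names file:line so it cannot be missed.
function parseSnapshotLines(text: string, file: string): Snapshot[] {
  const out: Snapshot[] = [];
  let lineNo = 0;
  for (const raw of text.split("\n")) {
    lineNo += 1;
    if (raw.trim() === "") continue; // tolerate blank lines and a trailing newline
    const where = `${file}:${lineNo}`;

    let value: unknown;
    try {
      value = JSON.parse(raw);
    } catch {
      throw new Error(`${where}: corrupt JSONL line`);
    }
    if (!isPlainObject(value)) throw new Error(`${where}: snapshot must be an object`);

    for (const field of ["at", "judgeId", "judgeModel"]) {
      if (typeof value[field] !== "string") {
        throw new Error(`${where}: missing string field "${field}"`);
      }
    }
    const metrics = value["metrics"];
    if (!isPlainObject(metrics)) throw new Error(`${where}: "metrics" must be an object`);

    for (const key of Object.keys(metrics)) {
      // Reject unknown keys: a typo such as "kappa" would otherwise never trend.
      if (KNOWN_METRICS.indexOf(key) === -1) {
        throw new Error(`${where}: unknown metric key "${key}"`);
      }
      const v = metrics[key];
      if (typeof v !== "number" || !isFinite(v)) {
        throw new Error(`${where}: metric "${key}" must be a finite number`);
      }
    }
    out.push(value as unknown as Snapshot);
  }
  return out;
}
```

An empty `metrics` object passes validation here, which matches the PR's note that empty-metrics snapshots serve as model-change markers.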
judgeSentinelReport(history, { asOf, thresholds }) — per-judge×metric trends through the real analyzeSeries state machine (stabilized / drifting-up / drifting-down / noisy / insufficient-data, verbatim). Alarms:
- convergence state drifting-down
- irr below minIrr (default 0.6) — applies even to thin series
- calibrationKappa / sentinelPassRate drop vs series baseline beyond maxKappaDrop 0.15 / maxSentinelDrop 0.1 (catches the cliff the trend machine calls noisy)
- silent-upgrade trap (first-class): judgeModel changed with no post-change golden-grounded snapshot. irr alone does NOT clear it — judges agreeing with each other after a model swap proves nothing; only re-measuring against gold does. Empty-metrics snapshots are documented model-change markers.
- newest snapshot older than staleAfterDays (default 30) vs asOf
- insufficientHistory names judgeId:metric blind spots instead of hiding them
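The silent-upgrade trap above can be sketched for a single judge's history. This is an illustrative reading, not the module's implementation: `silentUpgradeAlarm` is a hypothetical name, and treating calibrationKappa and sentinelPassRate as the "golden-grounded" metrics is an assumption drawn from the statement that irr alone does not clear the alarm.

```typescript
// Hypothetical snapshot shape mirroring the fields named in the PR text.
interface Snapshot {
  at: string; // caller-supplied ISO timestamp
  judgeId: string;
  judgeModel: string;
  metrics: { irr?: number; calibrationKappa?: number; sentinelPassRate?: number };
}

// Assumes `history` holds snapshots for ONE judge. Returns true when the model
// changed and no snapshot since then was measured against gold.
function silentUpgradeAlarm(history: Snapshot[]): boolean {
  // ISO-8601 timestamps in the same format and zone sort correctly as strings.
  const sorted = history
    .slice()
    .sort((a, b) => (a.at < b.at ? -1 : a.at > b.at ? 1 : 0));

  let prevModel: string | undefined;
  let grounded = true;
  for (const s of sorted) {
    // A model change invalidates every earlier gold measurement.
    if (prevModel !== undefined && s.judgeModel !== prevModel) grounded = false;
    // Only gold-grounded metrics re-establish trust; irr is ignored on purpose,
    // because judges agreeing with each other says nothing about gold.
    const m = s.metrics;
    if (m.calibrationKappa !== undefined || m.sentinelPassRate !== undefined) {
      grounded = true;
    }
    prevModel = s.judgeModel;
  }
  return !grounded;
}

const history: Snapshot[] = [
  { at: "2026-05-01T00:00:00Z", judgeId: "j1", judgeModel: "m1", metrics: { calibrationKappa: 0.8 } },
  { at: "2026-05-10T00:00:00Z", judgeId: "j1", judgeModel: "m2", metrics: {} }, // model-change marker
  { at: "2026-05-11T00:00:00Z", judgeId: "j1", judgeModel: "m2", metrics: { irr: 0.9 } },
];
const trapped = silentUpgradeAlarm(history); // irr after the swap does not clear it
```

Appending a snapshot for the new model that carries calibrationKappa or sentinelPassRate would clear the alarm in this sketch.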
evalHealthStamp(report) → { healthy, alarms } — the tiny object a campaign attaches to its verdicts (post-campaign hook / nightly). An alarmed sentinel means downstream verdicts carry an integrity warning until recalibration clears it.
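As a rough sketch of how a stamp might travel with a verdict, the example below derives one alarm (staleness versus a caller-supplied asOf) and attaches the resulting stamp. All names here (`stalenessAlarm`, `stampFrom`, `attachStamp`, `Verdict`) are hypothetical; the real evalHealthStamp takes a report object whose shape is not shown in this PR text.

```typescript
interface HealthStamp { healthy: boolean; alarms: string[] }

// Hypothetical verdict shape, used only to show where the stamp is attached.
interface Verdict { campaign: string; passed: boolean; evalHealth?: HealthStamp }

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Clock-free: both timestamps are supplied by the caller as ISO strings.
function stalenessAlarm(newestAt: string, asOf: string, staleAfterDays = 30): string | null {
  const ageDays = (Date.parse(asOf) - Date.parse(newestAt)) / MS_PER_DAY;
  return ageDays > staleAfterDays
    ? `stale: newest snapshot is ${Math.floor(ageDays)} days old`
    : null;
}

// Healthy means no alarms at all; the alarms ride along for explanation.
function stampFrom(alarms: string[]): HealthStamp {
  return { healthy: alarms.length === 0, alarms };
}

function attachStamp(verdict: Verdict, stamp: HealthStamp): Verdict {
  return { ...verdict, evalHealth: stamp };
}

const alarm = stalenessAlarm("2026-04-01T00:00:00Z", "2026-06-10T00:00:00Z");
const stamp = stampFrom(alarm === null ? [] : [alarm]);
const stamped = attachStamp({ campaign: "nightly", passed: true }, stamp);
```

Here the verdict still says `passed: true`, but the attached stamp marks it as produced under an alarmed sentinel, which is the integrity warning the description refers to.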

Reuse over rebuild

Trend engine: analyzeSeries (src/series-convergence) — real state machine, real vocabulary, no fork
Calibration: CalibrationResult / ContinuousCalibrationResult from src/judge-calibration
Agreement: CorpusAgreementReport (src/statistics) + ContinuousAgreement
Persistence discipline mirrors fileExperimentStore (typed loud errors, ENOENT → empty)

One spec note: the prompt pointed at calibration.ts calibrationFromPairs for calibrationKappa — that instrument outputs ECE, not κ. The real κ-vs-gold lives in judge-calibration.ts (calibrateJudge*), so the adapter grounds there; calibrationFromPairs' ECE report stays available for its own purpose.
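To make the ECE-versus-κ distinction concrete, here is a small self-contained comparison. It uses textbook definitions (unweighted Cohen's κ and equal-width-bin expected calibration error), not the repository's instruments; the function names are hypothetical and the repo's own calibration uses a weighted κ_w, which is not reproduced here.

```typescript
// Cohen's kappa: chance-corrected agreement between judge labels and gold labels.
function cohenKappa(pairs: Array<{ judge: string; gold: string }>): number {
  const n = pairs.length;
  if (n === 0) throw new Error("need at least one labelled pair");
  const judgeCount: Record<string, number> = {};
  const goldCount: Record<string, number> = {};
  let agree = 0;
  for (const p of pairs) {
    if (p.judge === p.gold) agree += 1;
    judgeCount[p.judge] = (judgeCount[p.judge] ?? 0) + 1;
    goldCount[p.gold] = (goldCount[p.gold] ?? 0) + 1;
  }
  const po = agree / n; // observed agreement
  let pe = 0;           // agreement expected by chance from the marginals
  for (const label of Object.keys(judgeCount)) {
    pe += ((judgeCount[label] ?? 0) / n) * ((goldCount[label] ?? 0) / n);
  }
  if (pe === 1) return 1; // both sides used one identical label throughout
  return (po - pe) / (1 - pe);
}

// ECE: weighted mean gap between stated confidence and accuracy, per bin.
function expectedCalibrationError(
  preds: Array<{ confidence: number; correct: boolean }>,
  bins = 10,
): number {
  const n = preds.length;
  if (n === 0) throw new Error("need at least one prediction");
  const buckets: Record<number, { n: number; conf: number; hits: number }> = {};
  for (const p of preds) {
    const b = Math.min(bins - 1, Math.floor(p.confidence * bins));
    const cur = buckets[b] ?? { n: 0, conf: 0, hits: 0 };
    buckets[b] = { n: cur.n + 1, conf: cur.conf + p.confidence, hits: cur.hits + (p.correct ? 1 : 0) };
  }
  let ece = 0;
  for (const key of Object.keys(buckets)) {
    const b = buckets[Number(key)];
    if (b === undefined) continue;
    ece += (b.n / n) * Math.abs(b.hits / b.n - b.conf / b.n);
  }
  return ece;
}

const kappa = cohenKappa([
  { judge: "pass", gold: "pass" },
  { judge: "pass", gold: "fail" },
  { judge: "fail", gold: "fail" },
  { judge: "fail", gold: "fail" },
]); // po = 0.75, pe = 0.5, so kappa = 0.5

const ece = expectedCalibrationError([
  { confidence: 0.95, correct: true },
  { confidence: 0.95, correct: false },
  { confidence: 0.65, correct: true },
  { confidence: 0.65, correct: true },
]); // 0.5 * |0.5 - 0.95| + 0.5 * |1 - 0.65| = 0.4
```

The two numbers answer different questions: κ asks whether the judge's labels match gold beyond chance, while ECE asks whether its stated confidence matches its hit rate, so one cannot stand in for the other in a calibrationKappa series.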

Tests (27, all deterministic — synthetic histories, fixed timestamps, no LLM calls)

- decay → drifting-down alarm
- κ-cliff drop alarm
- model-change-without-recalibration alarms (+ does-not-alarm when recalibration follows, + irr-does-not-clear-it)
- healthy stays healthy
- staleness vs asOf both ways
- insufficientHistory named
- file store roundtrip across instances + corrupt-line and shape-invalid-line loud failures
- series-convergence integration asserts row state === analyzeSeries(values).state (the real exported machine) including options forwarding
- adapters exercised against the real instruments (calibrateJudge, continuousAgreement, corpusInterRaterAgreement with bootstrap: 0)

Gates: pnpm typecheck ✓ · pnpm test 208 files / 2009 passed ✓ · pnpm build ✓ (dist subpath exports verified). No version bump; no edits to src/judges.ts / src/judge-ensemble.ts (in-flight spine PR); no root src/index.ts changes (meta-eval is subpath-only by design).

…arms

Composes the existing instruments (calibrateJudge/calibrateJudgeContinuous, corpusInterRaterAgreement/continuousAgreement, analyzeSeries) into one snapshot → series → alarm loop for judge trustworthiness:

- SentinelSnapshot + adapters from real instrument outputs (snapshotFromCalibration, snapshotFromAgreement, snapshotFromSentinelSet)
- SentinelStore with inMemorySentinelStore + fileSentinelStore (JSONL, corrupt or shape-invalid line fails loud with file:line)
- judgeSentinelReport: per-judge×metric trends through the real series-convergence state machine, with alarms for drifting-down, irr-below-floor, kappa/pass-rate drops vs baseline, staleness vs asOf, and the silent-upgrade trap (judge model changed with no post-change golden-grounded snapshot; irr alone does not clear it)
- evalHealthStamp: the tiny { healthy, alarms } object a campaign attaches to its verdicts post-campaign or nightly

Module is clock-free (all timestamps caller-supplied ISO) and exported via the ./meta-eval subpath only.

tangletools

✅ Auto-approved PR — `c7a0247b`

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:43:29Z

tangletools approved these changes Jun 10, 2026


drewstone merged commit 0292cfd into main Jun 10, 2026
1 check passed

drewstone merged 1 commit into main from feat/judge-sentinel
