feat(meta-eval): judge sentinel — drift, decay, and silent-upgrade alarms#239
Merged
…arms
Composes the existing instruments (calibrateJudge/calibrateJudgeContinuous,
corpusInterRaterAgreement/continuousAgreement, analyzeSeries) into one
snapshot → series → alarm loop for judge trustworthiness:
- SentinelSnapshot + adapters from real instrument outputs
(snapshotFromCalibration, snapshotFromAgreement, snapshotFromSentinelSet)
- SentinelStore with inMemorySentinelStore + fileSentinelStore (JSONL,
corrupt or shape-invalid line fails loud with file:line)
- judgeSentinelReport: per-judge×metric trends through the real
series-convergence state machine, with alarms for drifting-down,
irr-below-floor, kappa/pass-rate drops vs baseline, staleness vs asOf,
and the silent-upgrade trap (judge model changed with no post-change
golden-grounded snapshot — irr alone does not clear it)
- evalHealthStamp: the tiny { healthy, alarms } object a campaign
attaches to its verdicts post-campaign or nightly
Module is clock-free (all timestamps caller-supplied ISO) and exported
via the ./meta-eval subpath only.
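
The silent-upgrade rule above can be illustrated with a minimal sketch. It is not the code in src/meta-eval/sentinel.ts: the `Snapshot` interface is a hypothetical local mirror of the snapshot shape, `silentUpgradePending` and `isGoldenGrounded` are made-up names, and treating `calibrationKappa` and `sentinelPassRate` as the only golden-grounded metrics is an assumption drawn from the description. It also assumes the history belongs to a single judge.

```typescript
// Hypothetical local mirror of the snapshot shape; not imported from the module.
interface Snapshot {
  at: string; // caller-supplied ISO timestamp (the sketch stays clock-free too)
  judgeId: string;
  judgeModel: string;
  metrics: { irr?: number; calibrationKappa?: number; sentinelPassRate?: number };
}

// Assumption: only metrics measured against gold count as recalibration.
// irr (judges agreeing with each other) deliberately does not.
function isGoldenGrounded(s: Snapshot): boolean {
  return s.metrics.calibrationKappa !== undefined || s.metrics.sentinelPassRate !== undefined;
}

// True when the most recent judgeModel change in a single judge's history
// has no golden-grounded snapshot at or after the change.
function silentUpgradePending(history: Snapshot[]): boolean {
  const sorted = [...history].sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  let changeIndex = -1;
  let prevModel: string | undefined;
  sorted.forEach((s, i) => {
    if (prevModel !== undefined && s.judgeModel !== prevModel) changeIndex = i;
    prevModel = s.judgeModel;
  });
  if (changeIndex === -1) return false; // the model never changed
  return !sorted.slice(changeIndex).some(isGoldenGrounded);
}
```

Under these assumptions, a history whose only post-change snapshot carries `irr` still reports a pending upgrade, and an empty-metrics snapshot with a new `judgeModel` acts as a pure model-change marker.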
tangletools
approved these changes
Jun 10, 2026
tangletools
left a comment
Contributor
✅ Auto-approved PR — c7a0247b
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-10T10:43:29Z
What

Eval trustworthiness as a continuously measured, alarmed trend. Judges are models; models change underneath us; calibration decays. This composes the existing dormant instruments into one snapshot → series → alarm loop on the ./meta-eval subpath: no instrument re-implemented, no version bump.

Surface (src/meta-eval/sentinel.ts, exported via ./meta-eval)

- SentinelSnapshot { at, judgeId, judgeModel, metrics: { irr?, calibrationKappa?, sentinelPassRate? } }: caller-supplied ISO timestamps, module is clock-free. Built by adapters from real instrument outputs:
  - snapshotFromCalibration ← calibrateJudge / calibrateJudgeContinuous (prefers the un-rounded continuous κ_w when present)
  - snapshotFromAgreement ← corpusInterRaterAgreement (statistics.ts) or continuousAgreement (IRR = ICC(2,1))
  - snapshotFromSentinelSet ← a rerun golden set; tolerance-gated pass rate; zero joins is a loud error, not a 0%
- SentinelStore: inMemorySentinelStore() + fileSentinelStore(path) (JSONL; corrupt or shape-invalid line fails loud with file:line; unknown metric keys rejected so a typo can't silently never trend)
- judgeSentinelReport(history, { asOf, thresholds }): per-judge×metric trends through the real analyzeSeries state machine (stabilized / drifting-up / drifting-down / noisy / insufficient-data, verbatim). Alarms:
  - drifting-down trend
  - irr below minIrr (default 0.6); applies even to thin series
  - calibrationKappa / sentinelPassRate drop vs series baseline beyond maxKappaDrop 0.15 / maxSentinelDrop 0.1 (catches the cliff the trend machine calls noisy)
  - silent-upgrade trap: judgeModel changed with no post-change golden-grounded snapshot. irr alone does NOT clear it: judges agreeing with each other after a model swap proves nothing; only re-measuring against gold does. Empty-metrics snapshots are documented model-change markers.
  - staleness: staleAfterDays (default 30) vs asOf
- insufficientHistory names judgeId:metric blind spots instead of hiding them
- evalHealthStamp(report) → { healthy, alarms }: the tiny object a campaign attaches to its verdicts (post-campaign hook / nightly). An alarmed sentinel means downstream verdicts carry an integrity warning until recalibration clears it.

Reuse over rebuild

- analyzeSeries (src/series-convergence): real state machine, real vocabulary, no fork
- CalibrationResult / ContinuousCalibrationResult from src/judge-calibration
- CorpusAgreementReport (src/statistics) + ContinuousAgreement
- fileExperimentStore (typed loud errors, ENOENT → empty)

One spec note: the prompt pointed at calibration.ts calibrationFromPairs for calibrationKappa, but that instrument outputs ECE, not κ. The real κ-vs-gold lives in judge-calibration.ts (calibrateJudge*), so the adapter grounds there; calibrationFromPairs' ECE report stays available for its own purpose.

Tests (27, all deterministic: synthetic histories, fixed timestamps, no LLM calls)

decay → drifting-down alarm; κ-cliff drop alarm; model-change-without-recalibration alarms (+ does-not-alarm when recalibration follows, + irr-does-not-clear-it); healthy stays healthy; staleness vs asOf both ways; insufficientHistory named; file store roundtrip across instances + corrupt-line and shape-invalid-line loud failures; series-convergence integration asserts row state === analyzeSeries(values).state (the real exported machine) including options forwarding; adapters exercised against the real instruments (calibrateJudge, continuousAgreement, corpusInterRaterAgreement with bootstrap: 0).

Gates: pnpm typecheck ✓ · pnpm test 208 files / 2009 passed ✓ · pnpm build ✓ (dist subpath exports verified). No version bump; no edits to src/judges.ts / src/judge-ensemble.ts (in-flight spine PR); no root src/index.ts changes (meta-eval is subpath-only by design).
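
For readers skimming the threshold alarms listed under Surface, here is a rough sketch of how the floor and baseline-drop checks could compose into a `{ healthy, alarms }` stamp. Everything in it is illustrative: `thresholdAlarms`, `stampFromAlarms`, the alarm label strings other than `irr-below-floor`, and the choice of the first series value as the "baseline" are assumptions, not the module's actual API or definition. Only the default thresholds (0.6, 0.15, 0.1) come from the description.

```typescript
// Hypothetical threshold bag; defaults are the ones quoted in the PR description.
interface SketchThresholds { minIrr: number; maxKappaDrop: number; maxSentinelDrop: number }
const sketchDefaults: SketchThresholds = { minIrr: 0.6, maxKappaDrop: 0.15, maxSentinelDrop: 0.1 };

// One judge's metric values in chronological order (hypothetical shape).
interface MetricSeries { irr: number[]; calibrationKappa: number[]; sentinelPassRate: number[] }

// Assumption: "baseline" is the first value of the series; the drop is baseline minus latest.
function dropFromBaseline(values: number[]): number {
  const first = values[0];
  const last = values[values.length - 1];
  if (first === undefined || last === undefined) return 0; // empty series: nothing to compare
  return first - last;
}

function thresholdAlarms(s: MetricSeries, t: SketchThresholds = sketchDefaults): string[] {
  const alarms: string[] = [];
  const latestIrr = s.irr[s.irr.length - 1];
  // The floor check needs only one point, so it fires even on thin series.
  if (latestIrr !== undefined && latestIrr < t.minIrr) alarms.push("irr-below-floor");
  if (dropFromBaseline(s.calibrationKappa) > t.maxKappaDrop) alarms.push("kappa-drop");
  if (dropFromBaseline(s.sentinelPassRate) > t.maxSentinelDrop) alarms.push("sentinel-pass-rate-drop");
  return alarms;
}

// The stamp is healthy only when no alarm fired.
function stampFromAlarms(alarms: string[]): { healthy: boolean; alarms: string[] } {
  return { healthy: alarms.length === 0, alarms };
}
```

A campaign hook could then attach `stampFromAlarms(thresholdAlarms(series))` to its verdicts. The real report also covers trend, staleness, and silent-upgrade alarms, which this sketch leaves out.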