Context
Child of #90. Blocked by #109 (vLLM serving). This is the final step: the trained model becomes the cascade's slow tier in production, initially under shadow mode.
Scope
1. Config change
Flip the production config from today's slow_backend: null to the new model:
judge:
  backend: cascade
  cascade:
    fast_backend: deberta
    slow_backend: vllm
  vllm:
    base_url: "http://vllm-judge.internal:8000"
    model: "llmtrace-qwen-judge-v1"
    max_tokens: 512
    temperature: 0.1
    allow_plaintext: false # service is loopback in-cluster, overridden below
  promotion:
    shadow: true # MANDATORY for the first 1000 verdicts
Shadow mode is non-negotiable for the first rollout. Even after the acceptance gate passes, real traffic differs from eval traffic.
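A pre-deploy check can catch a config that enables the slow tier without shadow mode. The following is a minimal sketch, not project code: it takes the judge section as an already-parsed dict, and both the nesting (cascade and promotion as children of judge) and the helper name `check_first_rollout` are assumptions made for illustration.

```python
from typing import Any


def check_first_rollout(judge_cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the config passes.

    Assumes the layout judge -> {backend, cascade, promotion}. Adjust the
    lookups if the real schema nests these keys differently.
    """
    problems: list[str] = []
    cascade = judge_cfg.get("cascade") or {}
    promotion = judge_cfg.get("promotion") or {}

    if judge_cfg.get("backend") != "cascade":
        problems.append("backend must be 'cascade'")
    if not cascade.get("slow_backend"):
        problems.append("cascade.slow_backend is still null")
    # Shadow mode must be switched on explicitly for the first rollout.
    if promotion.get("shadow") is not True:
        problems.append("promotion.shadow must be true for the first rollout")
    return problems


example = {
    "backend": "cascade",
    "cascade": {"fast_backend": "deberta", "slow_backend": "vllm"},
    "promotion": {"shadow": True},
}
print(check_first_rollout(example))
```

Returning a list of problems rather than raising on the first one lets a CI step report every violation in a single run.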
2. Monitoring dashboards
Add Grafana panels (or update existing) that show:
- llmtrace_judge_requests_total{model="llmtrace-qwen-judge-v1"} rate.
- llmtrace_judge_latency_seconds{model="llmtrace-qwen-judge-v1"} p50/p95/p99.
- llmtrace_judge_shadow_would_block_total{model=...} vs existing protectai/deberta-v3-base-prompt-injection-v2: the differential is the escalation rate.
- llmtrace_judge_verdict_agreement{agreement=...}: slow-tier agreement with ensemble prior findings.
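To make the "differential" panel concrete, here is a small sketch of one possible reading: the share of requests the slow tier would block minus the share the fast tier blocks, over the same window. The function name and the choice to normalize by total requests are assumptions; the dashboard itself would express this as a query over the counters listed above.

```python
def escalation_differential(
    slow_would_block: int, fast_would_block: int, total_requests: int
) -> float:
    """Extra share of requests the slow tier would block versus the fast tier.

    Inputs are counter increases over one shared time window. A positive
    result means the slow tier flags more traffic than DeBERTa does today.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (slow_would_block - fast_would_block) / total_requests


# Made-up counter increases over one window, for illustration only.
print(escalation_differential(slow_would_block=30, fast_would_block=20, total_requests=1000))
```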
3. Shadow-mode calibration run
For 7 days or 1000 verdicts (whichever comes first):
- Shadow on; no verdict changes enforcement.
- Export verdicts nightly from the judge_verdicts table (or InMemoryJudgeVerdictStore for the lite profile in dev).
- Fit a reliability diagram on the observed confidences.
- Pick promotion.min_confidence at the target FP rate (the default floor of 0.7 remains unless the data says otherwise).
- Pick ambiguous_low and ambiguous_high on the fast-tier DeBERTa confidence distribution; the band should cover the region where DeBERTa's precision starts dropping.
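The reliability fit and threshold pick can be sketched offline in a few lines. This is a rough illustration under stated assumptions: the exported verdicts have been labelled by hand (`correct[i]` is True when a block verdict was right), "FP rate" is taken to mean the share of promoted block verdicts that were wrong, and the names `reliability_bins` and `pick_min_confidence` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bin:
    lower: float
    upper: float
    count: int
    mean_confidence: float
    accuracy: float


def reliability_bins(confidences: list[float], correct: list[bool], n_bins: int = 10) -> list[Bin]:
    """Group verdicts into equal-width confidence bins, skipping empty bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    totals = [[0, 0.0, 0] for _ in range(n_bins)]  # count, confidence sum, hits
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        totals[idx][0] += 1
        totals[idx][1] += conf
        totals[idx][2] += int(ok)
    return [
        Bin(i / n_bins, (i + 1) / n_bins, count, conf_sum / count, hits / count)
        for i, (count, conf_sum, hits) in enumerate(totals)
        if count > 0
    ]


def pick_min_confidence(
    confidences: list[float], correct: list[bool], target_fp_rate: float, floor: float = 0.7
) -> Optional[float]:
    """Lowest threshold >= floor whose promoted verdicts stay within the target.

    Returns None when no threshold at or above the floor meets the target.
    """
    candidates = sorted({floor, *(c for c in confidences if c >= floor)})
    for threshold in candidates:
        promoted = [ok for c, ok in zip(confidences, correct) if c >= threshold]
        if promoted and promoted.count(False) / len(promoted) <= target_fp_rate:
            return threshold
    return None


# Made-up labelled sample, for illustration only.
confs = [0.72, 0.75, 0.81, 0.88, 0.93, 0.97]
labels = [False, True, False, True, True, True]
for b in reliability_bins(confs, labels):
    print(f"[{b.lower:.1f}, {b.upper:.1f}) n={b.count} conf={b.mean_confidence:.2f} acc={b.accuracy:.2f}")
print(pick_min_confidence(confs, labels, target_fp_rate=0.25))
```

A well-calibrated judge shows accuracy close to mean confidence in each bin; large gaps in the bins near the floor are the signal to raise promotion.min_confidence. The same binning applied to DeBERTa confidences gives a starting point for the ambiguous band.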
4. Enforcement switch
After calibration review, flip shadow: false. Keep shadow: true as a rollback knob — one config push returns to monitored-only.
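The effect of the switch can be pictured as a single gate in front of enforcement. The sketch below is illustrative only: the `Verdict` type and `should_enforce` function are hypothetical and do not describe how the cascade is actually wired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    would_block: bool
    confidence: float


def should_enforce(verdict: Verdict, shadow: bool, min_confidence: float = 0.7) -> bool:
    """Decide whether a slow-tier verdict may change enforcement.

    In shadow mode the verdict is only recorded, so nothing is enforced.
    """
    if shadow:
        return False
    return verdict.would_block and verdict.confidence >= min_confidence


v = Verdict(would_block=True, confidence=0.92)
print(should_enforce(v, shadow=True), should_enforce(v, shadow=False))
```

Keeping the shadow check as the first branch is what makes the flag a clean rollback knob: flipping it back suppresses enforcement regardless of confidence.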
5. Publish a production eval report
Mirror the docs/research/results/judge_evaluation_*.md format with the observed production numbers:
File: docs/research/results/judge_qwen-v1_production_<date>.md.
Acceptance
- /health reports judge.worker_spawned: true with the cascade composed model label deberta+vllm.
Rollback
Config push: promotion.shadow: true restores shadow mode. One command, no restart.