feat(judge-ft): wire Qwen slow-judge into LLMTrace cascade + shadow-mode rollout #110

## Context

Child of #90. Blocked by #109 (vLLM serving). This is the final step: the trained model becomes the cascade's slow tier in production, under shadow mode.

## Scope

### 1. Config change

Flip the production config from today's `slow_backend: null` to the new model:

```yaml
judge:
  backend: cascade
  cascade:
    fast_backend: deberta
    slow_backend: vllm
  vllm:
    base_url: "http://vllm-judge.internal:8000"
    model: "llmtrace-qwen-judge-v1"
    max_tokens: 512
    temperature: 0.1
    allow_plaintext: false        # service is loopback in-cluster, overridden below
  promotion:
    shadow: true                  # MANDATORY for the first 1000 verdicts
```

Shadow mode is non-negotiable for the first rollout. Even after the acceptance gate passes, real traffic differs from eval traffic.
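To make the intended shadow semantics concrete, here is a minimal Python sketch of how the slow tier's verdict could be gated. It is an illustration under stated assumptions, not the LLMTrace implementation: `slow_tier_outcome`, `SlowTierOutcome`, and the `slow_judge` callable are hypothetical names, and the band values in the usage note are placeholders. Only `shadow`, `ambiguous_low`, `ambiguous_high`, and the 0.7 `min_confidence` floor come from this issue.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class SlowTierOutcome:
    escalated: bool     # fast-tier confidence fell inside the ambiguous band
    would_block: bool   # slow tier flagged the request with enough confidence
    enforced: bool      # the slow-tier verdict actually changed enforcement


def slow_tier_outcome(
    fast_confidence: float,
    slow_judge: Callable[[], Tuple[bool, float]],  # hypothetical: returns (is_malicious, confidence)
    *,
    shadow: bool,
    ambiguous_low: float,
    ambiguous_high: float,
    min_confidence: float = 0.7,
) -> SlowTierOutcome:
    # Outside the band the slow tier is never called (assumed cascade behavior).
    if not (ambiguous_low <= fast_confidence <= ambiguous_high):
        return SlowTierOutcome(escalated=False, would_block=False, enforced=False)

    is_malicious, confidence = slow_judge()
    would_block = is_malicious and confidence >= min_confidence
    # In shadow mode the verdict is recorded but never enforced.
    return SlowTierOutcome(
        escalated=True,
        would_block=would_block,
        enforced=would_block and not shadow,
    )
```

For example, `slow_tier_outcome(0.5, lambda: (True, 0.9), shadow=True, ambiguous_low=0.3, ambiguous_high=0.7)` reports `would_block=True` and `enforced=False`, which is the case the shadow would-block counter is meant to capture.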

### 2. Monitoring dashboards

Add Grafana panels (or update existing ones) that show:

- `llmtrace_judge_requests_total{model="llmtrace-qwen-judge-v1"}` rate.
- `llmtrace_judge_latency_seconds{model="llmtrace-qwen-judge-v1"}` p50/p95/p99.
- `llmtrace_judge_shadow_would_block_total{model=...}` vs existing `protectai/deberta-v3-base-prompt-injection-v2` — the *differential* is the escalation rate.
- `llmtrace_judge_verdict_agreement{agreement=...}` — slow-tier agreement with ensemble prior findings.

### 3. Shadow-mode calibration run

For 7 days or 1000 verdicts (whichever comes first):

- Shadow on; no verdict changes enforcement.
- Export verdicts nightly from the `judge_verdicts` table (or `InMemoryJudgeVerdictStore` for the lite profile in dev).
- Fit a reliability diagram on the observed confidences.
- Pick `promotion.min_confidence` at the target FP rate (the default floor of 0.7 stays unless the data says otherwise).
- Pick `ambiguous_low` and `ambiguous_high` on the fast-tier DeBERTa confidence distribution — band should cover the region where DeBERTa's precision starts dropping.
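The calibration steps above could be sketched roughly as follows. This assumes each exported verdict can be paired with a ground-truth label (for example from manual review, which this issue does not specify) and that the slow tier's confidence reads as a probability that the request is malicious. The function names and the sample shape are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (slow-tier confidence that the request is malicious, ground-truth label)
Sample = Tuple[float, bool]


def reliability_bins(samples: Sequence[Sample], n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, observed malicious fraction, count) per non-empty bin."""
    buckets: List[List[Sample]] = [[] for _ in range(n_bins)]
    for conf, malicious in samples:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        buckets[idx].append((conf, malicious))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            observed = sum(1 for _, m in bucket if m) / len(bucket)
            rows.append((mean_conf, observed, len(bucket)))
    return rows


def pick_min_confidence(
    samples: Sequence[Sample], target_fp_rate: float, floor: float = 0.7
) -> Optional[float]:
    """Lowest threshold >= floor whose false-positive rate is at or under the target."""
    benign = [c for c, malicious in samples if not malicious]
    if not benign:
        return floor  # no benign samples, so nothing argues against the floor
    candidates = sorted({floor} | {c for c, _ in samples if c > floor})
    for threshold in candidates:
        fp_rate = sum(1 for c in benign if c >= threshold) / len(benign)
        if fp_rate <= target_fp_rate:
            return threshold
    return None  # no observed threshold meets the target
```

A well-calibrated judge shows observed fractions close to the mean confidence in each bin. With only about 1000 verdicts, sparse bins are noisy, so the per-bin count is returned alongside the rates.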

### 4. Enforcement switch

After calibration review, flip `shadow: false`. Keep `shadow: true` as a rollback knob — one config push returns to monitored-only.

### 5. Publish a production eval report

Mirror the `docs/research/results/judge_evaluation_*.md` format with the observed production numbers:

- 7-day verdict volume + latency distribution (real traffic, not benchmark).
- Agreement rate with DeBERTa fast tier.
- Cost per 1000 requests.
- Delta vs the #108 benchmark numbers — expect some regression on real traffic vs curated corpora.
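The report numbers above reduce to a few small computations, sketched here in Python. The helper names and input shapes are assumptions for illustration. The percentile uses the nearest-rank method, which may differ slightly from the quantile estimate Grafana derives from histogram buckets.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of a non-empty sequence; p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def agreement_rate(fast_flags: Sequence[bool], slow_flags: Sequence[bool]) -> float:
    """Fraction of paired verdicts where the fast and slow tiers agree."""
    pairs = list(zip(fast_flags, slow_flags))
    return sum(1 for fast, slow in pairs if fast == slow) / len(pairs)


def cost_per_1000(total_cost: float, request_count: int) -> float:
    """Serving cost normalized to 1000 requests."""
    return total_cost * 1000 / request_count


def delta_vs_benchmark(production: float, benchmark: float) -> float:
    """Signed difference; negative means production is below the benchmark."""
    return production - benchmark
```

Reporting the signed delta next to each benchmark figure makes the expected regression on real traffic visible at a glance.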

File: `docs/research/results/judge_qwen-v1_production_<date>.md`.

## Acceptance

- [ ] Config change rolled out; `/health` reports `judge.worker_spawned: true` with the cascade composed model label `deberta+vllm`.
- [ ] Shadow-mode data collected; calibration report published.
- [ ] Production eval report committed.
- [ ] Enforcement flipped; stable for 7 days with no enforcement-driven incidents.

## Rollback

Config push: `promotion.shadow: true` restores shadow mode. One command, no restart.
