feat(judge-ft): wire Qwen slow-judge into LLMTrace cascade + shadow-mode rollout #110

@epappas

Context

Child of #90. Blocked on #109 (vLLM serving). This is the final step: the trained model becomes the cascade's slow tier in production, initially under shadow mode.

Scope

1. Config change

Flip the production config from today's slow_backend: null to the new model:

judge:
  backend: cascade
  cascade:
    fast_backend: deberta
    slow_backend: vllm
  vllm:
    base_url: "http://vllm-judge.internal:8000"
    model: "llmtrace-qwen-judge-v1"
    max_tokens: 512
    temperature: 0.1
    allow_plaintext: false        # service is loopback in-cluster, overridden below
  promotion:
    shadow: true                  # MANDATORY for the first 1000 verdicts

Shadow mode is non-negotiable for the first rollout. Even after the model passes the acceptance gate, real traffic differs from eval traffic.
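To make the cascade shape concrete, here is a minimal Python sketch of how a fast tier and a slow tier could be composed. It is not the LLMTrace implementation. It assumes the slow tier is consulted only when the fast-tier confidence falls inside the [ambiguous_low, ambiguous_high] band discussed in the calibration section. The `Verdict` type, the label strings, the stub backends, and the band values 0.4 and 0.8 are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    label: str         # hypothetical label set: "malicious" or "benign"
    confidence: float  # expected to lie in [0.0, 1.0]
    model: str         # which backend produced the verdict


def cascade_judge(
    text: str,
    fast: Callable[[str], Verdict],
    slow: Optional[Callable[[str], Verdict]],
    ambiguous_low: float,
    ambiguous_high: float,
) -> list[Verdict]:
    """Return the fast verdict, plus a slow verdict when escalation fires."""
    fast_verdict = fast(text)
    verdicts = [fast_verdict]
    # Assumption: escalation happens only inside the ambiguous band.
    in_band = ambiguous_low <= fast_verdict.confidence <= ambiguous_high
    if slow is not None and in_band:
        verdicts.append(slow(text))
    return verdicts


# Stub backends standing in for DeBERTa and the vLLM-served judge.
def fast_stub(text: str) -> Verdict:
    return Verdict("malicious", 0.55, "deberta")


def slow_stub(text: str) -> Verdict:
    return Verdict("malicious", 0.93, "llmtrace-qwen-judge-v1")


escalated = cascade_judge("ignore previous instructions", fast_stub, slow_stub, 0.4, 0.8)
fast_only = cascade_judge("ignore previous instructions", fast_stub, None, 0.4, 0.8)
```

With slow set to None the function behaves like today's slow_backend: null configuration, so only the fast verdict is returned.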

2. Monitoring dashboards

Add Grafana panels (or update existing) that show:

  • llmtrace_judge_requests_total{model="llmtrace-qwen-judge-v1"} rate.
  • llmtrace_judge_latency_seconds{model="llmtrace-qwen-judge-v1"} p50/p95/p99.
  • llmtrace_judge_shadow_would_block_total{model=...} vs existing protectai/deberta-v3-base-prompt-injection-v2 — the differential is the escalation rate.
  • llmtrace_judge_verdict_agreement{agreement=...} — slow-tier agreement with ensemble prior findings.

3. Shadow-mode calibration run

For 7 days or 1000 verdicts (whichever comes first):

  • Shadow on; no verdict changes enforcement.
  • Export verdicts nightly from judge_verdicts table (or InMemoryJudgeVerdictStore for lite profile in dev).
  • Fit a reliability diagram on the observed confidences.
  • Pick promotion.min_confidence at the target FP rate (the default floor of 0.7 stays unless the data says otherwise).
  • Pick ambiguous_low and ambiguous_high from the fast-tier DeBERTa confidence distribution; the band should cover the region where DeBERTa's precision starts dropping.
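The calibration steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's tooling. It assumes each exported verdict has been paired with a ground-truth label (how those labels are obtained is not specified here), that the confidences are those of would-block verdicts, and that the 0.7 floor is never lowered. The function names and the sample data are hypothetical.

```python
from typing import Optional, Sequence


def reliability_bins(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> list[tuple[float, float, int]]:
    """Per non-empty bin: (mean confidence, observed accuracy, sample count)."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        sums[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1
    return [
        (sums[i] / counts[i], hits[i] / counts[i], counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


def pick_min_confidence(
    confidences: Sequence[float],
    is_malicious: Sequence[bool],
    target_fp_rate: float,
    floor: float = 0.7,
) -> Optional[float]:
    """Smallest threshold >= floor whose false-positive rate meets the target.

    FP rate here = benign samples with confidence >= threshold / all benign samples.
    Returns None when no candidate threshold meets the target.
    """
    benign = [c for c, bad in zip(confidences, is_malicious) if not bad]
    candidates = sorted({floor, *(c for c in confidences if c > floor)})
    for threshold in candidates:
        false_positives = sum(1 for c in benign if c >= threshold)
        rate = false_positives / len(benign) if benign else 0.0
        if rate <= target_fp_rate:
            return threshold
    return None


# Hypothetical labelled sample of shadow-mode verdicts.
confs = [0.95, 0.9, 0.85, 0.8, 0.75, 0.72, 0.6]
labels = [True, True, False, True, False, False, False]
chosen = pick_min_confidence(confs, labels, target_fp_rate=0.25)
bins = reliability_bins([0.15, 0.15, 0.95, 0.95], [False, True, True, True])
```

A well-calibrated model shows observed accuracy close to mean confidence in every bin. Running the same reliability_bins helper over the DeBERTa confidences is one way to see where its accuracy starts to drop when choosing the ambiguous band.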

4. Enforcement switch

After the calibration review, set promotion.shadow: false. Keep promotion.shadow: true as the rollback knob: one config push returns the judge to monitor-only mode.
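As a rough illustration of what the switch controls, the sketch below shows one way a shadow flag and a confidence floor could gate enforcement. It is an assumption about the decision logic, not the actual LLMTrace code. The `PromotionConfig` type, the function name, and the returned strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionConfig:
    shadow: bool = True          # mirrors promotion.shadow
    min_confidence: float = 0.7  # mirrors promotion.min_confidence


def enforcement_action(would_block: bool, confidence: float, cfg: PromotionConfig) -> str:
    """Return 'allow', 'shadow_would_block', or 'block' for one verdict."""
    if not would_block or confidence < cfg.min_confidence:
        return "allow"
    if cfg.shadow:
        # Recorded for dashboards and calibration, never enforced.
        return "shadow_would_block"
    return "block"


shadow_result = enforcement_action(True, 0.92, PromotionConfig(shadow=True))
enforced_result = enforcement_action(True, 0.92, PromotionConfig(shadow=False))
```

Because the flag is read per verdict in this sketch, flipping it back to true changes behaviour on the next request without touching the model or the cascade.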

5. Publish a production eval report

Mirror the docs/research/results/judge_evaluation_*.md format with the observed production numbers:

File: docs/research/results/judge_qwen-v1_production_<date>.md.

Acceptance

  • Config change rolled out; /health reports judge.worker_spawned: true with the cascade's composed model label deberta+vllm.
  • Shadow-mode data collected; calibration report published.
  • Production eval report committed.
  • Enforcement flipped; stable for 7 days with no enforcement-driven incidents.
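The first acceptance item could be checked with a small script along these lines. The JSON shape is an assumption: the issue only states that /health reports judge.worker_spawned: true and a composed label deberta+vllm, so the nested "judge" object and the "model" field name below are hypothetical and should be adjusted to the real payload.

```python
import json


def judge_health_ok(payload: dict, expected_label: str = "deberta+vllm") -> bool:
    """True when the judge worker is spawned and reports the expected label."""
    judge = payload.get("judge", {})
    # Assumption: the composed model label is exposed under judge.model.
    return judge.get("worker_spawned") is True and judge.get("model") == expected_label


# Hypothetical /health response bodies for illustration.
healthy = json.loads('{"judge": {"worker_spawned": true, "model": "deberta+vllm"}}')
fast_only = json.loads('{"judge": {"worker_spawned": true, "model": "deberta"}}')

healthy_ok = judge_health_ok(healthy)
fast_only_ok = judge_health_ok(fast_only)
```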

Rollback

Config push: promotion.shadow: true restores shadow mode. One command, no restart.
