Context
Child of #90. Blocked on #107 (training completion).
The training loop produces checkpoints. This issue is the independent evaluation that decides whether any of them are good enough to ship as LLMTrace's slow-judge.
Separation from #107 is deliberate: training judges its own success using its training val set. This issue judges success using held-out sets the training never saw, using the same benchmark harness (judge_benchmark.rs) we already used for gpt-4o-mini.
Scope
1. Pull the LoRA into LLMTrace's benchmark harness
Merge the LoRA adapter into Qwen2.5-0.5B-Instruct using peft's merge_and_unload. Save as HF format. Serve with vllm serve.
Point judge_benchmark.rs at it:
JUDGE_BASE_URL=http://localhost:8000 \
JUDGE_MODEL=llmtrace-qwen-judge-v1-local \
BENCH_EXTERNAL_DIR=benchmarks/datasets/external \
BENCH_MAX_PER_SET=50 \
./target/release/examples/judge_benchmark
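The merge step above could look roughly like the following. This is a minimal sketch, not the project's actual script: the adapter and output paths are hypothetical, and it assumes the tokenizer was not modified during training, so it is copied from the base model.

```python
from pathlib import Path

# Hugging Face hub ID of the base model named in this issue.
BASE_MODEL = "Qwen/Qwen2.5-0.5B-Instruct"


def merge_lora(adapter_dir: str, out_dir: str, base_model: str = BASE_MODEL) -> Path:
    """Fold the LoRA weights into the base model and save a plain HF checkpoint."""
    # Imported lazily so the module still loads where the heavy dependencies are absent.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model)
    # Attach the adapter, then merge its deltas into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

    out = Path(out_dir)
    merged.save_pretrained(out)
    # Assumption: tokenizer unchanged by training, so the base tokenizer is saved alongside.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(out)
    return out
```

With hypothetical paths, `merge_lora("checkpoints/adapter", "merged/llmtrace-qwen-judge-v1")` would produce a directory that `vllm serve` can load; passing `--served-model-name llmtrace-qwen-judge-v1-local` is one way to make the served name match `JUDGE_MODEL`.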
2. Head-to-head comparison
Compare three models on the identical seed-42 sample set used for the gpt-4o-mini evaluation report (docs/research/results/judge_evaluation_gpt4o_mini_2026-04-20.md):
| Model | Role |
|---|---|
| openai/gpt-4o-mini | Baseline (already reported) |
| protectai/deberta-v3-base-prompt-injection-v2 | Fast-judge reference |
| llmtrace-qwen-judge-v1 | Candidate slow-judge |
Report for each: F1, precision, recall, FPR, per-dataset breakdown.
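For reference, the reported metrics follow the usual confusion-matrix definitions. The sketch below is illustrative only and is not how judge_benchmark.rs computes them; it assumes "malicious" is the positive class, and the function and field names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    precision: float
    recall: float
    f1: float
    fpr: float


def binary_metrics(labels: list[bool], preds: list[bool]) -> BinaryMetrics:
    """Compute metrics where True means malicious (the positive class)."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have equal length")
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)

    def ratio(num: float, den: float) -> float:
        # Report 0.0 instead of dividing by zero on empty classes.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    fpr = ratio(fp, fp + tn)
    return BinaryMetrics(precision, recall, f1, fpr)


def per_dataset(rows) -> dict[str, BinaryMetrics]:
    """rows: iterable of (dataset_name, label, pred) tuples."""
    grouped: dict[str, tuple[list[bool], list[bool]]] = {}
    for name, label, pred in rows:
        labels, preds = grouped.setdefault(name, ([], []))
        labels.append(label)
        preds.append(pred)
    return {name: binary_metrics(ls, ps) for name, (ls, ps) in grouped.items()}
```

Note that on benign-only sets such as xstest, precision and recall are undefined (reported as 0.0 here), so FPR is the only meaningful number for those rows.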
3. Calibration analysis
For the Qwen judge:
- Reliability diagram (10-bucket): confidence on x-axis, observed accuracy on y-axis.
- Brier score on held-out val set.
- ECE (Expected Calibration Error) — 10 buckets.
- Per-category confusion matrix, to show whether the judge is, for example, strong on direct injection but weak on data exfiltration.
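The calibration numbers above can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: the judge emits a confidence in [0, 1] for its predicted label, buckets are equal-width, and the Brier score is taken over "was the prediction correct" rather than over the raw malicious probability. All function names are illustrative.

```python
def reliability_buckets(confs, correct, n_buckets=10):
    """Return one (count, mean confidence, observed accuracy) tuple per bucket."""
    acc = [[0, 0.0, 0] for _ in range(n_buckets)]
    for conf, ok in zip(confs, correct):
        # Equal-width buckets; a confidence of exactly 1.0 goes in the last one.
        idx = min(int(conf * n_buckets), n_buckets - 1)
        acc[idx][0] += 1
        acc[idx][1] += conf
        acc[idx][2] += int(ok)
    return [(n, s / n, k / n) if n else (0, 0.0, 0.0) for n, s, k in acc]


def expected_calibration_error(confs, correct, n_buckets=10):
    """ECE: bucket-size-weighted mean of |accuracy - mean confidence|."""
    total = len(confs)
    if total == 0:
        raise ValueError("need at least one prediction")
    buckets = reliability_buckets(confs, correct, n_buckets)
    return sum(n / total * abs(accuracy - mean_conf) for n, mean_conf, accuracy in buckets)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if len(probs) == 0:
        raise ValueError("need at least one prediction")
    return sum((p - float(y)) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

The tuples from `reliability_buckets` are the points of the reliability diagram: plot mean confidence against observed accuracy and compare with the diagonal.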
4. Failure-mode analysis
Sample 30 failures (10 false negatives on malicious, 10 false positives on benign, 10 mis-categorised). Read them. Categorise the failure reasons. Attach the list to the report.
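One way to draw that sample reproducibly is sketched below. The record keys (`label`, `pred`, `category`, `pred_category`) are hypothetical, the seed of 42 is borrowed from the comparison set rather than specified for this step, and "mis-categorised" is assumed to mean a malicious sample that was flagged but assigned the wrong category.

```python
import random


def sample_failures(records, per_group=10, seed=42):
    """Split failures into the three groups and sample up to per_group from each."""
    groups = {"false_negative": [], "false_positive": [], "miscategorised": []}
    for r in records:
        if r["label"] and not r["pred"]:
            groups["false_negative"].append(r)
        elif not r["label"] and r["pred"]:
            groups["false_positive"].append(r)
        elif r["label"] and r["pred"] and r["category"] != r["pred_category"]:
            # Assumption: detected as malicious, but with the wrong category.
            groups["miscategorised"].append(r)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return {
        name: rng.sample(rows, min(per_group, len(rows)))
        for name, rows in groups.items()
    }
```

If a group has fewer than ten failures, the function returns all of them rather than raising, which is worth noting in the report.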
5. Ship-or-no-ship recommendation
Concrete recommendation based on numbers:
- Ship: F1 ≥ 0.80 on the 27-corpus set, FPR ≤ 0.05 on xstest + notinject_samples + benign_samples, ECE ≤ 0.10.
- Ship as slow-tier only: F1 ≥ 0.75 (cascade is more lenient; DeBERTa handles the easy cases).
- Don't ship: anything below.
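The thresholds above translate directly into a small decision function. This sketch assumes the slow-tier gate depends on F1 alone, since no FPR or ECE bound is listed for that tier; the function and parameter names are illustrative.

```python
def recommend(f1: float, benign_fpr: float, ece: float) -> str:
    """Map the three headline numbers to a ship decision.

    f1:         F1 on the 27-corpus set
    benign_fpr: FPR on xstest + notinject_samples + benign_samples
    ece:        10-bucket Expected Calibration Error
    """
    if f1 >= 0.80 and benign_fpr <= 0.05 and ece <= 0.10:
        return "ship"
    # Assumption: the slow-tier gate checks F1 only.
    if f1 >= 0.75:
        return "ship as slow-tier only"
    return "don't ship"
```

Under this reading, a checkpoint with F1 of 0.82 but a benign FPR of 0.09 would land in the slow-tier bucket rather than being rejected; if that is not the intent, the slow-tier rule needs its own FPR and ECE bounds.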
Acceptance
- docs/research/results/judge_evaluation_qwen-v1_<date>.md committed, with the same shape as the gpt-4o-mini report.