research(judge-ft): acceptance-gate evaluation + DeBERTa head-to-head #108

## Context

Child of #90. Blocked by #107 (training completion).

The training loop produces checkpoints. This issue is the independent evaluation that decides whether any of them are good enough to ship as LLMTrace's slow-judge.

Separation from #107 is deliberate: training judges its own success on its own validation split. This issue judges success on *held-out* sets the training never saw, with the same benchmark harness (`judge_benchmark.rs`) we already used for gpt-4o-mini.

## Scope

### 1. Pull the LoRA into LLMTrace's benchmark harness

Merge the LoRA adapter into Qwen2.5-0.5B-Instruct using `peft`'s `merge_and_unload`. Save as HF format. Serve with `vllm serve`.
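A minimal sketch of the merge step, assuming the adapter was trained with `peft` against the stock `Qwen/Qwen2.5-0.5B-Instruct` weights. The function name `merge_lora` and the example paths are hypothetical and not taken from the training repo.

```python
def merge_lora(base_model_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into its base model and save plain HF weights."""
    # Imported lazily so this module still loads where the ML stack is absent.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    # Attach the adapter, then merge its deltas into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    # The tokenizer has to sit next to the weights for serving.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(out_dir)


# Example call (placeholder paths, not the repo's real layout):
# merge_lora("Qwen/Qwen2.5-0.5B-Instruct", "checkpoints/adapter", "out/merged")
```

When serving the merged directory, the name vLLM exposes probably has to match `JUDGE_MODEL`; vLLM's `--served-model-name` flag is one way to set it.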

Point `judge_benchmark.rs` at it:

```bash
JUDGE_BASE_URL=http://localhost:8000 \
JUDGE_MODEL=llmtrace-qwen-judge-v1-local \
BENCH_EXTERNAL_DIR=benchmarks/datasets/external \
BENCH_MAX_PER_SET=50 \
./target/release/examples/judge_benchmark
```

### 2. Head-to-head comparison

Compare three models on the identical seed-42 sample set used for the gpt-4o-mini evaluation report (`docs/research/results/judge_evaluation_gpt4o_mini_2026-04-20.md`):

| Model | Role |
|---|---|
| `openai/gpt-4o-mini` | Baseline (already reported) |
| `protectai/deberta-v3-base-prompt-injection-v2` | Fast-judge reference |
| `llmtrace-qwen-judge-v1` | Candidate slow-judge |

Report for each: F1, precision, recall, FPR, per-dataset breakdown.
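For reference, the four headline numbers can be derived from confusion counts as sketched below. This is an illustration of the standard definitions, not the harness's own code; the `Counts` type and the convention that "malicious" is the positive class are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # malicious, flagged
    fp: int  # benign, flagged
    tn: int  # benign, passed
    fn: int  # malicious, passed


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a dataset has no relevant samples.
    return num / den if den else 0.0


def metrics(c: Counts) -> dict:
    precision = _ratio(c.tp, c.tp + c.fp)
    recall = _ratio(c.tp, c.tp + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    fpr = _ratio(c.fp, c.fp + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

Running `metrics` once per dataset and once on the pooled counts gives both the per-dataset breakdown and the overall row.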

### 3. Calibration analysis

For the Qwen judge:

- Reliability diagram (10 buckets): `confidence` on the x-axis, observed accuracy on the y-axis.
- Brier score on the held-out validation set.
- ECE (Expected Calibration Error), 10 buckets.
- Per-category confusion matrix, to show for example whether the judge is strong on direct injection but weak on data exfiltration.
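The calibration numbers above can be computed roughly as follows. The sketch assumes `confidence` is the judge's confidence in its own verdict, in [0, 1], and that `correct` marks whether that verdict matched the label; equal-width buckets are also an assumption. The per-bin tuples from `reliability_bins` are the points of the reliability diagram.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin (count, mean confidence, accuracy) for equal-width bins."""
    sums = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # conf == 1.0 would index past the end, so clamp it into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        sums[idx][0] += 1
        sums[idx][1] += conf
        sums[idx][2] += int(ok)
    return [(n, s / n, k / n) if n else (0, 0.0, 0.0) for n, s, k in sums]


def ece(confidences, correct, n_bins=10):
    """Sample-weighted mean gap between accuracy and confidence per bin."""
    total = len(confidences)
    bins = reliability_bins(confidences, correct, n_bins)
    return sum(n / total * abs(acc - conf) for n, conf, acc in bins if n)


def brier(confidences, correct):
    """Mean squared error between confidence and the 0/1 correctness outcome."""
    pairs = list(zip(confidences, correct))
    return sum((c - int(ok)) ** 2 for c, ok in pairs) / len(pairs)
```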

### 4. Failure-mode analysis

Sample 30 failures (10 false negatives on malicious, 10 false positives on benign, 10 mis-categorised). Read them. Categorise the failure reasons. Attach the list to the report.
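One way to draw the 30 examples reproducibly is sketched below. The record fields (`label`, `verdict`, `category`, `predicted_category`) are hypothetical and stand in for whatever the harness actually writes out; reusing seed 42 and treating "mis-categorised" as a correctly flagged attack with the wrong category are assumptions as well.

```python
import random


def failure_kind(rec):
    """Classify one scored record; returns None when the judge was right."""
    if rec["label"] == "malicious" and rec["verdict"] == "benign":
        return "false_negative"
    if rec["label"] == "benign" and rec["verdict"] == "malicious":
        return "false_positive"
    if rec["label"] == "malicious" and rec["category"] != rec["predicted_category"]:
        return "miscategorised"
    return None


def sample_failures(records, per_kind=10, seed=42):
    """Pick up to `per_kind` failures of each kind with a fixed seed."""
    buckets = {"false_negative": [], "false_positive": [], "miscategorised": []}
    for rec in records:
        kind = failure_kind(rec)
        if kind:
            buckets[kind].append(rec)
    rng = random.Random(seed)
    # A bucket with fewer than `per_kind` failures is returned whole.
    return {k: rng.sample(v, min(per_kind, len(v))) for k, v in buckets.items()}
```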

### 5. Ship-or-no-ship recommendation

Concrete recommendation based on numbers:

- **Ship**: F1 ≥ 0.80 on the 27-corpus set, FPR ≤ 0.05 on `xstest + notinject_samples + benign_samples`, ECE ≤ 0.10.
- **Ship as slow-tier only**: F1 ≥ 0.75 (cascade is more lenient; DeBERTa handles the easy cases).
- **Don't ship**: anything below.
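The gate above can be written as a small function so the report's recommendation is mechanical. This is a sketch with one loud assumption: the issue lists only an F1 threshold for the slow-tier outcome, so the code checks F1 alone there and does not re-apply the FPR or ECE limits. The function name and return strings are hypothetical.

```python
def recommend(f1: float, fpr_benign: float, ece: float) -> str:
    """Map the three headline numbers to a ship decision.

    f1          F1 on the 27-corpus set
    fpr_benign  FPR on xstest + notinject_samples + benign_samples
    ece         10-bucket Expected Calibration Error
    """
    if f1 >= 0.80 and fpr_benign <= 0.05 and ece <= 0.10:
        return "ship"
    # Assumption: the slow-tier gate looks at F1 only, as written in the issue.
    if f1 >= 0.75:
        return "ship-slow-tier-only"
    return "dont-ship"
```

Under this reading, a model with F1 of 0.82 but a benign-set FPR of 0.09 falls through to the slow-tier outcome instead of being rejected outright.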

## Acceptance

- [ ] `docs/research/results/judge_evaluation_qwen-v1_<date>.md` committed — same shape as the gpt-4o-mini report.
- [ ] Head-to-head table with three models.
- [ ] Ship-or-no-ship recommendation with evidence.
- [ ] 30 categorised failure examples attached as an appendix.
