research(judge-ft): acceptance-gate evaluation + DeBERTa head-to-head #108

@epappas

Context

Child of #90. Blocks on #107 (training completion).

The training loop produces checkpoints. This issue is the independent evaluation that decides whether any of them are good enough to ship as LLMTrace's slow-judge.

Separation from #107 is deliberate: training measures its own success on its own validation split. This issue measures success on held-out sets that training never saw, using the same benchmark harness (judge_benchmark.rs) we already used for gpt-4o-mini.

Scope

1. Pull the LoRA into LLMTrace's benchmark harness

Merge the LoRA adapter into Qwen2.5-0.5B-Instruct using peft's merge_and_unload. Save as HF format. Serve with vllm serve.
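
A minimal sketch of the merge step, assuming the adapter from #107 is a standard peft LoRA checkpoint on a causal-LM base. The adapter and output paths in the comment are hypothetical placeholders, and the Hub id `Qwen/Qwen2.5-0.5B-Instruct` is assumed to be the base the adapter was trained on.

```python
def merge_lora(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into its base model and save a plain HF checkpoint."""
    # Imported lazily so this module loads even without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # merge_and_unload() returns the base model with the LoRA deltas folded in.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir)

    # Save the tokenizer next to the weights so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)


# Example call (both paths are hypothetical):
# merge_lora("Qwen/Qwen2.5-0.5B-Instruct",
#            "checkpoints/judge-lora",
#            "merged/llmtrace-qwen-judge-v1")
```

The merged directory could then be served with something like `vllm serve merged/llmtrace-qwen-judge-v1 --served-model-name llmtrace-qwen-judge-v1-local --port 8000`, so that the served name matches the JUDGE_MODEL value used below. The directory name here is an assumption.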

Point judge_benchmark.rs at it:

JUDGE_BASE_URL=http://localhost:8000 \
JUDGE_MODEL=llmtrace-qwen-judge-v1-local \
BENCH_EXTERNAL_DIR=benchmarks/datasets/external \
BENCH_MAX_PER_SET=50 \
./target/release/examples/judge_benchmark

2. Head-to-head comparison

Compare three models on the identical seed-42 sample set used for the gpt-4o-mini evaluation report (docs/research/results/judge_evaluation_gpt4o_mini_2026-04-20.md):

| Model | Role |
| --- | --- |
| openai/gpt-4o-mini | Baseline (already reported) |
| protectai/deberta-v3-base-prompt-injection-v2 | Fast-judge reference |
| llmtrace-qwen-judge-v1 | Candidate slow-judge |

Report for each: F1, precision, recall, FPR, per-dataset breakdown.
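
For reference, the four headline metrics can be derived from a confusion count as sketched below. This is illustrative Python, not the code in judge_benchmark.rs; the `Counts` type and `metrics` function are hypothetical names, and "positive" is assumed to mean "malicious".

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # malicious, flagged malicious
    fp: int  # benign, flagged malicious
    tn: int  # benign, flagged benign
    fn: int  # malicious, flagged benign


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(c: Counts) -> dict:
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    fpr = safe_div(c.fp, c.fp + c.tn)  # share of benign samples wrongly flagged
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

Running the same function once per dataset gives the per-dataset breakdown; summing the counts first gives the aggregate row.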

3. Calibration analysis

For the Qwen judge:

  • Reliability diagram (10-bucket) — confidence on x-axis, observed accuracy on y-axis.
  • Brier score on held-out val set.
  • ECE (Expected Calibration Error) — 10 buckets.
  • Per-category confusion matrix, to show for example whether the judge is strong on direct injection but weak on data exfiltration.
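
A rough sketch of the two calibration numbers, assuming the judge emits a probability for the "malicious" class (for the Brier score) and a confidence in its predicted label (for ECE). The function names are hypothetical, and equal-width buckets are an assumption since the bucketing scheme is not specified above.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted P(malicious) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence buckets.

    confidences: confidence in the predicted label, in [0, 1]
    correct:     1 if the prediction was right, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bucket
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # Weight each bucket's |confidence - accuracy| gap by its share of samples.
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece
```

The per-bucket (mean confidence, accuracy) pairs computed inside the loop are the same points the reliability diagram plots.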

4. Failure-mode analysis

Sample 30 failures (10 false negatives on malicious, 10 false positives on benign, 10 mis-categorised). Read them. Categorise the failure reasons. Attach the list to the report.
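
One way to draw the 30 failures reproducibly is sketched below. The record keys (`label`, `pred`, `category`, `pred_category`) are hypothetical and would need to match whatever the harness actually writes out; reusing seed 42 here is also an assumption.

```python
import random


def sample_failures(records, per_bucket=10, seed=42):
    """Split failures into the three buckets and sample up to per_bucket from each."""
    buckets = {"false_negative": [], "false_positive": [], "mis_categorised": []}
    for r in records:
        if r["label"] == "malicious" and r["pred"] == "benign":
            buckets["false_negative"].append(r)
        elif r["label"] == "benign" and r["pred"] == "malicious":
            buckets["false_positive"].append(r)
        elif r["label"] == "malicious" and r["category"] != r["pred_category"]:
            # Flagged as malicious, but with the wrong attack category.
            buckets["mis_categorised"].append(r)

    rng = random.Random(seed)  # local RNG so the draw is repeatable
    return {
        name: rng.sample(items, min(per_bucket, len(items)))
        for name, items in buckets.items()
    }
```

If a bucket has fewer than 10 failures, the sketch returns all of them rather than raising, which is worth noting in the report.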

5. Ship-or-no-ship recommendation

Concrete recommendation based on numbers:

  • Ship: F1 ≥ 0.80 on the 27-corpus set, FPR ≤ 0.05 on xstest + notinject_samples + benign_samples, ECE ≤ 0.10.
  • Ship as slow-tier only: F1 ≥ 0.75 (cascade is more lenient; DeBERTa handles the easy cases).
  • Don't ship: anything below.
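
The gate above can be written down as a small function so the recommendation is mechanical. This is a sketch: the function name is hypothetical, and it assumes the slow-tier rule checks only F1, since no FPR or ECE bound is listed for that tier.

```python
def recommend(f1: float, fpr_benign: float, ece: float) -> str:
    """Map the three headline numbers to a ship decision.

    f1:         F1 on the 27-corpus set
    fpr_benign: FPR on xstest + notinject_samples + benign_samples
    ece:        10-bucket Expected Calibration Error
    """
    if f1 >= 0.80 and fpr_benign <= 0.05 and ece <= 0.10:
        return "ship"
    if f1 >= 0.75:
        # Assumption: no FPR/ECE bound applies to the slow-tier-only case.
        return "ship as slow-tier only"
    return "don't ship"
```

For example, `recommend(0.82, 0.08, 0.08)` falls through to "ship as slow-tier only" because the FPR bound for a full ship is missed.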

Acceptance

  • docs/research/results/judge_evaluation_qwen-v1_<date>.md committed — same shape as the gpt-4o-mini report.
  • Head-to-head table with three models.
  • Ship-or-no-ship recommendation with evidence.
  • 30 categorised failure examples attached as an appendix.
