llm-judge-calibrator

LLM judges have hidden biases. They favor the first response shown (position bias), prefer longer answers (verbosity bias), and sometimes rate their own outputs higher (self-preference). Most evaluation frameworks bury this under abstraction layers or ignore it entirely.

This tool runs position-swap experiments on your LLM judge and produces a calibration report with concrete metrics: Cohen's Kappa for reliability, position/verbosity/self-preference bias rates, study-oriented agreement metrics, and an overall grade.

Quickstart

pip install -e ".[dev]"
export OPENAI_API_KEY=sk-...  # or any litellm-supported provider
calibrate examples/sample_pairs.jsonl --model gpt-4o --output report.json

Run the TruthfulQA study scaffold:

python studies/truthfulqa_100/build_truthfulqa_study.py
calibrate-study studies/truthfulqa_100/pairs.jsonl \
  --model gpt-5.4 \
  --output studies/truthfulqa_100/results/gpt-5.4.json

Launch the review UI:

label-ui studies/truthfulqa_100/pairs.jsonl \
  --config studies/truthfulqa_100/label_ui_config.json \
  --output studies/truthfulqa_100/pairs.reviewed.jsonl

How it works

  1. Each evaluation pair is judged twice: once in the original order (A, B) and once swapped (B, A)
  2. Swapped judgments are mapped back to original labels
  3. Agreement between the two runs is measured via Cohen's Kappa
  4. Bias detectors flag position, verbosity, and self-preference patterns
  5. An overall grade (A-F) summarizes judge reliability
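
A minimal sketch of the swap-and-remap bookkeeping described above, assuming a placeholder judge(prompt, first, second) call that returns "first", "second", or "tie" (function names and return values are illustrative, not the tool's internals):

from collections import Counter

def cohens_kappa(run_1, run_2):
    # Observed agreement between the two runs over the same pairs.
    n = len(run_1)
    observed = sum(a == b for a, b in zip(run_1, run_2)) / n
    # Expected agreement if the two runs labelled pairs independently.
    c1, c2 = Counter(run_1), Counter(run_2)
    expected = sum(c1[label] * c2[label] for label in set(run_1) | set(run_2)) / (n * n)
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

def calibrate(pairs, judge):
    original, swapped = [], []
    for p in pairs:
        # Run 1: original order (A, B); "first" means A won.
        v1 = judge(p["prompt"], p["response_a"], p["response_b"])
        original.append({"first": "A", "second": "B"}.get(v1, "tie"))
        # Run 2: swapped order (B, A); map the verdict back, so "first" means B won.
        v2 = judge(p["prompt"], p["response_b"], p["response_a"])
        swapped.append({"first": "B", "second": "A"}.get(v2, "tie"))
    # Agreement across the two runs, and how often the verdict flips with ordering.
    return {
        "kappa": cohens_kappa(original, swapped),
        "flip_rate": sum(a != b for a, b in zip(original, swapped)) / len(pairs),
    }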

Input format

JSONL with one object per line:

{"id": "pair-1", "prompt": "...", "response_a": "...", "response_b": "...", "metadata": {}}

Optional metadata.model_a and metadata.model_b fields enable self-preference detection.

The study workflow also supports optional fields such as reference_answer, pair_type, human_label, label_reason, preferred_side, and subset_tags.
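
A pairs file can be produced with a few lines of Python; the prompt text and model names below are purely illustrative:

import json

pairs = [
    {
        "id": "pair-1",
        "prompt": "What causes the seasons on Earth?",
        "response_a": "The tilt of Earth's rotational axis relative to its orbital plane.",
        "response_b": "The seasons happen because Earth moves closer to and farther from the Sun over the year.",
        # Optional: model names enable self-preference detection.
        "metadata": {"model_a": "gpt-4o", "model_b": "claude-3-5-sonnet"},
    },
]

with open("pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")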

Example output

============================================================
  LLM JUDGE CALIBRATION REPORT
============================================================

  Overall Grade:  B

  --- Inter-Rater Reliability ---
  Cohen's Kappa:     0.650 (substantial)
  Percent Agreement: 80.0%
  Pairs Evaluated:   10

  --- Bias Detection ---
  Position Bias:     20.0% (favors_first)
  Verbosity Bias:    r=0.312 (favors_longer)

============================================================

Grading scale

Grade  Criteria
A      Kappa > 0.8, position bias < 10%
B      Kappa > 0.6, position bias < 20%
C      Kappa > 0.4, position bias < 30%
D      Kappa > 0.2
F      Kappa <= 0.2
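
Read as code, the table maps onto something like the sketch below; the function name and the strict-vs-inclusive boundary handling are assumptions, not the tool's actual implementation:

def overall_grade(kappa, position_bias_rate):
    # position_bias_rate is a fraction (0.20 == 20%).
    if kappa > 0.8 and position_bias_rate < 0.10:
        return "A"
    if kappa > 0.6 and position_bias_rate < 0.20:
        return "B"
    if kappa > 0.4 and position_bias_rate < 0.30:
        return "C"
    if kappa > 0.2:
        return "D"
    return "F"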

Alternatives

  • FastChat/MT-Bench: Position swap but no statistical metrics
  • DeepEval: Bias detection buried in a large framework
  • This tool: Standalone diagnostic. One command, one report, actionable numbers.

Study Assets

Development

pip install -e ".[dev]"
pytest tests/ -v
ruff check src/
