LLM judges have hidden biases. They favor the first response shown (position bias), prefer longer answers (verbosity bias), and sometimes rate their own outputs higher (self-preference). Most evaluation frameworks bury this under abstraction layers or ignore it entirely.
This tool runs position-swap experiments on your LLM judge and produces a calibration report with concrete metrics: Cohen's Kappa for reliability, position/verbosity/self-preference bias rates, study-oriented agreement metrics, and an overall grade.
```bash
pip install -e ".[dev]"
export OPENAI_API_KEY=sk-...  # or any litellm-supported provider
calibrate examples/sample_pairs.jsonl --model gpt-4o --output report.json
```

Run the TruthfulQA study scaffold:
```bash
python studies/truthfulqa_100/build_truthfulqa_study.py
calibrate-study studies/truthfulqa_100/pairs.jsonl \
  --model gpt-5.4 \
  --output studies/truthfulqa_100/results/gpt-5.4.json
```

Launch the review UI:
```bash
label-ui studies/truthfulqa_100/pairs.jsonl \
  --config studies/truthfulqa_100/label_ui_config.json \
  --output studies/truthfulqa_100/pairs.reviewed.jsonl
```

- Each evaluation pair is judged twice: once in the original order (A, B) and once swapped (B, A)
- Swapped judgments are mapped back to original labels
- Agreement between the two runs is measured via Cohen's Kappa (sketched in code after this list)
- Bias detectors flag position, verbosity, and self-preference patterns
- An overall grade (A-F) summarizes judge reliability
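
The core of this loop fits in a few lines. Below is a minimal sketch, assuming a `judge(prompt, a, b)` callable that returns `"A"` or `"B"`; the function names are illustrative and not this package's internal API.

```python
# Hypothetical sketch of the position-swap protocol. `judge` stands in
# for whatever LLM call you use; it is not this tool's internal API.

def swap_experiment(pairs, judge):
    """Judge each pair twice (original and swapped order) and return
    the two verdict sequences, both mapped to the original (A, B) frame."""
    run1, run2 = [], []
    for p in pairs:
        first = judge(p["prompt"], p["response_a"], p["response_b"])
        swapped = judge(p["prompt"], p["response_b"], p["response_a"])
        # Map the swapped verdict back: "A" in the swapped order
        # means response_b actually won.
        run1.append(first)
        run2.append("B" if swapped == "A" else "A")
    return run1, run2

def cohens_kappa(run1, run2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two binary raters."""
    n = len(run1)
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # Expected chance agreement from each run's marginal label frequencies.
    p_e = sum(
        (run1.count(lbl) / n) * (run2.count(lbl) / n)
        for lbl in ("A", "B")
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```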
JSONL with one object per line:

```json
{"id": "pair-1", "prompt": "...", "response_a": "...", "response_b": "...", "metadata": {}}
```

Optional `metadata.model_a` and `metadata.model_b` fields enable self-preference detection. The study workflow also supports optional fields such as `reference_answer`, `pair_type`, `human_label`, `label_reason`, `preferred_side`, and `subset_tags`.
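
For illustration, a minimal loader for this schema might look like the sketch below; `load_pairs` and `REQUIRED` are hypothetical names, not part of this package's API.

```python
import json

# Keys every pair object must carry (per the schema above).
REQUIRED = ("id", "prompt", "response_a", "response_b")

def load_pairs(path):
    """Read evaluation pairs from a JSONL file, checking required keys."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            obj = json.loads(line)
            missing = [k for k in REQUIRED if k not in obj]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            obj.setdefault("metadata", {})  # metadata is optional
            pairs.append(obj)
    return pairs
```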
```text
============================================================
LLM JUDGE CALIBRATION REPORT
============================================================
Overall Grade: B
--- Inter-Rater Reliability ---
Cohen's Kappa: 0.650 (substantial)
Percent Agreement: 80.0%
Pairs Evaluated: 10
--- Bias Detection ---
Position Bias: 20.0% (favors_first)
Verbosity Bias: r=0.312 (favors_longer)
============================================================
```
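
Under one common set of definitions, these bias numbers are straightforward to reproduce. A sketch follows, taking verdict lists in the shape produced by the protocol sketch above; the tool's exact formulas may differ.

```python
from statistics import fmean

def position_bias(run1, run2):
    """Rate and direction of order-dependent verdicts. Both runs are in
    the original (A, B) frame, so a disagreement means the verdict
    followed the presentation slot rather than the content."""
    favors_first = sum(a == "A" and b == "B" for a, b in zip(run1, run2))
    favors_last = sum(a == "B" and b == "A" for a, b in zip(run1, run2))
    rate = (favors_first + favors_last) / len(run1)
    direction = "favors_first" if favors_first >= favors_last else "favors_last"
    return rate, direction

def verbosity_bias_r(pairs, verdicts):
    """Point-biserial correlation between the length gap (A minus B)
    and a win for A. Positive r means longer responses tend to win."""
    xs = [len(p["response_a"]) - len(p["response_b"]) for p in pairs]
    ys = [1.0 if v == "A" else 0.0 for v in verdicts]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```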
| Grade | Criteria |
|---|---|
| A | Kappa > 0.8, position bias < 10% |
| B | Kappa > 0.6, position bias < 20% |
| C | Kappa > 0.4, position bias < 30% |
| D | Kappa > 0.2 |
| F | Kappa <= 0.2 |
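
Read as thresholds, the table translates directly into a small function; the sketch below is illustrative, not necessarily how the package computes the grade.

```python
def overall_grade(kappa: float, position_bias_rate: float) -> str:
    """Map Cohen's kappa and the position-bias rate onto the A-F scale
    from the table above (rates as fractions, not percentages)."""
    if kappa > 0.8 and position_bias_rate < 0.10:
        return "A"
    if kappa > 0.6 and position_bias_rate < 0.20:
        return "B"
    if kappa > 0.4 and position_bias_rate < 0.30:
        return "C"
    if kappa > 0.2:
        return "D"
    return "F"
```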
- FastChat/MT-Bench: Position swap but no statistical metrics
- DeepEval: Bias detection buried in a large framework
- This tool: Standalone diagnostic. One command, one report, actionable numbers.
- TruthfulQA study scaffold
- Exploratory pilot benchmark
- OpenAI Evals export via `export-openai-evals` or `calibrate-study --export-openai-evals ...`
- Human review UI via `label-ui`
```bash
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/
```