LLM judges have hidden biases. They favor the first response shown (position bias), prefer longer answers (verbosity bias), and sometimes rate their own outputs higher (self-preference). Most evaluation frameworks bury this under abstraction layers or ignore it entirely.
This tool runs position-swap experiments on your LLM judge and produces a calibration report with concrete metrics: Cohen's Kappa for reliability, position/verbosity/self-preference bias rates, study-oriented agreement metrics, and an overall grade.
```bash
pip install -e ".[dev]"
export OPENAI_API_KEY=sk-...  # or any litellm-supported provider
calibrate examples/sample_pairs.jsonl --model gpt-4o --output report.json
```

Run the TruthfulQA study scaffold:
```bash
python studies/truthfulqa_100/build_truthfulqa_study.py
calibrate-study studies/truthfulqa_100/pairs.jsonl \
  --model gpt-5.4 \
  --output studies/truthfulqa_100/results/gpt-5.4.json
```

Launch the review UI:
```bash
label-ui studies/truthfulqa_100/pairs.jsonl \
  --config studies/truthfulqa_100/label_ui_config.json \
  --output studies/truthfulqa_100/pairs.reviewed.jsonl
```

- Each evaluation pair is judged twice: once in the original order (A, B) and once swapped (B, A)
- Swapped judgments are mapped back to original labels
- Agreement between the two runs is measured via Cohen's Kappa (sketched in code after this list)
- Bias detectors flag position, verbosity, and self-preference patterns
- An overall grade (A-F) summarizes judge reliability
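
The core of this loop fits in a few lines. Below is a minimal sketch, assuming a `judge(prompt, a, b)` callable that returns `"A"` or `"B"`; the function names are illustrative and not this package's internal API.

```python
# Hypothetical sketch of the position-swap protocol. `judge` stands in
# for whatever LLM call you use; it is not this tool's internal API.

def swap_experiment(pairs, judge):
    """Judge each pair twice (original and swapped order) and return
    the two verdict sequences, both mapped to the original (A, B) frame."""
    run1, run2 = [], []
    for p in pairs:
        first = judge(p["prompt"], p["response_a"], p["response_b"])
        swapped = judge(p["prompt"], p["response_b"], p["response_a"])
        # Map the swapped verdict back: "A" in the swapped order
        # means response_b actually won.
        run1.append(first)
        run2.append("B" if swapped == "A" else "A")
    return run1, run2

def cohens_kappa(run1, run2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two binary raters."""
    n = len(run1)
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # Expected chance agreement from each run's marginal label frequencies.
    p_e = sum(
        (run1.count(lbl) / n) * (run2.count(lbl) / n)
        for lbl in ("A", "B")
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```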
JSONL with one object per line:

```json
{"id": "pair-1", "prompt": "...", "response_a": "...", "response_b": "...", "metadata": {}}
```

Optional `metadata.model_a` and `metadata.model_b` fields enable self-preference detection. The study workflow also supports optional fields such as `reference_answer`, `pair_type`, `human_label`, `label_reason`, `preferred_side`, and `subset_tags`.
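
For illustration, a minimal loader for this schema might look like the sketch below; `load_pairs` and `REQUIRED` are hypothetical names, not part of this package's API.

```python
import json

# Keys every pair object must carry (per the schema above).
REQUIRED = ("id", "prompt", "response_a", "response_b")

def load_pairs(path):
    """Read evaluation pairs from a JSONL file, checking required keys."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            obj = json.loads(line)
            missing = [k for k in REQUIRED if k not in obj]
            if missing:
                raise ValueError(f"line {lineno}: missing fields {missing}")
            obj.setdefault("metadata", {})  # metadata is optional
            pairs.append(obj)
    return pairs
```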
```text
============================================================
LLM JUDGE CALIBRATION REPORT
============================================================
Overall Grade: B
--- Inter-Rater Reliability ---
Cohen's Kappa: 0.650 (substantial)
Percent Agreement: 80.0%
Pairs Evaluated: 10
--- Bias Detection ---
Position Bias: 20.0% (favors_first)
Verbosity Bias: r=0.312 (favors_longer)
============================================================
```
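
Under one common set of definitions, these bias numbers are straightforward to reproduce. A sketch follows, taking verdict lists in the shape produced by the protocol sketch above; the tool's exact formulas may differ.

```python
from statistics import fmean

def position_bias(run1, run2):
    """Rate and direction of order-dependent verdicts. Both runs are in
    the original (A, B) frame, so a disagreement means the verdict
    followed the presentation slot rather than the content."""
    favors_first = sum(a == "A" and b == "B" for a, b in zip(run1, run2))
    favors_last = sum(a == "B" and b == "A" for a, b in zip(run1, run2))
    rate = (favors_first + favors_last) / len(run1)
    direction = "favors_first" if favors_first >= favors_last else "favors_last"
    return rate, direction

def verbosity_bias_r(pairs, verdicts):
    """Point-biserial correlation between the length gap (A minus B)
    and a win for A. Positive r means longer responses tend to win."""
    xs = [len(p["response_a"]) - len(p["response_b"]) for p in pairs]
    ys = [1.0 if v == "A" else 0.0 for v in verdicts]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```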
| Grade | Criteria |
|---|---|
| A | Kappa > 0.8, position bias < 10% |
| B | Kappa > 0.6, position bias < 20% |
| C | Kappa > 0.4, position bias < 30% |
| D | Kappa > 0.2 |
| F | Kappa <= 0.2 |
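
Read as thresholds, the table translates directly into a small function; the sketch below is illustrative, not necessarily how the package computes the grade.

```python
def overall_grade(kappa: float, position_bias_rate: float) -> str:
    """Map Cohen's kappa and the position-bias rate onto the A-F scale
    from the table above (rates as fractions, not percentages)."""
    if kappa > 0.8 and position_bias_rate < 0.10:
        return "A"
    if kappa > 0.6 and position_bias_rate < 0.20:
        return "B"
    if kappa > 0.4 and position_bias_rate < 0.30:
        return "C"
    if kappa > 0.2:
        return "D"
    return "F"
```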
- FastChat/MT-Bench: Position swap but no statistical metrics
- DeepEval: Bias detection buried in a large framework
- This tool: Standalone diagnostic. One command, one report, actionable numbers.
- TruthfulQA study scaffold
- Exploratory pilot benchmark
- OpenAI Evals export via `export-openai-evals` or `calibrate-study --export-openai-evals ...`
- Human review UI via `label-ui`
```bash
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/
```