
Adds llm-as-a-judge support via new metric #3534

Open
jbross-ibm-research wants to merge 2 commits into EleutherAI:main from jbross-ibm-research:add_llm_judge_metric

Conversation


jbross-ibm-research commented Jan 27, 2026

Contributing an extension that we are using internally at IBM Research. I tried to implement this in a minimally invasive way, without touching much of the existing logic.

There is a readme file in the docs folder with all the details, and I have added a runnable example task that uses the Flores200 translation benchmark as a test case.

Addresses feature requests #2233 and #1831

LLM-as-a-Judge Metric

The llm_judge metric enables using a remote LLM (via OpenAI-compatible API) to evaluate model responses. This is useful for tasks where automated metrics like BLEU or exact match are insufficient, and human-like judgment is needed.
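As a sketch of what a task configuration could look like, the snippet below wires the metric into a task YAML. The metric name and the template variables ({{ doc.field }}, {{ prediction }}, {{ reference }}) come from this PR; the judge-specific option names and the doc field are illustrative assumptions, so check the readme in the docs folder for the exact schema.

metric_list:
  - metric: llm_judge
    higher_is_better: true
    # The keys below are illustrative; the actual option names are documented
    # in the readme shipped with this PR.
    judge_model: gpt-4o-mini                 # any OpenAI-compatible model name
    base_url: https://api.openai.com/v1      # or a vLLM/Ollama endpoint
    num_concurrent: 32                       # default concurrency per the feature list
    save_details: true
    prompt_template: |
      You are an expert translator. Rate the candidate translation from 1 to 10.
      Source: {{ doc.sentence }}
      Reference: {{ reference }}
      Candidate: {{ prediction }}
      Reply with "Score: <number>" followed by a short explanation.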

Features

  • Configurable via YAML: All parameters can be set in task configuration files
  • Jinja2 prompt templates: Full access to document fields via {{ doc.field }}, {{ prediction }}, {{ reference }}
  • Concurrent API calls: Configurable concurrency (default: 32) for fast batch evaluation
  • Progress tracking: tqdm progress bar during LLM judge evaluation
  • Detailed logging: Save prompts, responses, scores, and explanations to JSONL files
  • OpenAI-compatible: Works with OpenAI API, Claude, vLLM, Ollama, or any compatible endpoint
  • Automatic aggregation: Built-in mean aggregation across all instances
  • Automatic retry: Exponential backoff for transient API errors (rate limits, timeouts)
  • Pre-flight check: Verify API connectivity before batch evaluation
  • Failure threshold: Configurable max error rate to catch API issues

Installation

The llm_judge metric requires the following packages:

pip install openai jinja2 tqdm tenacity

Optional packages:

  • genson - Required for structured JSON outputs: pip install genson

Enabling LLM Judge

LLM judge metrics are disabled by default to prevent accidental API costs. Use the --run_llm_judge flag to enable them.
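
A typical opt-in run could then look like the following; the model and task names are placeholders, while the flags themselves are standard lm_eval options plus the new --run_llm_judge switch:

lm_eval \
  --model hf \
  --model_args pretrained=your_org/your_model \
  --tasks your_llm_judge_task \
  --output_path results/ \
  --run_llm_judge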

How It Works

Architecture (Passthrough Pattern)

The LLM judge uses a passthrough/aggregation pattern similar to BLEU:

  1. llm_judge_fn() - Passthrough function that collects (reference, prediction, doc, config) tuples
  2. llm_judge_agg() - Aggregation function that:
    • Processes all items concurrently with ThreadPoolExecutor
    • Calls the LLM judge API for each item
    • Shows progress with tqdm
    • Stores detailed results for later saving
    • Returns the mean score
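
A condensed Python sketch of the two functions (simplified relative to the real implementation in lm_eval/api/metrics.py; the API-call helper is stubbed out here):

from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def _call_llm_judge_single(reference, prediction, doc, config):
    # Stub for the real helper, which renders the Jinja2 prompt and calls the
    # OpenAI-compatible endpoint with retries; here it just returns a dummy score.
    return {"score": 1.0, "explanation": "stub"}

def llm_judge_fn(reference, prediction, doc, config):
    # Passthrough: the per-instance "metric" hands the tuple through unchanged
    # so that the aggregator sees every item at once.
    return (reference, prediction, doc, config)

def llm_judge_agg(items, max_workers=32):
    # Aggregation: judge all collected tuples concurrently and return the mean.
    def judge(item):
        return _call_llm_judge_single(*item)["score"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(tqdm(pool.map(judge, items), total=len(items), desc="LLM judge"))
    return sum(scores) / len(scores) if scores else 0.0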

Data Flow

YAML Config → Task → process_results() → llm_judge_fn() [passthrough]
                                               ↓
                                    Collect all (ref, pred, doc, config) tuples
                                               ↓
                                    llm_judge_agg() [concurrent API calls]
                                               ↓
                                    ThreadPoolExecutor + tqdm progress
                                               ↓
                                    Store results → EvaluationTracker saves JSONL
                                               ↓
                                    Return mean score

Output Files

When save_details: true (default) and --output_path is specified, detailed results are saved to:

output_path/
  model_name_sanitized/
    results_<timestamp>.json
    samples_<task>_<timestamp>.jsonl
    llm_judge_<task>_<judge_model>_<timestamp>.jsonl  # LLM judge details

Each line in the JSONL file contains:

{
  "idx": 0,
  "score": 8.5,
  "judgment_raw": "Score: 8.5\n\nThe translation is accurate...",
  "explanation": "The translation is accurate...",
  "formatted_prompt": "You are an expert...",
  "prediction": "Model's response",
  "reference": "Reference answer",
  "error": null
}
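
Since the details file is plain JSONL, a quick post-hoc inspection takes only a few lines of Python (the path below is a placeholder following the naming scheme above):

import json

# Placeholder path; substitute the file written under --output_path.
path = "output/model_name_sanitized/llm_judge_task_judge-model_timestamp.jsonl"

with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

scored = [r for r in records if r.get("error") is None]
print(f"{len(scored)}/{len(records)} items judged successfully")
print("mean score:", sum(r["score"] for r in scored) / len(scored))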

Core Implementation

  • lm_eval/api/metrics.py

    • llm_judge_fn() - Passthrough metric function
    • llm_judge_agg() - Aggregation function with concurrent API calls
    • _call_llm_judge_single() - Single API call helper
    • _render_llm_judge_prompt() - Jinja2 template rendering
    • get_pending_llm_judge_details() - Retrieves collected results for saving
  • lm_eval/api/task.py

    • Modified process_results() to pass (reference, prediction, doc, config) tuples for llm_judge
  • lm_eval/loggers/evaluation_tracker.py

    • Added save_llm_judge_details() method for saving detailed results


CLAassistant commented Jan 27, 2026

CLA assistant check
All committers have signed the CLA.
