
Adds llm-as-a-judge support via new metric #3534

Open
jbross-ibm-research wants to merge 2 commits into EleutherAI:main from jbross-ibm-research:add_llm_judge_metric

Conversation


jbross-ibm-research commented Jan 27, 2026

Contributing an extension that we are using internally at IBM Research. I tried to implement this in a minimally invasive way, without touching much of the existing logic.

There is a readme file in the docs folder with all the details, and I have added a runnable example task that uses the Flores200 translation benchmark as a test case.

Addresses feature requests #2233 and #1831

LLM-as-a-Judge Metric

The llm_judge metric enables using a remote LLM (via OpenAI-compatible API) to evaluate model responses. This is useful for tasks where automated metrics like BLEU or exact match are insufficient, and human-like judgment is needed.
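As a sketch of what a task configuration could look like, the snippet below wires the metric into a task YAML. The metric name and the template variables ({{ doc.field }}, {{ prediction }}, {{ reference }}) come from this PR; the judge-specific option names and the doc field are illustrative assumptions, so check the readme in the docs folder for the exact schema.

metric_list:
  - metric: llm_judge
    higher_is_better: true
    # The keys below are illustrative; the actual option names are documented
    # in the readme shipped with this PR.
    judge_model: gpt-4o-mini                 # any OpenAI-compatible model name
    base_url: https://api.openai.com/v1      # or a vLLM/Ollama endpoint
    num_concurrent: 32                       # default concurrency per the feature list
    save_details: true
    prompt_template: |
      You are an expert translator. Rate the candidate translation from 1 to 10.
      Source: {{ doc.sentence }}
      Reference: {{ reference }}
      Candidate: {{ prediction }}
      Reply with "Score: <number>" followed by a short explanation.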

Features

  • Configurable via YAML: All parameters can be set in task configuration files
  • Jinja2 prompt templates: Full access to document fields via {{ doc.field }}, {{ prediction }}, {{ reference }}
  • Concurrent API calls: Configurable concurrency (default: 32) for fast batch evaluation
  • Progress tracking: tqdm progress bar during LLM judge evaluation
  • Detailed logging: Save prompts, responses, scores, and explanations to JSONL files
  • OpenAI-compatible: Works with OpenAI API, Claude, vLLM, Ollama, or any compatible endpoint
  • Automatic aggregation: Built-in mean aggregation across all instances
  • Automatic retry: Exponential backoff for transient API errors (rate limits, timeouts)
  • Pre-flight check: Verify API connectivity before batch evaluation
  • Failure threshold: Configurable max error rate to catch API issues

Installation

The llm_judge metric requires the following packages:

pip install openai jinja2 tqdm tenacity

Optional packages:

  • genson - Required for structured JSON outputs: pip install genson

Enabling LLM Judge

LLM judge metrics are disabled by default to prevent accidental API costs. Use the --run_llm_judge flag to enable them.
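
A typical opt-in run could then look like the following; the model and task names are placeholders, while the flags themselves are standard lm_eval options plus the new --run_llm_judge switch:

lm_eval \
  --model hf \
  --model_args pretrained=your_org/your_model \
  --tasks your_llm_judge_task \
  --output_path results/ \
  --run_llm_judge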

How It Works

Architecture (Passthrough Pattern)

The LLM judge uses a passthrough/aggregation pattern similar to BLEU:

  1. llm_judge_fn() - Passthrough function that collects (reference, prediction, doc, config) tuples
  2. llm_judge_agg() - Aggregation function that:
    • Processes all items concurrently with ThreadPoolExecutor
    • Calls the LLM judge API for each item
    • Shows progress with tqdm
    • Stores detailed results for later saving
    • Returns the mean score
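
A condensed Python sketch of the two functions (simplified relative to the real implementation in lm_eval/api/metrics.py; the API-call helper is stubbed out here):

from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

def _call_llm_judge_single(reference, prediction, doc, config):
    # Stub for the real helper, which renders the Jinja2 prompt and calls the
    # OpenAI-compatible endpoint with retries; here it just returns a dummy score.
    return {"score": 1.0, "explanation": "stub"}

def llm_judge_fn(reference, prediction, doc, config):
    # Passthrough: the per-instance "metric" hands the tuple through unchanged
    # so that the aggregator sees every item at once.
    return (reference, prediction, doc, config)

def llm_judge_agg(items, max_workers=32):
    # Aggregation: judge all collected tuples concurrently and return the mean.
    def judge(item):
        return _call_llm_judge_single(*item)["score"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(tqdm(pool.map(judge, items), total=len(items), desc="LLM judge"))
    return sum(scores) / len(scores) if scores else 0.0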

Data Flow

YAML Config → Task → process_results() → llm_judge_fn() [passthrough]
                                               ↓
                                    Collect all (ref, pred, doc, config) tuples
                                               ↓
                                    llm_judge_agg() [concurrent API calls]
                                               ↓
                                    ThreadPoolExecutor + tqdm progress
                                               ↓
                                    Store results → EvaluationTracker saves JSONL
                                               ↓
                                    Return mean score

Output Files

When save_details: true (default) and --output_path is specified, detailed results are saved to:

output_path/
  model_name_sanitized/
    results_<timestamp>.json
    samples_<task>_<timestamp>.jsonl
    llm_judge_<task>_<judge_model>_<timestamp>.jsonl  # LLM judge details

Each line in the JSONL file contains:

{
  "idx": 0,
  "score": 8.5,
  "judgment_raw": "Score: 8.5\n\nThe translation is accurate...",
  "explanation": "The translation is accurate...",
  "formatted_prompt": "You are an expert...",
  "prediction": "Model's response",
  "reference": "Reference answer",
  "error": null
}
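
Since the details file is plain JSONL, a quick post-hoc inspection takes only a few lines of Python (the path below is a placeholder following the naming scheme above):

import json

# Placeholder path; substitute the file written under --output_path.
path = "output/model_name_sanitized/llm_judge_task_judge-model_timestamp.jsonl"

with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

scored = [r for r in records if r.get("error") is None]
print(f"{len(scored)}/{len(records)} items judged successfully")
print("mean score:", sum(r["score"] for r in scored) / len(scored))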

Core Implementation

  • lm_eval/api/metrics.py

    • llm_judge_fn() - Passthrough metric function
    • llm_judge_agg() - Aggregation function with concurrent API calls
    • _call_llm_judge_single() - Single API call helper
    • _render_llm_judge_prompt() - Jinja2 template rendering
    • get_pending_llm_judge_details() - Retrieves collected results for saving
  • lm_eval/api/task.py

    • Modified process_results() to pass (reference, prediction, doc, config) tuples for llm_judge
  • lm_eval/loggers/evaluation_tracker.py

    • Added save_llm_judge_details() method for saving detailed results


CLAassistant commented Jan 27, 2026

CLA assistant check
All committers have signed the CLA.
