Adds llm-as-a-judge support via new metric #3534
Open
jbross-ibm-research wants to merge 2 commits into EleutherAI:main
Conversation
redpanda1995 approved these changes on Jan 30, 2026
This contributes an extension that we are using internally at IBM Research. The implementation aims to be minimally invasive and avoids touching much of the existing logic.
There is a README in the docs folder with all the details, and a runnable example task that uses the Flores200 translation benchmark is included as a test case.
Addresses feature requests #2233 and #1831
# LLM-as-a-Judge Metric
The `llm_judge` metric enables using a remote LLM (via an OpenAI-compatible API) to evaluate model responses. This is useful for tasks where automated metrics such as BLEU or exact match are insufficient and human-like judgment is needed.

## Features
Judge prompts are Jinja2 templates that can reference `{{ doc.field }}`, `{{ prediction }}`, and `{{ reference }}`; a minimal rendering sketch follows below.
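To show how these variables are used, here is a minimal Jinja2 rendering sketch. The prompt text and the `source_text` field are invented for the example and are not part of the PR.

```python
from jinja2 import Template

# Hypothetical judge prompt; the real prompt text lives in the task config.
JUDGE_PROMPT = """You are an expert translator acting as a judge.
Source text: {{ doc.source_text }}
Model translation: {{ prediction }}
Reference translation: {{ reference }}
Answer with "Score: <0-10>" followed by a short explanation."""

doc = {"source_text": "Bonjour le monde."}   # stand-in task document
prediction = "Hello world."
reference = "Hello, world."

prompt = Template(JUDGE_PROMPT).render(doc=doc, prediction=prediction, reference=reference)
print(prompt)
```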
## Installation

The `llm_judge` metric requires a few additional packages; see the README in the docs folder for the full list.

Optional packages:

- `genson` - Required for structured JSON outputs: `pip install genson`

## Enabling LLM Judge
LLM judge metrics are disabled by default to prevent accidental API costs. Use the `--run_llm_judge` flag to enable them.

## How It Works
### Architecture (Passthrough Pattern)
The LLM judge uses a passthrough/aggregation pattern similar to BLEU:
- `llm_judge_fn()` - Passthrough function that collects `(reference, prediction, doc, config)` tuples
- `llm_judge_agg()` - Aggregation function that issues the judge calls and aggregates the scores (a minimal sketch of the pattern follows below)
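As a rough illustration of this split (not the actual code in `lm_eval/api/metrics.py`), the passthrough metric and its aggregator can be sketched as follows; `score_with_judge` is a stand-in for the real judge call.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def llm_judge_fn(reference, prediction, doc, config):
    # Passthrough: the per-sample "metric" just hands the raw tuple onward.
    return (reference, prediction, doc, config)

def score_with_judge(item):
    # Stand-in for the real judge call (prompt rendering + API request + parsing).
    reference, prediction, _doc, _config = item
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def llm_judge_agg(items, max_workers=8):
    # Aggregation: score every collected tuple (concurrently) and reduce to one number.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_with_judge, items))
    return mean(scores)

collected = [
    llm_judge_fn("Hello, world.", "Hello, world.", {"id": 0}, {}),
    llm_judge_fn("Good morning.", "Morning!", {"id": 1}, {}),
]
print(llm_judge_agg(collected))  # 0.5
```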
### Data Flow

## Output Files
When `save_details: true` (the default) and `--output_path` is specified, detailed per-sample results are saved as a JSONL file. Each line in the JSONL file contains:
{ "idx": 0, "score": 8.5, "judgment_raw": "Score: 8.5\n\nThe translation is accurate...", "explanation": "The translation is accurate...", "formatted_prompt": "You are an expert...", "prediction": "Model's response", "reference": "Reference answer", "error": null }Core Implementation
## Core Implementation

### `lm_eval/api/metrics.py`
- `llm_judge_fn()` - Passthrough metric function
- `llm_judge_agg()` - Aggregation function with concurrent API calls
- `_call_llm_judge_single()` - Single API call helper (a rough sketch of such a call follows below)
- `_render_llm_judge_prompt()` - Jinja2 template rendering
- `get_pending_llm_judge_details()` - Retrieves collected results for saving
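For intuition, a single judge call against an OpenAI-compatible endpoint could look roughly like the sketch below. This is not the actual `_call_llm_judge_single()` implementation; the endpoint, model name, and score-parsing convention are assumptions.

```python
import re
from openai import OpenAI

# Assumed endpoint and credentials; the real values come from the metric config.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def call_judge_once(prompt, model="judge-model", temperature=0.0):
    """Send one rendered judge prompt and parse a numeric score from the reply."""
    resp = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": prompt}],
    )
    judgment_raw = resp.choices[0].message.content
    # Assumes the judge answers in the form "Score: <number> ...".
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", judgment_raw)
    score = float(match.group(1)) if match else None
    return {"score": score, "judgment_raw": judgment_raw, "error": None}
```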
### `lm_eval/api/task.py`

- `process_results()` - Updated to pass `(reference, prediction, doc, config)` tuples for `llm_judge` (schematic example below)
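Schematically (the class and field names are invented for illustration), the task-side change amounts to returning the raw inputs under the metric's key instead of a precomputed score:

```python
class TranslationTaskSketch:
    """Toy stand-in for a task, showing the llm_judge passthrough."""

    def __init__(self, config):
        self.config = config

    def process_results(self, doc, results):
        prediction = results[0]
        reference = doc["reference"]  # assumed document field
        # Hand the raw tuple to the llm_judge aggregator instead of a score.
        return {"llm_judge": (reference, prediction, doc, self.config)}

task = TranslationTaskSketch(config={"judge_model": "judge-model"})
print(task.process_results({"reference": "Hello, world."}, ["Hello world."]))
```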
### `lm_eval/loggers/evaluation_tracker.py`

- `save_llm_judge_details()` - Method for saving detailed results