[AI Explainability] Add LLM as a Judge test in LMEval #584

sheltoncyril merged 9 commits into opendatahub-io:main
Conversation
📝 Walkthrough

Adds two LM Eval task configuration constants.

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks: ✅ 3 passed
/verified
The following are automatically added/executed:
Available user actions:
Supported labels: {'/wip', '/build-push-pr-image', '/cherry-pick', '/verified', '/lgtm', '/hold'}
Actionable comments posted: 1
🧹 Nitpick comments (6)
tests/model_explainability/lm_eval/utils.py (2)
105-105: 40-min SUCCEEDED wait: make configurable and align with CI timeouts

Bumping to 40 minutes can hide hangs and exceed CI job-level timeouts. Make this per-call configurable and keep the default here, while letting the slow LLMAAJ test pass a larger value. Also, align status access style and avoid mixing `lmevaljob_pod.Status` and `Pod.Status`.

```diff
-def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod) -> None:
+def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod, succeeded_timeout: int = Timeout.TIMEOUT_20MIN) -> None:
@@
-    lmevaljob_pod.wait_for_status(status=lmevaljob_pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
+    lmevaljob_pod.wait_for_status(status=Pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
     try:
-        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=Timeout.TIMEOUT_40MIN)
+        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=succeeded_timeout)
```

Action: In the LLMAAJ test, pass `succeeded_timeout=Timeout.TIMEOUT_40MIN`.
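The per-call-configurable-default pattern suggested above can be sketched without any test dependencies; the timeout values and function name below are stand-ins mirroring the test utils, not the real helper:

```python
# Hypothetical, dependency-free sketch of the suggested signature change.
# TIMEOUT_* values and wait_for_succeeded() are illustrative stand-ins.
TIMEOUT_20MIN = 20 * 60
TIMEOUT_40MIN = 40 * 60

def wait_for_succeeded(pod_name: str, succeeded_timeout: int = TIMEOUT_20MIN) -> dict:
    """Return the wait parameters a real helper would pass to wait_for_status()."""
    return {"pod": pod_name, "status": "Succeeded", "timeout": succeeded_timeout}

# Most tasks keep the 20-minute default:
default_wait = wait_for_succeeded("lmeval-custom-task")

# Only the slow LLM-as-a-judge case opts into the 40-minute ceiling:
llmaaj_wait = wait_for_succeeded("lmeval-llmaaj", succeeded_timeout=TIMEOUT_40MIN)
```

This keeps the fast default in one place while letting a single slow test widen its own window, instead of raising the ceiling for every test.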
18-26: Docstring default mismatch

Docstring says the default is `TIMEOUT_2MIN`, but the function uses `TIMEOUT_10MIN`.

```diff
-        timeout: How long to wait for the pod, defaults to TIMEOUT_2MIN
+        timeout: How long to wait for the pod, defaults to TIMEOUT_10MIN
```

tests/model_explainability/lm_eval/constants.py (3)
61-62: HF engine fp16 may fail on CPU-only runners

`"use_fp16": true` can error without CUDA. If CI nodes lack GPUs, gate this via config or provide a CPU-safe fallback.

Example:

```diff
-                "use_fp16": true
+                "use_fp16": true,
+                "device": "auto"
```

(Adjust to whatever your inference engine supports.)
80-85: Filtering on string literal "[]" is brittle

The dataset's `reference` appears to be a serialized list. If the format changes, this filter breaks. Prefer `literal_eval` on `reference` first, then compare to an empty list.

```diff
-            {
-                "__type__": "filter_by_condition",
-                "values": {
-                    "reference": "[]"
-                },
-                "condition": "eq"
-            },
+            {
+                "__type__": "literal_eval",
+                "field": "reference"
+            },
+            {
+                "__type__": "filter_by_condition",
+                "values": {
+                    "reference": []
+                },
+                "condition": "eq"
+            },
```

(Assuming `filter_by_condition` supports non-string values.)
41-44: Limit JSON parsing to template literals and simplify with raw strings

The validation script currently attempts to parse every `"value"` field as JSON, even plain instruction strings, causing false errors (e.g. "Be concise…"). Restrict JSON validation to entries whose value begins with `{` (the actual templates), and consider replacing those concatenated, heavily escaped literals with a raw triple-quoted string for clarity and easier review.

tests/model_explainability/lm_eval/test_lm_eval.py (1)
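The suggested restriction can be sketched as a small helper. The entry shape and function name here are assumptions for illustration, not the actual validation script:

```python
import json

def check_template_values(values: list[str]) -> list[str]:
    """Return JSON parse errors, skipping plain instruction strings.

    Only values that begin with '{' are treated as embedded JSON templates;
    everything else (e.g. "Be concise...") passes through unvalidated.
    """
    errors = []
    for value in values:
        if not value.lstrip().startswith("{"):
            continue  # plain string, not an embedded JSON template
        try:
            json.loads(value)
        except json.JSONDecodeError as exc:
            errors.append(f"{value[:30]}...: {exc}")
    return errors

# A plain instruction is skipped, a valid template parses, a truncated one is flagged.
errors = check_template_values([
    "Be concise and answer directly.",
    '{"__type__": "input_output_template", "output_format": "{label}"}',
    '{"__type__": "input_output_template", ',  # truncated: invalid JSON
])
```

With this gate, only the entries that are supposed to be JSON are ever parsed, so instruction strings can no longer produce false errors.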
35-39: Mark LLMAAJ run as slow and pass the extended timeout explicitly

LLM-as-a-judge is heavy. Mark this param as slow (or put it behind an env flag) and, after making `validate_lmeval_job_pod_and_logs` configurable, pass a longer timeout only for this case.

```diff
-        pytest.param(
+        pytest.param(
             {"name": "test-lmeval-hf-llmaaj"},
             lmeval_hf_llmaaj_task_data,
-            id="llmaaj_task",
+            id="llmaaj_task",
+            marks=pytest.mark.slow,
         ),
```

And in the test body:

```python
# Pseudocode
if model_namespace["name"] == "test-lmeval-hf-llmaaj":
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod, succeeded_timeout=Timeout.TIMEOUT_40MIN)
else:
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod)
```

Alternatively, gate with `pytest.mark.skipif(not os.getenv("RUN_LLMAAJ"), reason="...")`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
- tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
- tests/model_explainability/lm_eval/utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/model_explainability/lm_eval/utils.py (1)
utilities/constants.py (1)
`Timeout` (213-224)
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (1)
1-25: Good extraction of custom task payload

Moving the inline payload into a constant improves reuse and readability.
tests/model_explainability/lm_eval/test_lm_eval.py (2)
4-4: Consolidated imports for reusable task payloads

Nice reuse; keeps the test lean.
31-34: Using constant for custom task payload

Good cleanup replacing the inlined dict with `lmeval_hf_custom_task_data`.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/model_explainability/lm_eval/constants.py (1)
27-31: Constant name: align with project abbreviation ("LLMAJ")

Prior comment suggested `LLMAJ_TASK_DATA` (single A). Please standardize to avoid import mismatches.

```diff
-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {
```

If other files already import `LLMAAJ_TASK_DATA`, update those imports too.
🧹 Nitpick comments (3)
tests/model_explainability/lm_eval/constants.py (3)
10-19: Reduce quoting/escaping risk by using triple-quoted strings

The JSON payload is split across many concatenated literals; it is easy to introduce subtle quote/escape bugs. Prefer a single triple-quoted string.

Apply:

```diff
-            "value": '{ "__type__": "input_output_template", '
-            '"input_format": "{text_a_type}: {text_a}\\n'
-            '{text_b_type}: {text_b}", '
-            '"output_format": "{label}", '
-            '"target_prefix": '
-            '"The {type_of_relation} class is ", '
-            '"instruction": "Given a {text_a_type} and {text_b_type} '
-            'classify the {type_of_relation} of the {text_b_type} to one of {classes}.",'
-            ' "postprocessors": [ "processors.take_first_non_empty_line",'
-            ' "processors.lower_case_till_punc" ] }',
+            "value": """{
+                "__type__": "input_output_template",
+                "input_format": "{text_a_type}: {text_a}\n{text_b_type}: {text_b}",
+                "output_format": "{label}",
+                "target_prefix": "The {type_of_relation} class is ",
+                "instruction": "Given a {text_a_type} and {text_b_type} classify the {type_of_relation} of the {text_b_type} to one of {classes}.",
+                "postprocessors": ["processors.take_first_non_empty_line", "processors.lower_case_till_punc"]
+            }""",
```
33-45: Template JSON readability and quoting

Same quoting concerns as above; the string also contains apostrophes (Assistant's) and many escapes. Convert to a triple-quoted block.

```diff
-            "value": '{\n    "__type__": "input_output_template",\n    "instruction":'
-            ' "Please act as an impartial judge and evaluate the quality of the '
-            "response provided by an AI assistant to the user question displayed below."
-            " Your evaluation should consider factors such as the helpfulness, relevance,"
-            " accuracy, depth, creativity, and level of detail of the response. Begin your"
-            " evaluation by providing a short explanation. Be as objective as possible. "
-            "After providing your explanation, you must rate the response on a scale of 1 to 10"
-            ' by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: '
-            '[[5]]\\".\\n\\n",\n    "input_format": "[Question]\\n{question}\\n\\n[The Start '
-            "of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]\",\n    "
-            '"output_format": "[[{rating}]]",\n    "postprocessors": [\n        '
-            '"processors.extract_mt_bench_rating_judgment"\n    ]\n}\n',
+            "value": """{
+                "__type__": "input_output_template",
+                "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: [[5]]\\".",
+                "input_format": "[Question]\\n{question}\\n\\n[The Start of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]",
+                "output_format": "[[{rating}]]",
+                "postprocessors": ["processors.extract_mt_bench_rating_judgment"]
+            }""",
```
74-103: Optional: simplify the embedded task_card JSON with triple quotes

This long single-line JSON is hard to diff/maintain.

```diff
-            "custom": '{\n    "__type__": "task_card",\n    "loader": {...}\n}\n',
+            "custom": """{
+                "__type__": "task_card",
+                "loader": {
+                    "__type__": "load_hf",
+                    "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                    "split": "train"
+                },
+                "preprocess_steps": [ ... ],
+                "task": "tasks.response_assessment.rating.single_turn",
+                "templates": ["templates.response_assessment.rating.mt_bench_single_turn"]
+            }""",
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
- tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (3)
1-25: Good centralization of Unitxt custom task constants

The structure for CUSTOM_UNITXT_TASK_DATA looks coherent and keeps test params in one place.
61-66: Model/format mismatch risk (Granite model with Mistral format)

You're formatting prompts as Mistral but pointing to a Granite-ish model. Confirm the model actually expects Mistral-style instruction formatting, or swap to a tiny Mistral/Llama-compatible model.
Would you like me to propose a tiny HF model and matching format that works offline-friendly?
56-66: Normalize metric config refs and main_score
- Local validation: JSON payloads in the "value" fields are syntactically valid (your validation script printed "OK").
- Issue: "template" uses "templates.*" while "task" is a non-fully-qualified string. Use fully-qualified refs consistently (e.g. "tasks.response_assessment.rating.single_turn" or {"ref": "tasks.response_assessment.rating.single_turn"}).
- Replace the Mistral-specific main_score ("mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn") with a generic metric name used across configs (e.g. "spearman").
Location: tests/model_explainability/lm_eval/constants.py:56-66
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/model_explainability/lm_eval/constants.py (2)
27-27: Rename constant to LLMAJ_TASK_DATA (single 'A') for clarity and consistency

Matches the "LLM as a Judge" abbreviation and the prior review suggestion.

```diff
-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {
```

Follow up: update imports/uses in tests accordingly.

Run to find references:

```shell
#!/bin/bash
rg -nP 'LLMAAJ_TASK_DATA|LLMAJ_TASK_DATA' -g '!**/site-packages/**'
```
69-106: Move template/metrics (and task) out of card; make card.custom an object, not a JSON string

The recipe schema expects `template`/`metrics` as siblings of `card` (and `task` too). Keeping them inside `card` will be ignored or error. Also set `card.custom` to a dict instead of a quoted JSON blob to avoid escaping errors.

```diff
     "taskRecipes": [
         {
-            "card": {
-                "custom": '{\n    "__type__": "task_card",\n    "loader": '
-                '{\n        "__type__": "load_hf",\n        '
-                '"path": "OfirArviv/mt_bench_single_score_gpt4_judgement",\n        '
-                '"split": "train"\n    },\n    "preprocess_steps": [\n        '
-                '{\n            "__type__": "rename_splits",\n            '
-                '"mapper": {\n                "train": "test"\n            }\n        },\n        '
-                '{\n            "__type__": "filter_by_condition",\n            '
-                '"values": {\n                "turn": 1\n            },\n            '
-                '"condition": "eq"\n        },\n        {\n            '
-                '"__type__": "filter_by_condition",\n            '
-                '"values": {\n                "reference": "[]"\n            },\n            '
-                '"condition": "eq"\n        },\n        {\n            '
-                '"__type__": "rename",\n            "field_to_field": {\n                '
-                '"model_input": "question",\n                '
-                '"score": "rating",\n                '
-                '"category": "group",\n                '
-                '"model_output": "answer"\n            }\n        },\n        '
-                '{\n            "__type__": "literal_eval",\n            '
-                '"field": "question"\n        },\n        '
-                '{\n            "__type__": "copy",\n            '
-                '"field": "question/0",\n            '
-                '"to_field": "question"\n        },\n        '
-                '{\n            "__type__": "literal_eval",\n            '
-                '"field": "answer"\n        },\n        {\n            '
-                '"__type__": "copy",\n            '
-                '"field": "answer/0",\n            '
-                '"to_field": "answer"\n        }\n    ],\n    '
-                '"task": "tasks.response_assessment.rating.single_turn",\n    '
-                '"templates": [\n        '
-                '"templates.response_assessment.rating.mt_bench_single_turn"\n    ]\n}\n',
-                "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
-                "metrics": [{"ref": "llmaaj_metric"}],
-            }
+            "card": {
+                "custom": {
+                    "__type__": "task_card",
+                    "loader": {
+                        "__type__": "load_hf",
+                        "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                        "split": "train"
+                    },
+                    "preprocess_steps": [
+                        {"__type__": "rename_splits", "mapper": {"train": "test"}},
+                        {"__type__": "filter_by_condition", "values": {"turn": 1}, "condition": "eq"},
+                        {"__type__": "filter_by_condition", "values": {"reference": "[]"}, "condition": "eq"},
+                        {"__type__": "rename",
+                         "field_to_field": {
+                             "model_input": "question",
+                             "score": "rating",
+                             "category": "group",
+                             "model_output": "answer"
+                         }},
+                        {"__type__": "literal_eval", "field": "question"},
+                        {"__type__": "copy", "field": "question/0", "to_field": "question"},
+                        {"__type__": "literal_eval", "field": "answer"},
+                        {"__type__": "copy", "field": "answer/0", "to_field": "answer"}
+                    ]
+                }
+            },
+            "task": {"ref": "response_assessment.rating.single_turn"},
+            "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
+            "metrics": [{"ref": "llmaaj_metric"}]
         }
     ],
```
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/constants.py (1)
30-67: Reduce risk of escape errors by using dicts + json.dumps for long JSON payloads

The long JSON-in-string pattern is fragile (it already caused one bug). Consider storing Python dicts and serializing with `json.dumps(indent=2)` where a string is required by the downstream API.

If switching now is too invasive, add a lightweight import-time check:

```diff
+import json
+
 # ... inside module, after defining constants:
+# Sanity-check embedded JSON strings parse correctly.
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["templates"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["tasks"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["metrics"][0]["value"])
```
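The dict-first pattern can be illustrated in isolation. The template content below is an abbreviated stand-in, not a full copy of the constants file:

```python
import json

# Author the payload as a Python dict; serialize only at the boundary
# where the downstream API requires a string. (Abbreviated example.)
MT_BENCH_TEMPLATE = {
    "__type__": "input_output_template",
    "output_format": "[[{rating}]]",
    "postprocessors": ["processors.extract_mt_bench_rating_judgment"],
}

# Serialization is mechanical, so quoting/escaping bugs cannot creep in:
template_value = json.dumps(MT_BENCH_TEMPLATE, indent=2)

# Round-trips by construction:
assert json.loads(template_value) == MT_BENCH_TEMPLATE
```

Compared with hand-concatenated string literals, any structural mistake here fails at authoring time as a Python syntax error rather than surfacing later as an invalid JSON payload.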
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
🔇 Additional comments (2)
tests/model_explainability/lm_eval/constants.py (2)
1-25: CUSTOM_UNITXT_TASK_DATA structure looks good

Card/template/systemPrompt wiring matches the earlier pattern.
58-66: main_score is a string label; the current value is valid

Unitxt's llm_as_judge expects `main_score` to be the metric's primary score key (default "llm_as_judge"); confirm your template/postprocessor emits the numeric score under "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn".
/cherry-pick 2.24
* feat: add llmaaj initial
* feat: add LLMAAJ
* feat: increase Timeout for LMEval tasks
* fix: task name in metrics
* feat: increase timeout for lmeval task
* fix: change model name
* fix: akctually remove model format
Cherry pick action created PR #609 successfully 🎉! |
Status of building tag latest: success. |
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit