[AI Explainability] Add LLM as a Judge test in LMEval #584

sheltoncyril merged 9 commits into opendatahub-io:main
Conversation
📝 Walkthrough

Adds two LM Eval task configuration constants.

Changes

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks: ✅ 3 passed
/verified
The following are automatically added/executed:
Available user actions:
Supported labels: {'/wip', '/build-push-pr-image', '/cherry-pick', '/verified', '/lgtm', '/hold'}
Actionable comments posted: 1
🧹 Nitpick comments (6)
tests/model_explainability/lm_eval/utils.py (2)
105-105: 40-min SUCCEEDED wait: make configurable and align with CI timeouts

Bumping to 40 minutes can hide hangs and exceed CI job-level timeouts. Make this per-call configurable and keep the default here, while letting the slow LLMAAJ test pass a larger value. Also, align status access style and avoid mixing `lmevaljob_pod.Status` and `Pod.Status`.

```diff
-def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod) -> None:
+def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod, succeeded_timeout: int = Timeout.TIMEOUT_20MIN) -> None:
@@
-    lmevaljob_pod.wait_for_status(status=lmevaljob_pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
+    lmevaljob_pod.wait_for_status(status=Pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
     try:
-        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=Timeout.TIMEOUT_40MIN)
+        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=succeeded_timeout)
```

Action: In the LLMAAJ test, pass `succeeded_timeout=Timeout.TIMEOUT_40MIN`.
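The per-call-configurable-default pattern suggested above can be sketched without any test dependencies; the timeout values and function name below are stand-ins mirroring the test utils, not the real helper:

```python
# Hypothetical, dependency-free sketch of the suggested signature change.
# TIMEOUT_* values and wait_for_succeeded() are illustrative stand-ins.
TIMEOUT_20MIN = 20 * 60
TIMEOUT_40MIN = 40 * 60

def wait_for_succeeded(pod_name: str, succeeded_timeout: int = TIMEOUT_20MIN) -> dict:
    """Return the wait parameters a real helper would pass to wait_for_status()."""
    return {"pod": pod_name, "status": "Succeeded", "timeout": succeeded_timeout}

# Most tasks keep the 20-minute default:
default_wait = wait_for_succeeded("lmeval-custom-task")

# Only the slow LLM-as-a-judge case opts into the 40-minute ceiling:
llmaaj_wait = wait_for_succeeded("lmeval-llmaaj", succeeded_timeout=TIMEOUT_40MIN)
```

This keeps the fast default in one place while letting a single slow test widen its own window, instead of raising the ceiling for every test.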
18-26: Docstring default mismatch

Docstring says the default is `TIMEOUT_2MIN`, but the function uses `TIMEOUT_10MIN`.

```diff
-        timeout: How long to wait for the pod, defaults to TIMEOUT_2MIN
+        timeout: How long to wait for the pod, defaults to TIMEOUT_10MIN
```

tests/model_explainability/lm_eval/constants.py (3)
61-62: HF engine fp16 may fail on CPU-only runners

`"use_fp16": true` can error without CUDA. If CI nodes lack GPUs, gate this via config or provide a CPU-safe fallback.

Example:

```diff
-                "use_fp16": true
+                "use_fp16": true,
+                "device": "auto"
```

(Adjust to whatever your inference engine supports.)
80-85: Filtering on string literal "[]" is brittle

The dataset's `reference` appears to be a serialized list. If the format changes, this filter breaks. Prefer `literal_eval` on `reference` first, then compare to an empty list.

```diff
-            {
-                "__type__": "filter_by_condition",
-                "values": {
-                    "reference": "[]"
-                },
-                "condition": "eq"
-            },
+            {
+                "__type__": "literal_eval",
+                "field": "reference"
+            },
+            {
+                "__type__": "filter_by_condition",
+                "values": {
+                    "reference": []
+                },
+                "condition": "eq"
+            },
```

(Assuming `filter_by_condition` supports non-string values.)
41-44: Limit JSON parsing to template literals and simplify with raw strings

The validation script currently attempts to parse every `"value"` field as JSON, even plain instruction strings, causing false errors (e.g. "Be concise…"). Restrict JSON validation to entries whose value begins with `{` (the actual templates), and consider replacing those concatenated, heavily escaped literals with a raw triple-quoted string for clarity and easier review.

tests/model_explainability/lm_eval/test_lm_eval.py (1)
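The suggested restriction can be sketched as a small helper. The entry shape and function name here are assumptions for illustration, not the actual validation script:

```python
import json

def check_template_values(values: list[str]) -> list[str]:
    """Return JSON parse errors, skipping plain instruction strings.

    Only values that begin with '{' are treated as embedded JSON templates;
    everything else (e.g. "Be concise...") passes through unvalidated.
    """
    errors = []
    for value in values:
        if not value.lstrip().startswith("{"):
            continue  # plain string, not an embedded JSON template
        try:
            json.loads(value)
        except json.JSONDecodeError as exc:
            errors.append(f"{value[:30]}...: {exc}")
    return errors

# A plain instruction is skipped, a valid template parses, a truncated one is flagged.
errors = check_template_values([
    "Be concise and answer directly.",
    '{"__type__": "input_output_template", "output_format": "{label}"}',
    '{"__type__": "input_output_template", ',  # truncated: invalid JSON
])
```

With this gate, only the entries that are supposed to be JSON are ever parsed, so instruction strings can no longer produce false errors.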
35-39: Mark LLMAAJ run as slow and pass the extended timeout explicitly

LLM-as-a-judge is heavy. Mark this param as slow (or put it behind an env flag) and, after making `validate_lmeval_job_pod_and_logs` configurable, pass a longer timeout only for this case.

```diff
-        pytest.param(
+        pytest.param(
             {"name": "test-lmeval-hf-llmaaj"},
             lmeval_hf_llmaaj_task_data,
-            id="llmaaj_task",
+            id="llmaaj_task",
+            marks=pytest.mark.slow,
         ),
```

And in the test body:

```python
# Pseudocode
if model_namespace["name"] == "test-lmeval-hf-llmaaj":
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod, succeeded_timeout=Timeout.TIMEOUT_40MIN)
else:
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod)
```

Alternatively, gate with `pytest.mark.skipif(not os.getenv("RUN_LLMAAJ"), reason="...")`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
- tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
- tests/model_explainability/lm_eval/utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/model_explainability/lm_eval/utils.py (1)
utilities/constants.py (1)
`Timeout` (213-224)
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (1)
1-25: Good extraction of custom task payload

Moving the inline payload into a constant improves reuse and readability.
tests/model_explainability/lm_eval/test_lm_eval.py (2)
4-4: Consolidated imports for reusable task payloads

Nice reuse; keeps the test lean.
31-34: Using constant for custom task payload

Good cleanup replacing the inlined dict with `lmeval_hf_custom_task_data`.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tests/model_explainability/lm_eval/constants.py (1)
27-31: Constant name: align with project abbreviation ("LLMAJ")

Prior comment suggested `LLMAJ_TASK_DATA` (single A). Please standardize to avoid import mismatches.

```diff
-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {
```

If other files already import `LLMAAJ_TASK_DATA`, update those imports too.
🧹 Nitpick comments (3)
tests/model_explainability/lm_eval/constants.py (3)
10-19: Reduce quoting/escaping risk by using triple-quoted strings

The JSON payload is split across many concatenated literals; it is easy to introduce subtle quote/escape bugs. Prefer a single triple-quoted string.

Apply:

```diff
-            "value": '{ "__type__": "input_output_template", '
-            '"input_format": "{text_a_type}: {text_a}\\n'
-            '{text_b_type}: {text_b}", '
-            '"output_format": "{label}", '
-            '"target_prefix": '
-            '"The {type_of_relation} class is ", '
-            '"instruction": "Given a {text_a_type} and {text_b_type} '
-            'classify the {type_of_relation} of the {text_b_type} to one of {classes}.",'
-            ' "postprocessors": [ "processors.take_first_non_empty_line",'
-            ' "processors.lower_case_till_punc" ] }',
+            "value": """{
+                "__type__": "input_output_template",
+                "input_format": "{text_a_type}: {text_a}\n{text_b_type}: {text_b}",
+                "output_format": "{label}",
+                "target_prefix": "The {type_of_relation} class is ",
+                "instruction": "Given a {text_a_type} and {text_b_type} classify the {type_of_relation} of the {text_b_type} to one of {classes}.",
+                "postprocessors": ["processors.take_first_non_empty_line", "processors.lower_case_till_punc"]
+            }""",
```
33-45: Template JSON readability and quoting

Same quoting concerns as above; the string also contains apostrophes (Assistant's) and many escapes. Convert to a triple-quoted block.

```diff
-            "value": '{\n    "__type__": "input_output_template",\n    "instruction":'
-            ' "Please act as an impartial judge and evaluate the quality of the '
-            "response provided by an AI assistant to the user question displayed below."
-            " Your evaluation should consider factors such as the helpfulness, relevance,"
-            " accuracy, depth, creativity, and level of detail of the response. Begin your"
-            " evaluation by providing a short explanation. Be as objective as possible. "
-            "After providing your explanation, you must rate the response on a scale of 1 to 10"
-            ' by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: '
-            '[[5]]\\".\\n\\n",\n    "input_format": "[Question]\\n{question}\\n\\n[The Start '
-            "of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]\",\n    "
-            '"output_format": "[[{rating}]]",\n    "postprocessors": [\n        '
-            '"processors.extract_mt_bench_rating_judgment"\n    ]\n}\n',
+            "value": """{
+                "__type__": "input_output_template",
+                "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: [[5]]\\".",
+                "input_format": "[Question]\\n{question}\\n\\n[The Start of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]",
+                "output_format": "[[{rating}]]",
+                "postprocessors": ["processors.extract_mt_bench_rating_judgment"]
+            }""",
```
74-103: Optional: simplify the embedded task_card JSON with triple quotes

This long single-line JSON is hard to diff/maintain.

```diff
-            "custom": '{\n    "__type__": "task_card",\n    "loader": {...}\n}\n',
+            "custom": """{
+                "__type__": "task_card",
+                "loader": {
+                    "__type__": "load_hf",
+                    "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                    "split": "train"
+                },
+                "preprocess_steps": [ ... ],
+                "task": "tasks.response_assessment.rating.single_turn",
+                "templates": ["templates.response_assessment.rating.mt_bench_single_turn"]
+            }""",
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
- tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (3)
1-25: Good centralization of Unitxt custom task constants

The structure for CUSTOM_UNITXT_TASK_DATA looks coherent and keeps test params in one place.
61-66: Model/format mismatch risk (Granite model with Mistral format)

You're formatting prompts as Mistral but pointing to a Granite-ish model. Confirm the model actually expects Mistral-style instruction formatting, or swap to a tiny Mistral/Llama-compatible model.
Would you like me to propose a tiny HF model and matching format that works offline-friendly?
56-66: Normalize metric config refs and main_score
- Local validation: JSON payloads in the "value" fields are syntactically valid (your validation script printed "OK").
- Issue: "template" uses "templates.*" while "task" is a non-fully-qualified string. Use fully-qualified refs consistently (e.g. "tasks.response_assessment.rating.single_turn" or {"ref": "tasks.response_assessment.rating.single_turn"}).
- Replace the Mistral-specific main_score ("mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn") with a generic metric name used across configs (e.g. "spearman").
Location: tests/model_explainability/lm_eval/constants.py:56-66
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/model_explainability/lm_eval/constants.py (2)
27-27: Rename constant to LLMAJ_TASK_DATA (single 'A') for clarity and consistency

Matches the "LLM as a Judge" abbreviation and the prior review suggestion.

```diff
-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {
```

Follow up: update imports/uses in tests accordingly.

Run to find references:

```shell
#!/bin/bash
rg -nP 'LLMAAJ_TASK_DATA|LLMAJ_TASK_DATA' -g '!**/site-packages/**'
```
69-106: Move template/metrics (and task) out of card; make card.custom an object, not a JSON string

The recipe schema expects `template`/`metrics` as siblings of `card` (and `task` too). Keeping them inside `card` will be ignored or error. Also set `card.custom` to a dict instead of a quoted JSON blob to avoid escaping errors.

```diff
     "taskRecipes": [
         {
-            "card": {
-                "custom": '{\n    "__type__": "task_card",\n    "loader": '
-                '{\n        "__type__": "load_hf",\n        '
-                '"path": "OfirArviv/mt_bench_single_score_gpt4_judgement",\n        '
-                '"split": "train"\n    },\n    "preprocess_steps": [\n        '
-                '{\n            "__type__": "rename_splits",\n            '
-                '"mapper": {\n                "train": "test"\n            }\n        },\n        '
-                '{\n            "__type__": "filter_by_condition",\n            '
-                '"values": {\n                "turn": 1\n            },\n            '
-                '"condition": "eq"\n        },\n        {\n            '
-                '"__type__": "filter_by_condition",\n            '
-                '"values": {\n                "reference": "[]"\n            },\n            '
-                '"condition": "eq"\n        },\n        {\n            '
-                '"__type__": "rename",\n            "field_to_field": {\n                '
-                '"model_input": "question",\n                '
-                '"score": "rating",\n                '
-                '"category": "group",\n                '
-                '"model_output": "answer"\n            }\n        },\n        '
-                '{\n            "__type__": "literal_eval",\n            '
-                '"field": "question"\n        },\n        '
-                '{\n            "__type__": "copy",\n            '
-                '"field": "question/0",\n            '
-                '"to_field": "question"\n        },\n        '
-                '{\n            "__type__": "literal_eval",\n            '
-                '"field": "answer"\n        },\n        {\n            '
-                '"__type__": "copy",\n            '
-                '"field": "answer/0",\n            '
-                '"to_field": "answer"\n        }\n    ],\n    '
-                '"task": "tasks.response_assessment.rating.single_turn",\n    '
-                '"templates": [\n        '
-                '"templates.response_assessment.rating.mt_bench_single_turn"\n    ]\n}\n',
-                "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
-                "metrics": [{"ref": "llmaaj_metric"}],
-            }
+            "card": {
+                "custom": {
+                    "__type__": "task_card",
+                    "loader": {
+                        "__type__": "load_hf",
+                        "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                        "split": "train"
+                    },
+                    "preprocess_steps": [
+                        {"__type__": "rename_splits", "mapper": {"train": "test"}},
+                        {"__type__": "filter_by_condition", "values": {"turn": 1}, "condition": "eq"},
+                        {"__type__": "filter_by_condition", "values": {"reference": "[]"}, "condition": "eq"},
+                        {"__type__": "rename",
+                         "field_to_field": {
+                             "model_input": "question",
+                             "score": "rating",
+                             "category": "group",
+                             "model_output": "answer"
+                         }},
+                        {"__type__": "literal_eval", "field": "question"},
+                        {"__type__": "copy", "field": "question/0", "to_field": "question"},
+                        {"__type__": "literal_eval", "field": "answer"},
+                        {"__type__": "copy", "field": "answer/0", "to_field": "answer"}
+                    ]
+                }
+            },
+            "task": {"ref": "response_assessment.rating.single_turn"},
+            "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
+            "metrics": [{"ref": "llmaaj_metric"}]
         }
     ],
```
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/constants.py (1)
30-67: Reduce risk of escape errors by using dicts + json.dumps for long JSON payloads

The long JSON-in-string pattern is fragile (it already caused one bug). Consider storing Python dicts and serializing with `json.dumps(indent=2)` where a string is required by the downstream API.

If switching now is too invasive, add a lightweight import-time check:

```diff
+import json
+
 # ... inside module, after defining constants:
+# Sanity-check embedded JSON strings parse correctly.
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["templates"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["tasks"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["metrics"][0]["value"])
```
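The dict-first pattern can be illustrated in isolation. The template content below is an abbreviated stand-in, not a full copy of the constants file:

```python
import json

# Author the payload as a Python dict; serialize only at the boundary
# where the downstream API requires a string. (Abbreviated example.)
MT_BENCH_TEMPLATE = {
    "__type__": "input_output_template",
    "output_format": "[[{rating}]]",
    "postprocessors": ["processors.extract_mt_bench_rating_judgment"],
}

# Serialization is mechanical, so quoting/escaping bugs cannot creep in:
template_value = json.dumps(MT_BENCH_TEMPLATE, indent=2)

# Round-trips by construction:
assert json.loads(template_value) == MT_BENCH_TEMPLATE
```

Compared with hand-concatenated string literals, any structural mistake here fails at authoring time as a Python syntax error rather than surfacing later as an invalid JSON payload.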
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/model_explainability/lm_eval/constants.py (1 hunks)
🔇 Additional comments (2)
tests/model_explainability/lm_eval/constants.py (2)
1-25: CUSTOM_UNITXT_TASK_DATA structure looks good

Card/template/systemPrompt wiring matches the earlier pattern.
58-66: main_score is a string label; the current value is valid

Unitxt's llm_as_judge expects `main_score` to be the metric's primary score key (default "llm_as_judge"); confirm your template/postprocessor emits the numeric score under "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn".
/cherry-pick 2.24
* feat: add llmaaj initial
* feat: add LLMAAJ
* feat: increase Timeout for LMEval tasks
* fix: task name in metrics
* feat: increase timeout for lmeval task
* fix: change model name
* fix: akctually remove model format
Cherry pick action created PR #609 successfully 🎉! |
Status of building tag latest: success. |
Description
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit