
[AI Explainability] Add LLM as a Judge test in LMEval #584

Merged
sheltoncyril merged 9 commits into opendatahub-io:main from sheltoncyril:LLMaaJ-AI-Exp
Sep 15, 2025

Conversation

@sheltoncyril
Contributor

@sheltoncyril sheltoncyril commented Sep 4, 2025

Description

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • Tests
    • Added two reusable evaluation task presets (a simple HF custom task and a more advanced LLM-as-judge benchmarking config) and expanded parameterized test coverage to include both flows.
  • Chores
    • Increased job wait timeouts (using duration parsing) to improve stability and reduce flakiness during evaluation runs.

@coderabbitai
Contributor

coderabbitai bot commented Sep 4, 2025

📝 Walkthrough

Walkthrough

Adds two LM Eval task configuration constants (CUSTOM_UNITXT_TASK_DATA, LLMAAJ_TASK_DATA), updates tests to import and use them (including a new llmaaj_task case), and changes test utility timeouts to use tts("5m") and tts("1h") for RUNNING and SUCCEEDED pod waits respectively.

Changes

Cohort / File(s) Summary of changes
LM Eval task constants
tests/model_explainability/lm_eval/constants.py
Adds CUSTOM_UNITXT_TASK_DATA (HF-based custom task with a system prompt and single template linking to a cards.wnli recipe) and LLMAAJ_TASK_DATA (LLM-as-judge configuration: templates, tasks, metrics, and a taskRecipe using an HF inference engine and preprocessing steps).
Tests using new constants
tests/model_explainability/lm_eval/test_lm_eval.py
Imports CUSTOM_UNITXT_TASK_DATA and LLMAAJ_TASK_DATA; replaces an inlined custom-task payload with CUSTOM_UNITXT_TASK_DATA; adds a new parameterized test entry using LLMAAJ_TASK_DATA with id llmaaj_task.
Utility timeout adjustment
tests/model_explainability/lm_eval/utils.py
Adds from pyhelper_utils.general import tts and replaces Timeout.TIMEOUT_5MIN with tts("5m") for RUNNING and Timeout.TIMEOUT_20MIN with tts("1h") for SUCCEEDED in validate_lmeval_job_pod_and_logs; surrounding logic and log validation unchanged.
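For context on the timeout change, tts parses a duration string like "5m" or "1h" into seconds. A minimal sketch of that behavior (a local illustration, not the pyhelper_utils source):

```python
# Minimal sketch of a duration-string parser like pyhelper_utils' tts().
# Hypothetical reimplementation for illustration only.
def tts(duration: str) -> int:
    """Convert a string like '5m' or '1h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value, unit = duration[:-1], duration[-1]
    return int(value) * units[unit]

assert tts("5m") == 300   # RUNNING wait
assert tts("1h") == 3600  # SUCCEEDED wait
```

Expressing timeouts as strings keeps the intent readable at the call site while the wait API still receives plain seconds.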

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks (3 passed)

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "[AI Explainability] Add LLM as a Judge test in LMEval" accurately and concisely summarizes the primary change in the diff — adding an LLM-as-a-judge test configuration and related test data/parameterization for LMEval — and is directly tied to the modified files. It is specific, readable, and appropriate for a teammate scanning PR history.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.



@sheltoncyril
Contributor Author

/verified

@github-actions

github-actions bot commented Sep 4, 2025

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, comment /wip on the PR; to remove the WIP status, comment /wip cancel.
  • To block merging of a PR, comment /hold; to un-block merging, comment /hold cancel.
  • To mark a PR as approved, comment /lgtm; to remove approval, comment /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name>. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, comment /build-push-pr-image. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/wip', '/build-push-pr-image', '/cherry-pick', '/verified', '/lgtm', '/hold'}

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (6)
tests/model_explainability/lm_eval/utils.py (2)

105-105: 40-min SUCCEEDED wait: make configurable and align with CI timeouts

Bumping to 40 minutes can hide hangs and exceed CI job-level timeouts. Make this per-call configurable and keep the default here, while letting the slow LLMAAJ test pass a larger value. Also, align status access style and avoid mixing lmevaljob_pod.Status and Pod.Status.

-def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod) -> None:
+def validate_lmeval_job_pod_and_logs(lmevaljob_pod: Pod, succeeded_timeout: int = Timeout.TIMEOUT_20MIN) -> None:
@@
-    lmevaljob_pod.wait_for_status(status=lmevaljob_pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
+    lmevaljob_pod.wait_for_status(status=Pod.Status.RUNNING, timeout=Timeout.TIMEOUT_5MIN)
     try:
-        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=Timeout.TIMEOUT_40MIN)
+        lmevaljob_pod.wait_for_status(status=Pod.Status.SUCCEEDED, timeout=succeeded_timeout)

Action: In the LLMAAJ test, pass succeeded_timeout=Timeout.TIMEOUT_40MIN.


18-26: Docstring default mismatch

Docstring says default is TIMEOUT_2MIN, but function uses TIMEOUT_10MIN.

-        timeout: How long to wait for the pod, defaults to TIMEOUT_2MIN
+        timeout: How long to wait for the pod, defaults to TIMEOUT_10MIN
tests/model_explainability/lm_eval/constants.py (3)

61-62: HF engine fp16 may fail on CPU-only runners

"use_fp16": true can error without CUDA. If CI nodes lack GPUs, gate this via config or provide CPU-safe fallback.

Example:

-        "use_fp16": true
+        "use_fp16": true,
+        "device": "auto"

(Adjust to whatever your inference engine supports.)


80-85: Filtering on string literal "[]": brittle

The dataset’s reference appears to be a serialized list. If format changes, this filter breaks. Prefer literal_eval on reference first, then compare to an empty list.

-        {
-            "__type__": "filter_by_condition",
-            "values": {
-                "reference": "[]"
-            },
-            "condition": "eq"
-        },
+        {
+            "__type__": "literal_eval",
+            "field": "reference"
+        },
+        {
+            "__type__": "filter_by_condition",
+            "values": {
+                "reference": []
+            },
+            "condition": "eq"
+        },

(Assuming filter_by_condition supports non-string values.)


41-44: Limit JSON parsing to template literals and simplify with raw strings
The validation script currently attempts to parse every "value" field as JSON—even plain instruction strings—causing false errors (e.g. Be concise…). Restrict JSON validation to entries whose value begins with { (the actual templates), and consider replacing those concatenated, heavily-escaped literals with a raw triple-quoted string for clarity and easier review.

tests/model_explainability/lm_eval/test_lm_eval.py (1)

35-39: Mark LLMAAJ run as slow and pass extended timeout explicitly

LLM-as-a-judge is heavy. Mark this param as slow (or behind an env flag) and, after making validate_lmeval_job_pod_and_logs configurable, pass a longer timeout only for this case.

-        pytest.param(
+        pytest.param(
             {"name": "test-lmeval-hf-llmaaj"},
             lmeval_hf_llmaaj_task_data,
-            id="llmaaj_task",
+            id="llmaaj_task",
+            marks=pytest.mark.slow,
         ),

And in the test body:

# Pseudocode
if model_namespace["name"] == "test-lmeval-hf-llmaaj":
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod, succeeded_timeout=Timeout.TIMEOUT_40MIN)
else:
    validate_lmeval_job_pod_and_logs(lmevaljob_hf_pod)

Alternatively, gate with pytest.mark.skipif(not os.getenv("RUN_LLMAAJ"), reason="...").

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dcff182 and bb4fbe4.

📒 Files selected for processing (3)
  • tests/model_explainability/lm_eval/constants.py (1 hunks)
  • tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
  • tests/model_explainability/lm_eval/utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/model_explainability/lm_eval/utils.py (1)
utilities/constants.py (1)
  • Timeout (213-224)
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (1)

1-25: Good extraction of custom task payload

Moving the inline payload into a constant improves reuse and readability.

tests/model_explainability/lm_eval/test_lm_eval.py (2)

4-4: Consolidated imports for reusable task payloads

Nice reuse; keeps the test lean.


31-34: Using constant for custom task payload

Good cleanup replacing the inlined dict with lmeval_hf_custom_task_data.
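For readers unfamiliar with the pattern these comments describe, the parameterized reuse of the constants looks roughly like this (identifiers and payload shapes are illustrative stand-ins, not the repository's actual code):

```python
import pytest

# Illustrative stand-ins for the payloads defined in
# tests/model_explainability/lm_eval/constants.py.
CUSTOM_UNITXT_TASK_DATA = {"taskList": "custom-unitxt-payload"}
LLMAAJ_TASK_DATA = {"taskList": "llmaaj-payload"}

@pytest.mark.parametrize(
    "model_namespace, task_data",
    [
        pytest.param({"name": "test-lmeval-hf-custom"}, CUSTOM_UNITXT_TASK_DATA, id="custom_unitxt_task"),
        pytest.param({"name": "test-lmeval-hf-llmaaj"}, LLMAAJ_TASK_DATA, id="llmaaj_task"),
    ],
)
def test_lmeval_hf_flow(model_namespace, task_data):
    # In the real test this would create an LMEvalJob and validate its pod and logs.
    assert model_namespace["name"].startswith("test-lmeval")
```

Moving the payloads out of the test body means adding a new evaluation flow is a one-line pytest.param entry.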

dbasunag previously approved these changes Sep 4, 2025
@rhods-ci-bot added the Verified (verified PR in Jenkins) and lgtm-by-dbasunag labels Sep 4, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/model_explainability/lm_eval/constants.py (1)

27-31: Constant name: align with project abbreviation (“LLMAJ”).

Prior comment suggested LLMAJ_TASK_DATA (single A). Please standardize to avoid import mismatches.

-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {

If other files already import LLMAAJ_TASK_DATA, update those imports too.

🧹 Nitpick comments (3)
tests/model_explainability/lm_eval/constants.py (3)

10-19: Reduce quoting/escaping risk by using triple-quoted strings.

The JSON payload is split across many concatenated literals; easy to introduce subtle quote/escape bugs. Prefer a single triple-quoted string.

Apply:

-                    "value": '{ "__type__": "input_output_template", '
-                    '"input_format": "{text_a_type}: {text_a}\\n'
-                    '{text_b_type}: {text_b}", '
-                    '"output_format": "{label}", '
-                    '"target_prefix": '
-                    '"The {type_of_relation} class is ", '
-                    '"instruction": "Given a {text_a_type} and {text_b_type} '
-                    'classify the {type_of_relation} of the {text_b_type} to one of {classes}.",'
-                    ' "postprocessors": [ "processors.take_first_non_empty_line",'
-                    ' "processors.lower_case_till_punc" ] }',
+                    "value": """{
+                        "__type__": "input_output_template",
+                        "input_format": "{text_a_type}: {text_a}\n{text_b_type}: {text_b}",
+                        "output_format": "{label}",
+                        "target_prefix": "The {type_of_relation} class is ",
+                        "instruction": "Given a {text_a_type} and {text_b_type} classify the {type_of_relation} of the {text_b_type} to one of {classes}.",
+                        "postprocessors": ["processors.take_first_non_empty_line", "processors.lower_case_till_punc"]
+                    }""",

33-45: Template JSON readability and quoting.

Same quoting concerns as above; also the string contains apostrophes (Assistant's) and many escapes. Convert to a triple-quoted block.

-                    "value": '{\n    "__type__": "input_output_template",\n    "instruction":'
-                    ' "Please act as an impartial judge and evaluate the quality of the '
-                    "response provided by an AI assistant to the user question displayed below."
-                    " Your evaluation should consider factors such as the helpfulness, relevance,"
-                    " accuracy, depth, creativity, and level of detail of the response. Begin your"
-                    " evaluation by providing a short explanation. Be as objective as possible. "
-                    "After providing your explanation, you must rate the response on a scale of 1 to 10"
-                    ' by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: '
-                    '[[5]]\\".\\n\\n",\n    "input_format": "[Question]\\n{question}\\n\\n[The Start '
-                    "of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]\",\n    "
-                    '"output_format": "[[{rating}]]",\n    "postprocessors": [\n        '
-                    '"processors.extract_mt_bench_rating_judgment"\n    ]\n}\n',
+                    "value": """{
+                        "__type__": "input_output_template",
+                        "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \\"[[rating]]\\", for example: \\"Rating: [[5]]\\".",
+                        "input_format": "[Question]\\n{question}\\n\\n[The Start of Assistant's Answer]\\n{answer}\\n[The End of Assistant's Answer]",
+                        "output_format": "[[{rating}]]",
+                        "postprocessors": ["processors.extract_mt_bench_rating_judgment"]
+                    }""",

74-103: Optional: simplify the embedded task_card JSON with triple quotes.

This long single-line JSON is hard to diff/maintain.

-                    "custom": '{\n    "__type__": "task_card",\n    "loader": {...}\n}\n',
+                    "custom": """{
+                        "__type__": "task_card",
+                        "loader": {
+                            "__type__": "load_hf",
+                            "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                            "split": "train"
+                        },
+                        "preprocess_steps": [ ... ],
+                        "task": "tasks.response_assessment.rating.single_turn",
+                        "templates": ["templates.response_assessment.rating.mt_bench_single_turn"]
+                    }""",
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0a0ac0 and 868fe9a.

📒 Files selected for processing (2)
  • tests/model_explainability/lm_eval/constants.py (1 hunks)
  • tests/model_explainability/lm_eval/test_lm_eval.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (3)
tests/model_explainability/lm_eval/constants.py (3)

1-25: Good centralization of Unitxt custom task constants.

The structure for CUSTOM_UNITXT_TASK_DATA looks coherent and keeps test params in one place.


61-66: Model/format mismatch risk (Granite model with Mistral format).

You’re formatting prompts as Mistral but pointing to a Granite-ish model. Confirm the model actually expects Mistral-style instruction formatting or swap to a tiny Mistral/LLama-compatible model.

Would you like me to propose a tiny HF model and matching format that works offline-friendly?


56-66: Normalize metric config refs and main_score

  • Local validation: JSON payloads in the "value" fields are syntactically valid (your validation script printed "OK").
  • Issue: "template" uses "templates.*" while "task" is a non-fully-qualified string. Use fully-qualified refs consistently (e.g. "tasks.response_assessment.rating.single_turn" or {"ref": "tasks.response_assessment.rating.single_turn"}).
  • Replace the Mistral-specific main_score ("mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn") with a generic metric name used across configs (e.g. "spearman").

Location: tests/model_explainability/lm_eval/constants.py:56-66

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/model_explainability/lm_eval/constants.py (2)

27-27: Rename constant to LLMAJ_TASK_DATA (single ‘A’) for clarity and consistency.

Matches “LLM as a Judge” abbreviation and prior review suggestion.

-LLMAAJ_TASK_DATA = {
+LLMAJ_TASK_DATA = {

Follow up: update imports/uses in tests accordingly.

Run to find references:

#!/bin/bash
rg -nP 'LLMAAJ_TASK_DATA|LLMAJ_TASK_DATA' -g '!**/site-packages/**'

69-106: Move template/metrics (and task) out of card; make card.custom an object, not a JSON string.

Recipe schema expects template/metrics as siblings of card (and task too). Keeping them inside card will be ignored or error. Also set card.custom to a dict instead of a quoted JSON blob to avoid escaping errors.

         "taskRecipes": [
             {
-                "card": {
-                    "custom": '{\n    "__type__": "task_card",\n    "loader": '
-                    '{\n        "__type__": "load_hf",\n        '
-                    '"path": "OfirArviv/mt_bench_single_score_gpt4_judgement",\n        '
-                    '"split": "train"\n    },\n    "preprocess_steps": [\n        '
-                    '{\n            "__type__": "rename_splits",\n            '
-                    '"mapper": {\n                "train": "test"\n            }\n        },\n        '
-                    '{\n            "__type__": "filter_by_condition",\n            '
-                    '"values": {\n                "turn": 1\n            },\n            '
-                    '"condition": "eq"\n        },\n        {\n            '
-                    '"__type__": "filter_by_condition",\n            '
-                    '"values": {\n                "reference": "[]"\n            },\n            '
-                    '"condition": "eq"\n        },\n        {\n            '
-                    '"__type__": "rename",\n            "field_to_field": {\n                '
-                    '"model_input": "question",\n                '
-                    '"score": "rating",\n                '
-                    '"category": "group",\n                '
-                    '"model_output": "answer"\n            }\n        },\n        '
-                    '{\n            "__type__": "literal_eval",\n            '
-                    '"field": "question"\n        },\n        '
-                    '{\n            "__type__": "copy",\n            '
-                    '"field": "question/0",\n            '
-                    '"to_field": "question"\n        },\n        '
-                    '{\n            "__type__": "literal_eval",\n            '
-                    '"field": "answer"\n        },\n        {\n            '
-                    '"__type__": "copy",\n            '
-                    '"field": "answer/0",\n            '
-                    '"to_field": "answer"\n        }\n    ],\n    '
-                    '"task": "tasks.response_assessment.rating.single_turn",\n    '
-                    '"templates": [\n        '
-                    '"templates.response_assessment.rating.mt_bench_single_turn"\n    ]\n}\n',
-                    "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
-                    "metrics": [{"ref": "llmaaj_metric"}],
-                }
+                "card": {
+                    "custom": {
+                        "__type__": "task_card",
+                        "loader": {
+                            "__type__": "load_hf",
+                            "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
+                            "split": "train"
+                        },
+                        "preprocess_steps": [
+                            {"__type__": "rename_splits", "mapper": {"train": "test"}},
+                            {"__type__": "filter_by_condition", "values": {"turn": 1}, "condition": "eq"},
+                            {"__type__": "filter_by_condition", "values": {"reference": "[]"}, "condition": "eq"},
+                            {"__type__": "rename", "field_to_field": {
+                                "model_input": "question",
+                                "score": "rating",
+                                "category": "group",
+                                "model_output": "answer"
+                            }},
+                            {"__type__": "literal_eval", "field": "question"},
+                            {"__type__": "copy", "field": "question/0", "to_field": "question"},
+                            {"__type__": "literal_eval", "field": "answer"},
+                            {"__type__": "copy", "field": "answer/0", "to_field": "answer"}
+                        ]
+                    }
+                },
+                "task": {"ref": "response_assessment.rating.single_turn"},
+                "template": {"ref": "response_assessment.rating.mt_bench_single_turn"},
+                "metrics": [{"ref": "llmaaj_metric"}]
             }
         ],
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/constants.py (1)

30-67: Reduce risk of escape errors by using dicts + json.dumps for long JSON payloads.

The long JSON-in-string pattern is fragile (already caused one bug). Consider storing Python dicts and serializing with json.dumps(indent=2) where a string is required by the downstream API.

If switching now is too invasive, add a lightweight import-time check:

+import json
+
 # ... inside module, after defining constants:
+# Sanity-check embedded JSON strings parse correctly.
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["templates"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["tasks"][0]["value"])
+json.loads(LLMAJ_TASK_DATA["task_list"]["custom"]["metrics"][0]["value"])
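The dict-plus-json.dumps approach suggested above might look like this (field names abbreviated for illustration, not copied from the actual constants file):

```python
import json

# Author the template as a Python dict; serialize only at the boundary
# where the downstream API requires a string value.
MT_BENCH_TEMPLATE = {
    "__type__": "input_output_template",
    "output_format": "[[{rating}]]",
    "postprocessors": ["processors.extract_mt_bench_rating_judgment"],
}

template_value = json.dumps(MT_BENCH_TEMPLATE, indent=2)

# Round-trips cleanly, so quoting/escaping bugs cannot creep in.
assert json.loads(template_value) == MT_BENCH_TEMPLATE
```

Unlike concatenated string literals, a dict fails loudly at authoring time (a Python syntax error) rather than silently producing malformed JSON.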
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 868fe9a and fb38387.

📒 Files selected for processing (1)
  • tests/model_explainability/lm_eval/constants.py (1 hunks)
🔇 Additional comments (2)
tests/model_explainability/lm_eval/constants.py (2)

1-25: CUSTOM_UNITXT_TASK_DATA structure looks good.

Card/template/systemPrompt wiring matches the earlier pattern.


58-66: main_score is a string label — current value is valid.
Unitxt's llm_as_judge expects main_score to be the metric's primary score key (default "llm_as_judge"); confirm your template/postprocessor emits the numeric score under "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn".

@sheltoncyril sheltoncyril enabled auto-merge (squash) September 15, 2025 12:27
@sheltoncyril sheltoncyril merged commit 50ab82e into opendatahub-io:main Sep 15, 2025
8 checks passed
@sheltoncyril
Contributor Author

/cherry-pick 2.24

rhods-ci-bot pushed a commit that referenced this pull request Sep 15, 2025
* feat: add llmaaj initial

* feat: add LLMAAJ

* feat: increase Timeout for LMEval tasks

* fix: task name in metrics

* feat: increase timeout for lmeval task

* fix: change model name

* fix: akctually remove model format
@rhods-ci-bot
Contributor

Cherry pick action created PR #609 successfully 🎉!
See: https://github.com/opendatahub-io/opendatahub-tests/actions/runs/17733318337

@github-actions

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.

sheltoncyril added a commit that referenced this pull request Sep 15, 2025
* feat: add llmaaj initial

* feat: add LLMAAJ

* feat: increase Timeout for LMEval tasks

* fix: task name in metrics

* feat: increase timeout for lmeval task

* fix: change model name

* fix: akctually remove model format

Co-authored-by: Shelton Cyril <sheltoncyril@gmail.com>
mwaykole pushed a commit to mwaykole/opendatahub-tests that referenced this pull request Jan 23, 2026
…#584)

* feat: add llmaaj initial

* feat: add LLMAAJ

* feat: increase Timeout for LMEval tasks

* fix: task name in metrics

* feat: increase timeout for lmeval task

* fix: change model name

* fix: akctually remove model format