change: swap model and group tasks in LMEval HF tests #394

adolfo-ab merged 1 commit into opendatahub-io:main
Conversation
Summary by CodeRabbit
Walkthrough

The changes update the test configuration and parameterization for language model evaluation. The model identifier in a pytest fixture is switched to a different pretrained model, and the test for HuggingFace models is refactored to consolidate multiple task-specific parameters into a single parameter containing a list of tasks.
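The consolidation described above can be sketched as follows; the task list matches the names discussed in this PR, but the variable names and case structure are illustrative, not the repository's actual identifiers.

```python
# Hypothetical before/after of the test parameterization. Previously each
# task was its own parameterized case, so each one paid the namespace and
# LMEvalJob setup cost; afterwards a single case carries the whole list.
POPULAR_TASKS = ["arc_challenge", "mmlu_astronomy", "hellaswag", "truthfulqa", "winogrande"]

# Before: one parameterized case per task -> five setup/teardown cycles.
before_cases = [[task] for task in POPULAR_TASKS]

# After: one case holding all tasks -> a single setup/teardown cycle.
after_cases = [POPULAR_TASKS]

print(len(before_cases), "cases before,", len(after_cases), "case after")
```

The per-case setup cost stays roughly constant, so collapsing five cases into one is where most of the runtime saving comes from.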
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/test_lm_eval.py (1)
51-54: Consider the debugging implications of test consolidation.

While the consolidation reduces overhead effectively, running all tasks in a single test may make it harder to identify which specific task fails when debugging issues. The trade-off seems reasonable for integration testing, but consider this impact for future maintenance.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/model_explainability/lm_eval/conftest.py (1 hunks)
tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
🧰 Additional context used
🧠 Learnings
📓 Common learnings
Learnt from: dbasunag
PR: opendatahub-io/opendatahub-tests#338
File: tests/model_registry/rbac/test_mr_rbac.py:24-53
Timestamp: 2025-06-06T12:22:57.057Z
Learning: In the opendatahub-tests repository, prefer keeping test parameterization configurations inline rather than extracting them to separate variables/constants, as it makes triaging easier by avoiding the need to jump between different parts of the file to understand the test setup.
🔇 Additional comments (2)
tests/model_explainability/lm_eval/conftest.py (1)
38-38: Model availability and suitability confirmed

- The rgeada/tiny-untrained-granite model returns HTTP 200 on HuggingFace, has a "text-generation" pipeline tag, and the expected transformer format.
- With only 17 downloads and an untrained, lightweight profile, it's ideal for integration and smoke testing without impacting test runtime.

No further action required.
tests/model_explainability/lm_eval/test_lm_eval.py (1)
13-15: Ensure LMEval task name validity

The consolidation in tests/model_explainability/lm_eval/test_lm_eval.py:13-15 reduces overhead and keeps parameterization inline, but the shortened task names may not be recognized by the LMEval system. Please verify that the following task identifiers are valid:
- arc_challenge
- mmlu_astronomy
- hellaswag
- truthfulqa
- winogrande
You can confirm this by checking the LMEval task registry (e.g., via the LMEval CLI or by inspecting its source of truth) to ensure these names are supported.
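One way to express such a check without depending on a particular lm_eval API is to diff the candidate names against whatever registry listing is available. In this sketch the registry contents are placeholders, not the real lm_eval task list.

```python
# Hypothetical sketch: report which candidate task names are absent from a
# registry listing. The registry set here is a stand-in; in practice it
# would come from the installed lm_eval task registry.
CANDIDATE_TASKS = {"arc_challenge", "mmlu_astronomy", "hellaswag", "truthfulqa", "winogrande"}

def missing_tasks(candidates, registry):
    """Return candidate task names that are not present in the registry."""
    return candidates - registry

# Stand-in registry where only a suffixed truthfulqa variant is registered.
registry = {"arc_challenge", "mmlu_astronomy", "hellaswag", "winogrande", "truthfulqa_mc1"}
print(sorted(missing_tasks(CANDIDATE_TASKS, registry)))  # ['truthfulqa']
```

The stand-in registry illustrates exactly the failure mode the review worries about: a base name like "truthfulqa" being absent when only a suffixed variant is registered.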
Actionable comments posted: 2
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/test_lm_eval.py (1)
51-54: Consider impact on test failure isolation.

While the consolidation improves efficiency, running all popular tasks in a single test case means that if one task fails, it may be harder to identify which specific task caused the failure. The current test design will show success/failure for the entire group.
Consider adding more granular logging or error handling within the test to help isolate task-specific issues if they occur. This could be achieved by:
- Adding task-specific assertions or validation steps
- Including task-level logging in the test output
- Or keeping this consolidated approach but ensuring the LMEval framework provides adequate task-level error reporting
This is a minor concern given the significant performance benefit achieved.
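A hedged sketch of the suggested task-level isolation; the results shape and helper name here are hypothetical, not the LMEval framework's actual output format.

```python
# Hypothetical helper: scan per-task results from a consolidated run and
# collect failure messages, so one bad task doesn't obscure which task broke.
def collect_task_failures(results):
    """Return human-readable failure messages, one per failing task."""
    failures = []
    for task, metrics in results.items():
        if not metrics or metrics.get("error"):
            failures.append(f"{task}: {metrics.get('error', 'no metrics returned')}")
    return failures

# Example: one task succeeded, one reported an error.
results = {
    "arc_challenge": {"acc": 0.21},
    "truthfulqa": {"error": "task not found"},
}
print(collect_task_failures(results))  # ['truthfulqa: task not found']
```

Asserting on the collected list (rather than on the first bad metric encountered) keeps the consolidated test but makes the failure message name the offending task(s).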
🔇 Additional comments (1)
tests/model_explainability/lm_eval/test_lm_eval.py (1)
13-16: Please confirm validity of simplified LMEval task names

- Excellent consolidation of individual tests into a single "popular_tasks" run.
- To ensure this change won't break the LMEval integration, please manually verify that the new task names exactly match those in the external lm_eval framework (i.e., that "mmlu_astronomy" and "truthfulqa" are registered tasks and that dropping the _generative suffix has no unintended side effects). For example, you can run a quick Python check:

    from lm_eval import tasks
    print([name for name in tasks.registry if name in ("mmlu_astronomy", "truthfulqa")])
/verified

Status of building tag latest: success.
Change the model used in LMEval HuggingFace tasks, and group all the popular tasks in a single test.
Description
This collection of tests (LMEval HF) previously took ~25 min to run. Most of this time was spent setting up the namespace and creating and deleting the LMEvalJob CR.
To improve this, this PR does two things:

- Changes the model used in the LMEval HuggingFace tasks.
- Groups all the popular tasks into a single test.

With these changes, the total test time goes from ~25 min to ~5 min.
How Has This Been Tested?
Running on PSI
Merge criteria: