
feat: Add systematic testing of Tier 2 LMEval tasks #478

Merged
adolfo-ab merged 2 commits into opendatahub-io:main from christinaexyou:test-lmeval-tier2-tasks on Aug 7, 2025

Conversation

@christinaexyou
Contributor

Description

This PR adds testing for Tier 2 tasks for LM-Eval. Tier 2 tasks are defined as tasks that are at or above the 70th percentile of downloads on HuggingFace but have fewer than 10,000 downloads.
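Roughly, the two tiers can be expressed with the repo's get_lmeval_tasks helper (a sketch; the import path and exact threshold values are illustrative, not taken verbatim from the diff):

from tests.model_explainability.lm_eval.utils import get_lmeval_tasks

# Tier 1: an int threshold is read as an absolute download count.
TIER1_LMEVAL_TASKS = get_lmeval_tasks(min_downloads=10000)

# Tier 2: a float in [0, 1] is read as a download-count percentile,
# so this selects tasks at or above the 70th percentile that still
# fall below the 10,000-download Tier 1 cutoff.
TIER2_LMEVAL_TASKS = get_lmeval_tasks(min_downloads=0.7, max_downloads=10000)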

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@coderabbitai
Contributor

coderabbitai bot commented Jul 30, 2025

📝 Walkthrough

Summary by CodeRabbit

  • Tests
    • Updated test suite for LMEval tasks by splitting tasks into two tiers based on download counts, with separate test cases for each tier.
  • Refactor
    • Enhanced task selection logic to support both minimum and maximum download thresholds, allowing for more flexible filtering by download count percentiles or absolute values.

Walkthrough

The changes split the LMEval task list into two tiers based on download counts, updating test logic to run each tier separately. The get_lmeval_tasks utility function was enhanced to support both minimum and maximum download thresholds, including percentile-based filtering. Test parameterization and variable naming were adjusted accordingly.

Changes

Cohort / File(s) | Change Summary

  • Test Tiering and Parameterization — tests/model_explainability/lm_eval/test_lm_eval.py
    Split the LMEval task list into TIER1_LMEVAL_TASKS and TIER2_LMEVAL_TASKS based on download thresholds; updated test parameterization to run tests for each tier with distinct model namespace names.
  • Download Filtering Utility — tests/model_explainability/lm_eval/utils.py
    Extended get_lmeval_tasks to accept both min_downloads and max_downloads (int, float, or None), supporting percentile-based filtering; updated filtering logic and validation; adjusted import statements for the new parameter types (see the sketch after this table).
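
A hedged sketch of what the enhanced utility might look like; the CSV source, the "task" column name, and the exact error messages are assumptions pieced together from the snippets quoted in the review threads below, not the merged code:

import pandas as pd

def get_lmeval_tasks(
    min_downloads: int | float,
    max_downloads: int | float | None = None,
) -> list[str]:
    """Return LMEval task names filtered by HuggingFace download counts.

    An int argument is an absolute download count; a float in [0, 1]
    is treated as a download-count percentile.
    """
    lmeval_tasks = pd.read_csv("lmeval_tasks.csv")  # assumed data source
    downloads = lmeval_tasks["HF dataset downloads"]

    # Convert a percentile threshold into an absolute count.
    if isinstance(min_downloads, float):
        if not 0 <= min_downloads <= 1:
            raise ValueError("Percentile value must be between 0 and 1")
        min_downloads = downloads.quantile(min_downloads)
    elif min_downloads <= 0:
        raise ValueError("Minimum downloads must be greater than 0")

    mask = downloads >= min_downloads

    # If max_downloads is provided, also cap the download count.
    if max_downloads is not None:
        if isinstance(max_downloads, float):
            if not 0 <= max_downloads <= 1:
                raise ValueError("Percentile value must be between 0 and 1")
            max_downloads = downloads.quantile(max_downloads)
        elif max_downloads <= 0 or max_downloads > downloads.max():
            raise ValueError("Maximum downloads must be greater than 0 and no larger than the maximum download count")
        mask &= downloads <= max_downloads

    return sorted(lmeval_tasks.loc[mask, "task"].unique())  # "task" column assumed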

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15 minutes

Suggested labels

Verified, size/xxl, Utilities

Suggested reviewers

  • adolfo-ab
  • dbasunag

@github-actions

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove the WIP status, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove approval, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository; the tag is deleted when the PR is merged or closed.
Supported labels

{'/verified', '/wip', '/lgtm', '/hold', '/build-push-pr-image', '/cherry-pick'}

@christinaexyou force-pushed the test-lmeval-tier2-tasks branch from 326da02 to 90dc638 on July 31, 2025 12:37
@christinaexyou marked this pull request as ready for review on July 31, 2025 12:37
@christinaexyou requested a review from a team as a code owner on July 31, 2025 12:37
adolfo-ab previously approved these changes Jul 31, 2025
Contributor

coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/utils.py (1)

80-80: Update logging message to reflect complete filtering criteria.

The log message only mentions min_downloads but the filtering now also considers max_downloads when provided.

Apply this diff to improve the logging:

-    LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
+    if max_downloads is not None:
+        LOGGER.info(f"Number of unique LMEval tasks with {min_downloads}-{max_downloads} downloads: {len(unique_tasks)}")
+    else:
+        LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cd3acaa and 2a875a7.

📒 Files selected for processing (2)
  • tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
  • tests/model_explainability/lm_eval/utils.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/model_explainability/lm_eval/test_lm_eval.py (1)
tests/model_explainability/lm_eval/utils.py (1)
  • get_lmeval_tasks (39-82)
🔇 Additional comments (5)
tests/model_explainability/lm_eval/utils.py (3)

1-1: LGTM!

The addition of Union to the typing imports is necessary for the enhanced function signature.


39-49: Well-designed function signature enhancement.

The updated signature appropriately uses Union types to support both absolute download counts and percentile-based filtering. The documentation clearly explains the dual functionality of the parameters.


58-64: Clear and accurate filtering logic.

The updated comments accurately describe the enhanced filtering criteria, maintaining both download-based and OpenLLM leaderboard-based selection.

tests/model_explainability/lm_eval/test_lm_eval.py (2)

12-12: LGTM!

TIER1_LMEVAL_TASKS correctly maintains the original filtering behavior for high-download tasks.


20-27: Well-structured test parameterization for tiered testing.

The test parameters are properly structured to support separate execution of tier 1 and tier 2 tasks, with clear naming conventions for the namespaces.
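
As a rough illustration of the shape being described here (fixture wiring, namespace strings, and the test body are hypothetical, not the file's actual contents):

import pytest
from tests.model_explainability.lm_eval.utils import get_lmeval_tasks

# Hypothetical tier definitions; threshold values are illustrative.
TIER1_LMEVAL_TASKS = get_lmeval_tasks(min_downloads=10000)
TIER2_LMEVAL_TASKS = get_lmeval_tasks(min_downloads=0.7, max_downloads=10000)

@pytest.mark.parametrize(
    "model_namespace, tasks",
    [
        pytest.param("lmeval-tier1", TIER1_LMEVAL_TASKS, id="tier1"),
        pytest.param("lmeval-tier2", TIER2_LMEVAL_TASKS, id="tier2"),
    ],
)
def test_lmeval_tasks(model_namespace, tasks):
    ...  # the real test runs each tier's tasks against a model in its own namespace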

@christinaexyou force-pushed the test-lmeval-tier2-tasks branch from 2a875a7 to d660729 on August 1, 2025 14:12


-def get_lmeval_tasks(min_downloads: int = 10000) -> List[str]:
+def get_lmeval_tasks(min_downloads: Union[int, float], max_downloads: Union[int, float, None] = None) -> List[str]:
Collaborator


Please use |. We are using Python 3.13, so we don't need Union; | is the recommended style now.
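
That is, the suggested signature would read (a one-line sketch of the | style; the return annotation is assumed to follow suit):

def get_lmeval_tasks(min_downloads: int | float, max_downloads: int | float | None = None) -> list[str]: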

# if max_downloads is provided, filter for tasks that have less than
# or equal to the maximum number of downloads
if max_downloads is not None:
if max_downloads <= 0 | max_downloads > max(lmeval_tasks["HF dataset downloads"]):
Collaborator


Please use "or" instead of | here.
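
Applied, the condition becomes (this exact line appears in the updated code quoted later in the thread):

if max_downloads <= 0 or max_downloads > max(lmeval_tasks["HF dataset downloads"]):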

Contributor

coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad3d3d and 223160c.

📒 Files selected for processing (2)
  • tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
  • tests/model_explainability/lm_eval/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (4)
tests/model_explainability/lm_eval/utils.py (4)

39-39: LGTM!

The function signature correctly uses modern Python union syntax and appropriately allows both integer and float types for download thresholds.


44-45: LGTM!

The parameter documentation clearly explains the dual purpose of the parameters (download counts vs percentiles).


50-58: LGTM!

The validation logic has been properly improved:

  • Changed minimum threshold from "< 1" to "<= 0" which is more logical
  • Added proper percentile validation ensuring values are between 0 and 1
  • Validation order is correct (after CSV load, before conversion)

60-61: LGTM!

The updated comments clearly explain the filtering logic and criteria.

adolfo-ab previously approved these changes Aug 5, 2025
if max_downloads <= 0 or max_downloads > max(lmeval_tasks["HF dataset downloads"]):
raise ValueError("Maximum downloads must be greater than 0 and less than the maximum number of downloads")
if isinstance(max_downloads, float):
if not 0 <= min_downloads <= 1:
Collaborator


Is this a typo?
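
(The branch checks isinstance(max_downloads, float) but then validates min_downloads. The later review round notes the variable usage was corrected, so the intended check was presumably — a sketch, error message assumed:)

if isinstance(max_downloads, float):
    if not 0 <= max_downloads <= 1:
        raise ValueError("Percentile value must be between 0 and 1")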

Contributor

coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/utils.py (1)

84-84: Consider enhancing the logging message to reflect both filtering parameters.

The logging message only mentions min_downloads but doesn't indicate when max_downloads filtering is also applied, which could be helpful for debugging tiered filtering.

Consider updating the logging message to be more informative:

-    LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
+    if max_downloads is not None:
+        LOGGER.info(f"Number of unique LMEval tasks with {min_downloads} to {max_downloads} downloads: {len(unique_tasks)}")
+    else:
+        LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 223160c and 2e16d7d.

📒 Files selected for processing (2)
  • tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
  • tests/model_explainability/lm_eval/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (4)
tests/model_explainability/lm_eval/utils.py (4)

39-39: LGTM: Function signature enhancement is well-designed.

The addition of the optional max_downloads parameter with proper type annotation and default value supports the tiered filtering requirements effectively.


50-58: LGTM: Parameter validation is now robust.

The validation logic correctly handles both integer download counts and percentile values, with appropriate error messages and proper ordering of checks.


68-77: LGTM: Maximum downloads filtering is correctly implemented.

The validation and filtering logic for max_downloads properly addresses all previous review concerns, including correct variable usage, logical operators, and percentile validation.


44-45: LGTM: Documentation accurately reflects enhanced functionality.

The updated docstring clearly describes the dual nature of both parameters as supporting either absolute counts or percentiles.

@adolfo-ab merged commit 78118b1 into opendatahub-io:main on Aug 7, 2025
11 checks passed
@github-actions

github-actions bot commented Aug 7, 2025

Status of building tag latest: success.
Status of pushing tag latest to image registry: success.
