feat: Add systematic testing of Tier 2 LMEval tasks #478
adolfo-ab merged 2 commits into opendatahub-io:main from
Conversation
📝 Walkthrough
The changes split the LMEval task list into two tiers based on download counts, updating the test logic to run each tier separately.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15 minutes
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/utils.py (1)
80-80: Update logging message to reflect complete filtering criteria.
The log message only mentions `min_downloads`, but the filtering now also considers `max_downloads` when provided. Apply this diff to improve the logging:

```diff
- LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
+ if max_downloads is not None:
+     LOGGER.info(f"Number of unique LMEval tasks with {min_downloads}-{max_downloads} downloads: {len(unique_tasks)}")
+ else:
+     LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
tests/model_explainability/lm_eval/utils.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
tests/model_explainability/lm_eval/test_lm_eval.py (1)
tests/model_explainability/lm_eval/utils.py (1)
get_lmeval_tasks(39-82)
🔇 Additional comments (5)
tests/model_explainability/lm_eval/utils.py (3)
1-1: LGTM! The addition of `Union` to the typing imports is necessary for the enhanced function signature.
39-49: Well-designed function signature enhancement. The updated signature appropriately uses Union types to support both absolute download counts and percentile-based filtering. The documentation clearly explains the dual functionality of the parameters.
58-64: Clear and accurate filtering logic. The updated comments accurately describe the enhanced filtering criteria, maintaining both download-based and OpenLLM leaderboard-based selection.
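The selection criteria described in the comments above can be sketched with a toy task table. The field names and the exact combination of the two criteria (shown here as an OR) are assumptions for illustration, not verified against the repository's CSV:

```python
# Toy task metadata; the real code reads a CSV of HF dataset statistics.
tasks = [
    {"name": "arc_challenge", "downloads": 250_000, "openllm_leaderboard": True},
    {"name": "rare_task", "downloads": 500, "openllm_leaderboard": False},
    {"name": "gsm8k", "downloads": 80_000, "openllm_leaderboard": True},
]

min_downloads = 10_000

# Keep tasks that clear the download threshold or sit on the OpenLLM
# leaderboard (the combination is an assumption for this sketch).
selected = [
    t["name"]
    for t in tasks
    if t["downloads"] > min_downloads or t["openllm_leaderboard"]
]
print(selected)  # ['arc_challenge', 'gsm8k']
```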
tests/model_explainability/lm_eval/test_lm_eval.py (2)
12-12: LGTM! TIER1_LMEVAL_TASKS correctly maintains the original filtering behavior for high-download tasks.
20-27: Well-structured test parameterization for tiered testing. The test parameters are properly structured to support separate execution of tier 1 and tier 2 tasks, with clear naming conventions for the namespaces.
```diff
- def get_lmeval_tasks(min_downloads: int = 10000) -> List[str]:
+ def get_lmeval_tasks(min_downloads: Union[int, float], max_downloads: Union[int, float, None] = None) -> List[str]:
```
Please use `|`. We are using Python 3.13, so we don't need `Union`; the `|` union syntax is recommended now.
```python
# if max_downloads is provided, filter for tasks that have less than
# or equal to the maximum number of downloads
if max_downloads is not None:
    if max_downloads <= 0 | max_downloads > max(lmeval_tasks["HF dataset downloads"]):
```
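The `|` in the condition above is the bug the later revision fixes: the bitwise operator binds tighter than comparisons, so the expression parses as a chained comparison rather than a boolean OR. A small demonstration:

```python
# "x <= 0 | x > 5" parses as "x <= (0 | x) > 5", a chained comparison,
# which differs from the intended "x <= 0 or x > 5".
x = 0
buggy = x <= 0 | x > 5     # 0 | 0 == 0, so this is 0 <= 0 > 5 -> False
correct = x <= 0 or x > 5  # -> True
print(buggy, correct)      # False True
```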
Actionable comments posted: 1
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
tests/model_explainability/lm_eval/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (4)
tests/model_explainability/lm_eval/utils.py (4)
39-39: LGTM! The function signature correctly uses modern Python union syntax and appropriately allows both integer and float types for download thresholds.
44-45: LGTM! The parameter documentation clearly explains the dual purpose of the parameters (download counts vs percentiles).
50-58: LGTM! The validation logic has been properly improved:
- Changed minimum threshold from "< 1" to "<= 0" which is more logical
- Added proper percentile validation ensuring values are between 0 and 1
- Validation order is correct (after CSV load, before conversion)
60-61: LGTM! The updated comments clearly explain the filtering logic and criteria.
```python
if max_downloads <= 0 or max_downloads > max(lmeval_tasks["HF dataset downloads"]):
    raise ValueError("Maximum downloads must be greater than 0 and less than the maximum number of downloads")
if isinstance(max_downloads, float):
    if not 0 <= min_downloads <= 1:
```
for more information, see https://pre-commit.ci
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/model_explainability/lm_eval/utils.py (1)
84-84: Consider enhancing the logging message to reflect both filtering parameters.
The logging message only mentions `min_downloads` but doesn't indicate when `max_downloads` filtering is also applied, which could be helpful for debugging tiered filtering. Consider updating the logging message to be more informative:

```diff
- LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
+ if max_downloads is not None:
+     LOGGER.info(f"Number of unique LMEval tasks with {min_downloads} to {max_downloads} downloads: {len(unique_tasks)}")
+ else:
+     LOGGER.info(f"Number of unique LMEval tasks with more than {min_downloads} downloads: {len(unique_tasks)}")
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/model_explainability/lm_eval/test_lm_eval.py (1 hunks)
tests/model_explainability/lm_eval/utils.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/model_explainability/lm_eval/test_lm_eval.py
🔇 Additional comments (4)
tests/model_explainability/lm_eval/utils.py (4)
39-39: LGTM: Function signature enhancement is well-designed. The addition of the optional `max_downloads` parameter with proper type annotation and default value supports the tiered filtering requirements effectively.
50-58: LGTM: Parameter validation is now robust. The validation logic correctly handles both integer download counts and percentile values, with appropriate error messages and proper ordering of checks.
68-77: LGTM: Maximum downloads filtering is correctly implemented. The validation and filtering logic for `max_downloads` properly addresses all previous review concerns, including correct variable usage, logical operators, and percentile validation.
44-45: LGTM: Documentation accurately reflects enhanced functionality. The updated docstring clearly describes the dual nature of both parameters as supporting either absolute counts or percentiles.
Status of building tag latest: success.
Description
This PR adds testing for Tier 2 tasks for LM-Eval. Tier 2 tasks are defined by tasks that are in the 70th percentile of downloads on HuggingFace but have less than 10,000 downloads.
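The tier split described above can be illustrated with a toy download table, reading floats as percentiles and ints as absolute HF download counts. The nearest-rank percentile helper and the task names are assumptions for this sketch, not the repository's implementation:

```python
downloads = {
    "arc": 250_000,
    "hellaswag": 120_000,
    "boolq": 9_000,
    "piqa": 4_000,
    "obscure_task": 50,
}

def percentile_threshold(values, pct):
    """Nearest-rank percentile of a list of download counts."""
    ordered = sorted(values)
    idx = max(0, int(pct * len(ordered)) - 1)
    return ordered[idx]

# Tier 1: at least 10,000 downloads.
# Tier 2: at or above the 70th percentile, but under 10,000 downloads.
cut = percentile_threshold(list(downloads.values()), 0.7)
tier1 = [t for t, d in downloads.items() if d >= 10_000]
tier2 = [t for t, d in downloads.items() if cut <= d < 10_000]
print(tier1, tier2)
```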
How Has This Been Tested?
Merge criteria: