feat: Add LMEval systematic testing with pod log collection and configmap parsing #382
artemsa223 wants to merge 3 commits into opendatahub-io:main
Conversation
- Implemented parse_configmap.py to fetch trustyai-lmeval-tasks.yaml.
- Added log-saving helpers to utils.py that write logs to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/.
- Added test_lm_eval_timing.py with a mock test and a mock LMEvalJob to avoid a NameError.
Walkthrough

New utilities and scripts have been added to automate extraction, selection, and benchmarking of language model evaluation (LMEval) tasks. These include parsing a ConfigMap from GitHub, selecting benchmark tasks, saving pod logs, and introducing parameterized timing benchmark tests. The changes primarily affect test infrastructure and utilities for LMEval jobs.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Parse as parse_configmap.py
    participant GitHub
    participant LocalFS
    User->>Parse: Run script
    Parse->>GitHub: Fetch ConfigMap YAML
    GitHub-->>Parse: Return YAML
    Parse->>Parse: Parse YAML, extract tasks
    Parse->>LocalFS: Save tasks as JSON, TXT, Python files
    Parse->>User: Print next steps and summary
```

```mermaid
sequenceDiagram
    participant User
    participant Select as select_benchmark_tasks.py
    participant LocalFS
    User->>Select: Run script
    Select->>LocalFS: Load all tasks from ConfigMap JSON
    Select->>Select: Select benchmark tasks
    Select->>LocalFS: Save selected tasks (Python, JSON)
    Select->>LocalFS: Generate pytest test code
    Select->>User: Print summary and instructions
```

```mermaid
sequenceDiagram
    participant Pytest
    participant Test as test_lmeval_timing_benchmark
    participant LMEvalJob
    participant Pod
    participant Logs as save_pod_logs
    participant LocalFS
    Pytest->>Test: Run test with task parameter
    Test->>LMEvalJob: Update and run job
    Test->>Pod: Wait for job to run, measure time
    Test->>Logs: Save pod logs
    Logs->>Pod: Retrieve logs
    Logs->>LocalFS: Write logs to file
    Test->>Pytest: Assert results
```
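The flow in the last diagram can be sketched as a small, cluster-independent helper. All names here (`job`, `wait_until_running`, `save_logs`) are illustrative stand-ins for the PR's fixtures, not its actual API:

```python
import time


def run_timing_benchmark(job, pod, task_name, save_logs):
    """Run one LMEval task, measure wall-clock time, then persist the pod logs.

    `job`, `pod`, and `save_logs` are hypothetical stand-ins for the real
    fixtures and helpers used by the test.
    """
    start = time.time()
    # Point the job at a single task with 1% sampling, mirroring the PR's test.
    job.update(task_list={"taskNames": [task_name], "sample_size": "1%"})
    job.wait_until_running()  # blocks until the LMEvalJob reports Running
    elapsed = time.time() - start
    log_file = save_logs(pod, task_name)
    return elapsed, log_file
```

The timing and the log path are returned together so the test can assert on both.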
Actionable comments posted: 4
🧹 Nitpick comments (7)
tests/model_explainability/lm_eval/utils.py (1)

6-6: Remove unused import. The Timeout import from utilities.constants is not used in this file.

```diff
-from utilities.constants import Timeout
```

tests/model_explainability/lm_eval/test_lm_eval_timing.py (2)
13-22: Consider using proper mocking instead of a custom mock class. The custom LMEvalJob mock class seems like a workaround. Consider using unittest.mock.MagicMock or importing the actual class with conditional handling.

```diff
-# Mock LMEvalJob class for local testing
-class LMEvalJob:
-    """Mock LMEvalJob class for type hinting in skipped GPU tests"""
-
-    def __init__(self):
-        pass
-
-    def update(self, task_list):
-        pass
+try:
+    from ocp_resources.lm_eval_job import LMEvalJob
+except ImportError:
+    # Mock for environments where ocp_resources is not available
+    from unittest.mock import MagicMock
+    LMEvalJob = MagicMock
```
86-94: Simplify assertion logic. The path verification assertion is overly complex and could be simplified for better readability.

```diff
-    assert (
-        Path(log_file)
-        == Path(__file__).parent / "task_lists" / "lmeval_task_logs" / namespace / f"{task_name}_logs.txt"
-    ), f"Log file saved to incorrect path: {log_file}"
+    expected_path = Path(__file__).parent / "task_lists" / "lmeval_task_logs" / namespace / f"{task_name}_logs.txt"
+    assert Path(log_file) == expected_path, f"Log file saved to incorrect path: {log_file}"
```

tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py (2)
10-11: Fix PEP 8 formatting: add blank lines before the function definition. Top-level functions should be preceded by two blank lines according to PEP 8.

```diff
 from pathlib import Path
+
+
 def load_configmap_tasks():
```
77-85: Remove unnecessary f-string prefixes. Several f-strings contain no placeholders and should be plain strings for clarity.

```diff
-        f.write(f"# Selected tasks for timing benchmark\n")
-        f.write(f"# These are popular/representative LMEval tasks\n\n")
+        f.write("# Selected tasks for timing benchmark\n")
+        f.write("# These are popular/representative LMEval tasks\n\n")
         f.write("TIMING_BENCHMARK_TASKS = [\n")
         for task in tasks:
             f.write(f'    "{task}",\n')
         f.write("]\n\n")
         f.write(f"# Total: {len(tasks)} tasks\n")
-        f.write(f"# Estimated time: ~25min × 10 = 4+ hours for 1% sampling\n")
+        f.write("# Estimated time: ~25min × 10 = 4+ hours for 1% sampling\n")
```

tests/model_explainability/lm_eval/utils/parse_configmap.py (2)
195-195: Remove the unused variable. The files variable is assigned but never used.

```diff
     # Save results
-    files = save_extracted_tasks(tasks, task_details, configmap)
+    save_extracted_tasks(tasks, task_details, configmap)
```
14-15: Fix PEP 8 formatting: add blank lines before the function definition. Top-level functions should be preceded by two blank lines according to PEP 8.

```diff
 import yaml
+
+
 def parse_configmap(
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- tests/model_explainability/lm_eval/test_lm_eval_timing.py (1 hunks)
- tests/model_explainability/lm_eval/utils.py (2 hunks)
- tests/model_explainability/lm_eval/utils/parse_configmap.py (1 hunks)
- tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py
[error] 10-10: expected 2 blank lines, found 1
(E302)
[error] 21-21: expected 2 blank lines, found 1
(E302)
[error] 60-60: expected 2 blank lines, found 1
(E302)
[error] 81-81: f-string is missing placeholders
(F541)
[error] 85-85: expected 2 blank lines, found 1
(E302)
[error] 132-132: expected 2 blank lines, found 1
(E302)
[error] 137-137: f-string is missing placeholders
(F541)
[error] 144-144: f-string is missing placeholders
(F541)
[error] 145-145: f-string is missing placeholders
(F541)
[error] 147-147: f-string is missing placeholders
(F541)
[error] 148-148: f-string is missing placeholders
(F541)
[error] 151-151: expected 2 blank lines after class or function definition, found 1
(E305)
tests/model_explainability/lm_eval/test_lm_eval_timing.py
[error] 8-8: 'utilities.constants.Timeout' imported but unused
(F401)
tests/model_explainability/lm_eval/utils/parse_configmap.py
[error] 14-14: expected 2 blank lines, found 1
(E302)
[error] 47-47: f-string is missing placeholders
(F541)
[error] 108-108: expected 2 blank lines, found 1
(E302)
[error] 137-137: f-string is missing placeholders
(F541)
[error] 155-155: f-string is missing placeholders
(F541)
[error] 168-168: expected 2 blank lines, found 1
(E302)
[error] 170-170: f-string is missing placeholders
(F541)
[error] 171-171: f-string is missing placeholders
(F541)
[error] 172-172: f-string is missing placeholders
(F541)
[error] 173-173: f-string is missing placeholders
(F541)
[error] 174-174: f-string is missing placeholders
(F541)
[error] 175-175: f-string is missing placeholders
(F541)
[error] 177-177: expected 2 blank lines, found 1
(E302)
[error] 197-197: local variable 'files' is assigned to but never used
(F841)
[error] 204-204: expected 2 blank lines after class or function definition, found 1
(E305)
🪛 Pylint (3.3.7)
tests/model_explainability/lm_eval/test_lm_eval_timing.py
[refactor] 15-15: Too few public methods (1/2)
(R0903)
tests/model_explainability/lm_eval/utils/parse_configmap.py
[refactor] 14-14: Too many local variables (20/15)
(R0914)
[refactor] 16-106: Too many nested blocks (6/5)
(R1702)
[refactor] 16-106: Too many nested blocks (7/5)
(R1702)
[refactor] 14-14: Too many statements (55/50)
(R0915)
🪛 Ruff (0.11.9)
tests/model_explainability/lm_eval/utils/parse_configmap.py
197-197: Local variable files is assigned to but never used
Remove assignment to unused variable files
(F841)
tests/model_explainability/lm_eval/utils.py:

```python
def save_pod_logs(pod: Pod, task_name: str, namespace: str):
    """
    Save pod logs to a file
    """
    output_dir = Path(__file__).parent.parent / "lm_eval" / "task_lists" / "lmeval_task_logs" / namespace
    output_dir.mkdir(parents=True, exist_ok=True)
    log_file = output_dir / f"{task_name}_logs.txt"
    logs = pod.get_logs()
    with open(log_file, "w") as f:
        f.write(logs)
    LOGGER.info(f"Saved logs for task {task_name} (pod: {pod.name}) to {log_file}")
    return str(log_file)
```
🛠️ Refactor suggestion
Add error handling for pod log retrieval.
The function should handle potential failures when calling pod.get_logs(), as this operation can fail due to network issues, pod status, or permissions.
```diff
 def save_pod_logs(pod: Pod, task_name: str, namespace: str):
     """
     Save pod logs to a file
     """
     output_dir = Path(__file__).parent.parent / "lm_eval" / "task_lists" / "lmeval_task_logs" / namespace
     output_dir.mkdir(parents=True, exist_ok=True)
     log_file = output_dir / f"{task_name}_logs.txt"
-    logs = pod.get_logs()
+    try:
+        logs = pod.get_logs()
+    except Exception as e:
+        LOGGER.error(f"Failed to retrieve logs for pod {pod.name}: {e}")
+        logs = f"Error retrieving logs: {e}"
     with open(log_file, "w") as f:
         f.write(logs)
     LOGGER.info(f"Saved logs for task {task_name} (pod: {pod.name}) to {log_file}")
     return str(log_file)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def save_pod_logs(pod: Pod, task_name: str, namespace: str):
    """
    Save pod logs to a file
    """
    output_dir = Path(__file__).parent.parent / "lm_eval" / "task_lists" / "lmeval_task_logs" / namespace
    output_dir.mkdir(parents=True, exist_ok=True)
    log_file = output_dir / f"{task_name}_logs.txt"
    try:
        logs = pod.get_logs()
    except Exception as e:
        LOGGER.error(f"Failed to retrieve logs for pod {pod.name}: {e}")
        logs = f"Error retrieving logs: {e}"
    with open(log_file, "w") as f:
        f.write(logs)
    LOGGER.info(f"Saved logs for task {task_name} (pod: {pod.name}) to {log_file}")
    return str(log_file)
```
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils.py around lines 36 to 47, the
function save_pod_logs calls pod.get_logs() without error handling, which can
fail due to network or permission issues. Wrap the pod.get_logs() call in a
try-except block to catch exceptions, log an error message if retrieval fails,
and handle the failure gracefully, such as returning None or an empty string
instead of crashing.
tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py:

```python
def generate_test_code(tasks):
    """Generate pytest code for the timing benchmark"""
    test_code = """# Timing benchmark test for LMEval systematic testing
import pytest
import time
from kubernetes.dynamic import DynamicClient
from ocp_resources.pod import Pod
from tests.model_explainability.lm_eval.utils import verify_lmevaljob_running, save_pod_logs
from utilities.constants import Timeout
from simple_logger.logger import get_logger

LOGGER = get_logger(name=__name__)

@pytest.mark.gpu
@pytest.mark.parametrize(
    "model_namespace, task_name",
    [
"""
    for task in tasks:
        test_code += f'''
        pytest.param(
            {{"name": "test-lmeval-timing-{task.replace("_", "-")}"}},
            "{task}",
            id="{task}"
        ),'''
    test_code += '''
    ],
    indirect=["model_namespace"],
)
def test_lmeval_timing_benchmark(admin_client: DynamicClient, model_namespace, task_name: str, lmevaljob_vllm_emulator: LMEvalJob, lmevaljob_vllm_emulator_pod: Pod):
    """Timing benchmark test - measures execution time for representative tasks"""
    start_time = time.time()
    lmevaljob_vllm_emulator.update(task_list={"taskNames": [task_name], "sample_size": "1%"})
    verify_lmevaljob_running(client=admin_client, lmevaljob=lmevaljob_vllm_emulator)
    execution_time = time.time() - start_time
    log_file = save_pod_logs(
        pod=lmevaljob_vllm_emulator_pod,
        task_name=task_name,
        namespace=model_namespace.name,
    )
    LOGGER.info(f"Task {task_name} took {execution_time:.2f} seconds with 1% sampling. Logs saved to {log_file}")
'''
    output_dir = Path(__file__).parent.parent
    with open(output_dir / "test_lm_eval_timing.py", "w") as f:
        f.write(test_code)
    print(f" 🧪 Test code: {output_dir / 'test_lm_eval_timing.py'}")
```
🛠️ Refactor suggestion
Refactor test code generation for better maintainability.
The string concatenation approach for generating test code is hard to maintain. Consider using a template engine or at least break it into smaller, more readable chunks.
Consider using a template-based approach:
```diff
 def generate_test_code(tasks):
     """Generate pytest code for the timing benchmark"""
-    test_code = """# Timing benchmark test for LMEval systematic testing
-import pytest
-import time
-from kubernetes.dynamic import DynamicClient
-from ocp_resources.pod import Pod
-from tests.model_explainability.lm_eval.utils import verify_lmevaljob_running, save_pod_logs
-from utilities.constants import Timeout
-from simple_logger.logger import get_logger
-
-LOGGER = get_logger(name=__name__)
-
-@pytest.mark.gpu
-@pytest.mark.parametrize(
-    "model_namespace, task_name",
-    [
-"""
+    # Use a template approach or break the code into logical sections
+    header = _generate_test_header()
+    parametrize_section = _generate_parametrize_section(tasks)
+    test_function = _generate_test_function()
+    test_code = header + parametrize_section + test_function
```

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Flake8 (7.2.0)
[error] 132-132: expected 2 blank lines, found 1
(E302)
[error] 137-137: f-string is missing placeholders
(F541)
[error] 144-144: f-string is missing placeholders
(F541)
[error] 145-145: f-string is missing placeholders
(F541)
[error] 147-147: f-string is missing placeholders
(F541)
[error] 148-148: f-string is missing placeholders
(F541)
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py around
lines 103 to 149, the test code generation uses extensive string concatenation
which reduces maintainability and readability. Refactor this by using a template
engine like Jinja2 or Python's built-in string.Template to separate the static
test code structure from the dynamic task data. Alternatively, break the code
into smaller string chunks and join them cleanly instead of appending to a
single string, improving clarity and ease of updates.
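As a concrete illustration of the template idea in the prompt above, the parametrize entries could be rendered with the standard-library string.Template. The names below (`PARAM_TMPL`, `render_parametrize_entries`) are hypothetical, not part of the PR:

```python
from string import Template

# Hypothetical sketch: one template per generated pytest.param entry.
# Template only treats "$" specially, so literal braces need no escaping.
PARAM_TMPL = Template('    pytest.param({"name": "test-lmeval-timing-$slug"}, "$task", id="$task"),')


def render_parametrize_entries(tasks):
    """Render one pytest.param line per task; the id slug swaps underscores for dashes."""
    return "\n".join(
        PARAM_TMPL.substitute(slug=task.replace("_", "-"), task=task)
        for task in tasks
    )
```

Compared to f-string concatenation, the static structure lives in one place and the per-task substitution is explicit.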
tests/model_explainability/lm_eval/utils/parse_configmap.py:

```python
def parse_configmap(
    github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
):
    """Parse the ConfigMap from GitHub to extract LMEval tasks"""
    try:
        # Fetch YAML content from GitHub
        response = requests.get(github_url)
        response.raise_for_status()  # Raise exception for bad status codes
        configmap = yaml.safe_load(response.text)

        print(f"✅ Successfully loaded ConfigMap from: {github_url}")

        # Print structure to understand the format
        print("\n📋 ConfigMap Structure:")
        print(f"   Kind: {configmap.get('kind')}")
        print(f"   Metadata name: {configmap.get('metadata', {}).get('name')}")

        data_keys = list(configmap.get("data", {}).keys())
        print(f"   Data keys: {data_keys}")

        # Extract task information
        tasks = []
        task_details = []
        data = configmap.get("data", {})

        print(f"\n🔍 Found data keys: {list(data.keys())}")

        if "tasks" in data:
            tasks_json = data["tasks"]
            print(f"   📝 Tasks data type: {type(tasks_json)}")
            print(f"   📝 Tasks data preview: {tasks_json[:300]}...")

            try:
                # Parse the JSON string
                parsed_data = json.loads(tasks_json)
                print("   ✅ Successfully parsed JSON")
                print(f"   📊 Top-level keys: {list(parsed_data.keys())}")

                # Extract tasks from "lm-evaluation-harness" array
                if "lm-evaluation-harness" in parsed_data:
                    lm_eval_tasks = parsed_data["lm-evaluation-harness"]
                    print(f"   📋 Found {len(lm_eval_tasks)} lm-evaluation-harness tasks")

                    for task_obj in lm_eval_tasks:
                        if isinstance(task_obj, dict) and "name" in task_obj:
                            task_name = task_obj["name"]
                            task_desc = task_obj.get("description", "No description")
                            tasks.append(task_name)
                            task_details.append({"name": task_name, "description": task_desc})

                # Check for other task categories
                for key, value in parsed_data.items():
                    if key != "lm-evaluation-harness" and isinstance(value, list):
                        print(f"   📋 Found additional category '{key}' with {len(value)} tasks")
                        for task_obj in value:
                            if isinstance(task_obj, dict) and "name" in task_obj:
                                task_name = task_obj["name"]
                                task_desc = task_obj.get("description", "No description")
                                tasks.append(task_name)
                                task_details.append({"name": task_name, "description": task_desc, "category": key})

            except json.JSONDecodeError as e:
                print(f"   ❌ JSON parsing failed: {e}")
                # Fallback to regex extraction
                name_matches = re.findall(r'"name":\s*"([^"]+)"', tasks_json)
                tasks.extend(name_matches)
                print(f"   🔄 Fallback: extracted {len(name_matches)} task names using regex")

        # Clean and deduplicate tasks
        clean_tasks = [task.strip() for task in tasks if isinstance(task, str) and task.strip()]
        unique_tasks = sorted(set(clean_tasks))

        print(f"\n✅ Extracted {len(unique_tasks)} unique tasks from ConfigMap")

        # Show categories if found
        categories = {detail["category"] for detail in task_details if "category" in detail}
        if categories:
            print(f"   📂 Task categories found: {', '.join(categories)}")

        return unique_tasks, task_details, configmap

    except requests.exceptions.RequestException as e:
        print(f"❌ Failed to fetch ConfigMap from GitHub: {e}")
        print(f"💡 Ensure the GitHub URL is accessible: {github_url}")
        return None, None, None
    except yaml.YAMLError as e:
        print(f"❌ Error parsing YAML: {e}")
        return None, None, None
```
🛠️ Refactor suggestion

Refactor the function to reduce complexity. The parse_configmap function is too complex, with excessive nesting and multiple responsibilities. Consider breaking it into smaller functions:

```python
def fetch_configmap_data(github_url: str) -> dict:
    """Fetch and parse YAML from GitHub"""
    # URL validation and fetching logic


def extract_tasks_from_json(tasks_json: str) -> tuple:
    """Extract tasks from the JSON string"""
    # JSON parsing and task extraction logic


def parse_configmap(github_url: str = "...") -> tuple:
    """Main orchestration function"""
    configmap = fetch_configmap_data(github_url)
    tasks, task_details = extract_tasks_from_json(configmap["data"]["tasks"])
    return clean_and_deduplicate_tasks(tasks), task_details, configmap
```

🧰 Tools
return clean_and_deduplicate_tasks(tasks), task_details, configmap🧰 Tools
🪛 Flake8 (7.2.0)
[error] 47-47: f-string is missing placeholders
(F541)
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/parse_configmap.py from lines 15 to
103, the parse_configmap function is overly complex with deep nesting and
multiple responsibilities. Refactor by splitting it into smaller functions: one
to fetch and parse the YAML configmap from GitHub (fetch_configmap_data),
another to extract tasks from the JSON string in the configmap data
(extract_tasks_from_json), and a main orchestration function (parse_configmap)
that calls these helpers and returns cleaned, deduplicated tasks along with task
details and the configmap. This will improve readability and maintainability.
tests/model_explainability/lm_eval/utils/parse_configmap.py:

```python
def parse_configmap(
    github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
):
    """Parse the ConfigMap from GitHub to extract LMEval tasks"""
    try:
        # Fetch YAML content from GitHub
        response = requests.get(github_url)
        response.raise_for_status()  # Raise exception for bad status codes
        configmap = yaml.safe_load(response.text)
```
Add URL validation and security considerations.
Fetching data from external URLs poses security risks. Consider adding URL validation and timeout settings.
```diff
 def parse_configmap(
     github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
 ):
     """Parse the ConfigMap from GitHub to extract LMEval tasks"""
     try:
+        # Validate URL is from expected domain
+        if not github_url.startswith("https://raw.githubusercontent.com/"):
+            raise ValueError(f"URL must be from GitHub raw content: {github_url}")
+
         # Fetch YAML content from GitHub
-        response = requests.get(github_url)
+        response = requests.get(github_url, timeout=30)
         response.raise_for_status()  # Raise exception for bad status codes
```

📝 Committable suggestion
```python
def parse_configmap(
    github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
):
    """Parse the ConfigMap from GitHub to extract LMEval tasks"""
    try:
        # Validate URL is from expected domain
        if not github_url.startswith("https://raw.githubusercontent.com/"):
            raise ValueError(f"URL must be from GitHub raw content: {github_url}")

        # Fetch YAML content from GitHub
        response = requests.get(github_url, timeout=30)
        response.raise_for_status()  # Raise exception for bad status codes
        configmap = yaml.safe_load(response.text)
```
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/parse_configmap.py around lines 15
to 23, the function fetches data from an external URL without validating the URL
or setting a timeout, which can cause security and reliability issues. Add
validation to ensure the github_url is a valid and expected URL format, and
include a timeout parameter in the requests.get call to prevent hanging
requests. This will improve security and robustness when fetching the ConfigMap.
Add LMEval systematic testing with pod log collection and configmap parsing

Description

- Implemented parse_configmap.py to fetch trustyai-lmeval-tasks.yaml.
- Added log-saving helpers to utils.py that write logs to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/.
- Added test_lm_eval_timing.py with a mock test and a mock LMEvalJob to avoid a NameError.
How Has This Been Tested?

The changes were tested locally using Python 3.12 in a virtual environment (venv3.12) on macOS. The test suite includes mock tests to validate functionality without GPU cluster access.

Mock test in test_lm_eval_timing.py:

```shell
pytest tests/model_explainability/lm_eval/test_lm_eval_timing.py -k test_lmeval_timing_benchmark_mock -v
```

Verified logging for the arc_easy task, confirming logs are saved to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/test-namespace/arc_easy_logs.txt with the expected content ("Mock arc_easy task output Completed successfully").
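The namespace-scoped log layout the mock test checks can be sketched as a small helper. This is an illustrative reconstruction of the mock behavior, not the PR's actual `save_pod_logs` implementation:

```python
import tempfile
from pathlib import Path


def save_mock_pod_logs(logs: str, task_name: str, namespace: str, base_dir: Path) -> Path:
    """Write logs under <base_dir>/<namespace>/<task_name>_logs.txt, creating dirs as needed."""
    out_dir = base_dir / namespace
    out_dir.mkdir(parents=True, exist_ok=True)
    log_file = out_dir / f"{task_name}_logs.txt"
    log_file.write_text(logs)
    return log_file
```

A test can then assert on the returned path and the file content without any cluster access.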
ConfigMap parsing:

Ran parse_configmap.py, fetching from https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml:

```shell
python tests/model_explainability/lm_eval/utils/parse_configmap.py
```

Verified generation of configmap_tasks.py, configmap_tasks.json, and task_names_only.txt in tests/model_explainability/lm_eval/task_lists/, containing ~6,800 tasks.
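Downstream consumers can load the generated task list from configmap_tasks.json. The file's exact shape is an assumption here (a flat list of names, or a mapping with a "tasks" key), so adjust to match the actual artifact:

```python
import json
from pathlib import Path


def load_task_names(task_lists_dir: Path) -> list:
    """Load task names from the generated configmap_tasks.json.

    Assumed shape: either a flat list of names or a mapping with a
    "tasks" key; this is a sketch, not the PR's loader.
    """
    data = json.loads((task_lists_dir / "configmap_tasks.json").read_text())
    return data if isinstance(data, list) else data.get("tasks", [])
```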