
feat: Add LMEval systematic testing with pod log collection and configmap parsing #382

Closed
artemsa223 wants to merge 3 commits into opendatahub-io:main from artemsa223:lmeval-systematic-logging

Conversation

@artemsa223

@artemsa223 artemsa223 commented Jun 26, 2025


Add LMEval systematic testing with pod log collection and configmap parsing

Description

Implemented parse_configmap.py to fetch trustyai-lmeval-tasks.yaml.

Added logging utilities in utils.py to save logs to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/.

Added test_lm_eval_timing.py with a mock test and a mock LMEvalJob to avoid a NameError.

How Has This Been Tested?

The changes were tested locally using Python 3.12 in a virtual environment (venv3.12) on macOS. The test suite includes mock tests to validate functionality without GPU cluster access:
Mock Test in test_lm_eval_timing.py:
pytest tests/model_explainability/lm_eval/test_lm_eval_timing.py -k test_lmeval_timing_benchmark_mock -v
Verified logging for the arc_easy task, confirming logs are saved to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/test-namespace/arc_easy_logs.txt with correct content (Mock arc_easy task output Completed successfully).
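Without cluster access, the log-saving contract can be checked against a fake pod. The sketch below is a minimal stand-in, not the PR's actual fixtures: `save_logs` and the `MagicMock` pod are hypothetical names mirroring the shape of `save_pod_logs`.

```python
from pathlib import Path
from unittest.mock import MagicMock
import tempfile

# Hypothetical stand-in for save_pod_logs: same contract, no cluster needed.
def save_logs(pod, task_name: str, out_dir: Path) -> Path:
    out_dir.mkdir(parents=True, exist_ok=True)
    log_file = out_dir / f"{task_name}_logs.txt"
    log_file.write_text(pod.get_logs())
    return log_file

# Fake pod returning the canned log line the mock test checks for.
pod = MagicMock()
pod.get_logs.return_value = "Mock arc_easy task output Completed successfully"

with tempfile.TemporaryDirectory() as d:
    log_path = save_logs(pod, "arc_easy", Path(d) / "test-namespace")
    content = log_path.read_text()

print(content)  # Mock arc_easy task output Completed successfully
```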

Configmap Parsing:
Ran parse_configmap.py with fetch from https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml
python tests/model_explainability/lm_eval/utils/parse_configmap.py
Verified generation of configmap_tasks.py, configmap_tasks.json, and task_names_only.txt in tests/model_explainability/lm_eval/task_lists/, containing ~6,800 tasks.
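The generated files can then be consumed by downstream scripts. A minimal sketch, assuming configmap_tasks.json is a flat JSON array of task-name strings (the PR's actual schema may differ):

```python
import json
from pathlib import Path

# Assumption: configmap_tasks.json holds a flat JSON array of task names.
def load_task_names(path: Path) -> list[str]:
    names = json.loads(path.read_text())
    # Drop blanks/non-strings, dedupe, and sort for stable ordering.
    return sorted({n.strip() for n in names if isinstance(n, str) and n.strip()})

# Write a tiny sample file and load it back.
sample = Path("configmap_tasks.json")
sample.write_text(json.dumps(["arc_easy", "hellaswag", "arc_easy", " "]))
tasks = load_task_names(sample)
print(tasks)  # ['arc_easy', 'hellaswag']
```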

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • New Features

    • Added timing benchmark tests for language model evaluation tasks, including both GPU-dependent and mock tests.
    • Introduced automated scripts to parse task definitions from a remote ConfigMap and extract, process, and save task lists in multiple formats.
    • Added a script to select representative tasks for timing benchmarks and generate corresponding test files.
  • Tests

    • Implemented new tests to measure execution time and save logs for evaluation jobs, including validation of log file creation and content.
  • Chores

    • Enhanced utilities for saving pod logs during evaluation tasks.

…gmap parsing

Implemented parse_configmap.py to fetch trustyai-lmeval-tasks.yaml

Added logging utils.py to save logs to tests/model_explainability/lm_eval/task_lists/lmeval_task_logs/.

Added test_lm_eval_timing.py with mock test and mock LMEvalJob to avoid NameError.
@github-actions

The following are automatically added/executed:

  • PR size label.
  • Run pre-commit
  • Run tox
  • Add PR author as the PR assignee
  • Build image based on the PR

Available user actions:

  • To mark a PR as WIP, add /wip in a comment. To remove the WIP status, comment /wip cancel on the PR.
  • To block merging of a PR, add /hold in a comment. To unblock merging, comment /hold cancel.
  • To mark a PR as approved, add /lgtm in a comment. To remove approval, add /lgtm cancel.
    The lgtm label is removed on each new commit push.
  • To mark a PR as verified, comment /verified on the PR; to un-verify, comment /verified cancel.
    The verified label is removed on each new commit push.
  • To cherry-pick a merged PR, comment /cherry-pick <target_branch_name> on the PR. If <target_branch_name> is valid
    and the current PR is merged, a cherry-picked PR will be created and linked to the current PR.
  • To build and push an image to quay, add /build-push-pr-image in a comment. This creates an image tagged
    pr-<pr_number> in the quay repository. The image tag is deleted when the PR is merged or closed.
Supported labels

{'/hold', '/cherry-pick', '/lgtm', '/verified', '/build-push-pr-image', '/wip'}

@coderabbitai

coderabbitai bot commented Jun 26, 2025

Walkthrough

New utilities and scripts have been added to automate extraction, selection, and benchmarking of language model evaluation (LMEval) tasks. These include parsing a ConfigMap from GitHub, selecting benchmark tasks, saving pod logs, and introducing parameterized timing benchmark tests. The changes primarily affect test infrastructure and utilities for LMEval jobs.

Changes

File(s) | Change Summary
tests/model_explainability/lm_eval/test_lm_eval_timing.py | Added parameterized and mock timing benchmark tests for LMEvalJob, including pod log saving and runtime measurement.
tests/model_explainability/lm_eval/utils.py | Introduced save_pod_logs function to persist pod logs for a given task and namespace.
tests/model_explainability/lm_eval/utils/parse_configmap.py | New script to fetch, parse, and extract task lists from a remote ConfigMap YAML, saving results in multiple formats.
tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py | New script to select and save representative benchmark tasks, and generate corresponding pytest test code.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script(parse_configmap.py)
    participant GitHub
    participant LocalFS

    User->>Script(parse_configmap.py): Run script
    Script(parse_configmap.py)->>GitHub: Fetch ConfigMap YAML
    GitHub-->>Script(parse_configmap.py): Return YAML
    Script(parse_configmap.py)->>Script(parse_configmap.py): Parse YAML, extract tasks
    Script(parse_configmap.py)->>LocalFS: Save tasks as JSON, TXT, Python files
    Script(parse_configmap.py)->>User: Print next steps and summary
sequenceDiagram
    participant User
    participant Script(select_benchmark_tasks.py)
    participant LocalFS

    User->>Script(select_benchmark_tasks.py): Run script
    Script(select_benchmark_tasks.py)->>LocalFS: Load all tasks from ConfigMap JSON
    Script(select_benchmark_tasks.py)->>Script(select_benchmark_tasks.py): Select benchmark tasks
    Script(select_benchmark_tasks.py)->>LocalFS: Save selected tasks (Python, JSON)
    Script(select_benchmark_tasks.py)->>LocalFS: Generate pytest test code
    Script(select_benchmark_tasks.py)->>User: Print summary and instructions
sequenceDiagram
    participant Pytest
    participant test_lmeval_timing_benchmark
    participant LMEvalJob
    participant Pod
    participant save_pod_logs

    Pytest->>test_lmeval_timing_benchmark: Run test with task parameter
    test_lmeval_timing_benchmark->>LMEvalJob: Update and run job
    test_lmeval_timing_benchmark->>Pod: Wait for job to run, measure time
    test_lmeval_timing_benchmark->>save_pod_logs: Save pod logs
    save_pod_logs->>Pod: Retrieve logs
    save_pod_logs->>LocalFS: Write logs to file
    test_lmeval_timing_benchmark->>Pytest: Assert results

Suggested labels

Verified, size/l

Poem

In the warren of code, with scripts so bright,
Rabbits parsed ConfigMaps by pale monitor light.
Tasks were selected, benchmarks set with care,
Pod logs were gathered from clusters out there.
With timing and testing, our burrow’s delight—
The garden of LMEval grows ever more right!
🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
tests/model_explainability/lm_eval/utils.py (1)

6-6: Remove unused import.

The Timeout import from utilities.constants is not used in this file.

-from utilities.constants import Timeout
tests/model_explainability/lm_eval/test_lm_eval_timing.py (2)

13-22: Consider using proper mocking instead of custom mock class.

The custom LMEvalJob mock class seems like a workaround. Consider using unittest.mock.MagicMock or importing the actual class with conditional handling.

-# Mock LMEvalJob class for local testing
-class LMEvalJob:
-    """Mock LMEvalJob class for type hinting in skipped GPU tests"""
-
-    def __init__(self):
-        pass
-
-    def update(self, task_list):
-        pass
+try:
+    from ocp_resources.lm_eval_job import LMEvalJob
+except ImportError:
+    # Mock for environments where ocp_resources is not available
+    from unittest.mock import MagicMock
+    LMEvalJob = MagicMock

86-94: Simplify assertion logic.

The path verification assertion is overly complex and could be simplified for better readability.

-    assert (
-        Path(log_file)
-        == Path(__file__).parent / "task_lists" / "lmeval_task_logs" / namespace / f"{task_name}_logs.txt"
-    ), f"Log file saved to incorrect path: {log_file}"
+    expected_path = Path(__file__).parent / "task_lists" / "lmeval_task_logs" / namespace / f"{task_name}_logs.txt"
+    assert Path(log_file) == expected_path, f"Log file saved to incorrect path: {log_file}"
tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py (2)

10-11: Fix PEP8 formatting: add blank line before function definition.

Functions should be preceded by two blank lines according to PEP8.

 from pathlib import Path

+
 def load_configmap_tasks():

77-85: Remove unnecessary f-string formatting.

Several f-strings don't contain placeholders and should be regular strings for better performance and clarity.

-        f.write(f"# Selected tasks for timing benchmark\n")
-        f.write(f"# These are popular/representative LMEval tasks\n\n")
+        f.write("# Selected tasks for timing benchmark\n")
+        f.write("# These are popular/representative LMEval tasks\n\n")
         f.write("TIMING_BENCHMARK_TASKS = [\n")
         for task in tasks:
             f.write(f'    "{task}",\n')
         f.write("]\n\n")
         f.write(f"# Total: {len(tasks)} tasks\n")
-        f.write(f"# Estimated time: ~25min × 10 = 4+ hours for 1% sampling\n")
+        f.write("# Estimated time: ~25min × 10 = 4+ hours for 1% sampling\n")
tests/model_explainability/lm_eval/utils/parse_configmap.py (2)

195-195: Remove unused variable.

The files variable is assigned but never used.

     # Save results
-    files = save_extracted_tasks(tasks, task_details, configmap)
+    save_extracted_tasks(tasks, task_details, configmap)

14-15: Fix PEP8 formatting: add blank line before function definition.

Functions should be preceded by two blank lines according to PEP8.

 import yaml

+
 def parse_configmap(
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 906091d and 767fd05.

📒 Files selected for processing (4)
  • tests/model_explainability/lm_eval/test_lm_eval_timing.py (1 hunks)
  • tests/model_explainability/lm_eval/utils.py (2 hunks)
  • tests/model_explainability/lm_eval/utils/parse_configmap.py (1 hunks)
  • tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py (1 hunks)
🧰 Additional context used
🪛 Flake8 (7.2.0)
tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py

[error] 10-10: expected 2 blank lines, found 1

(E302)


[error] 21-21: expected 2 blank lines, found 1

(E302)


[error] 60-60: expected 2 blank lines, found 1

(E302)


[error] 81-81: f-string is missing placeholders

(F541)


[error] 85-85: expected 2 blank lines, found 1

(E302)


[error] 132-132: expected 2 blank lines, found 1

(E302)


[error] 137-137: f-string is missing placeholders

(F541)


[error] 144-144: f-string is missing placeholders

(F541)


[error] 145-145: f-string is missing placeholders

(F541)


[error] 147-147: f-string is missing placeholders

(F541)


[error] 148-148: f-string is missing placeholders

(F541)


[error] 151-151: expected 2 blank lines after class or function definition, found 1

(E305)

tests/model_explainability/lm_eval/test_lm_eval_timing.py

[error] 8-8: 'utilities.constants.Timeout' imported but unused

(F401)

tests/model_explainability/lm_eval/utils/parse_configmap.py

[error] 14-14: expected 2 blank lines, found 1

(E302)


[error] 47-47: f-string is missing placeholders

(F541)


[error] 108-108: expected 2 blank lines, found 1

(E302)


[error] 137-137: f-string is missing placeholders

(F541)


[error] 155-155: f-string is missing placeholders

(F541)


[error] 168-168: expected 2 blank lines, found 1

(E302)


[error] 170-170: f-string is missing placeholders

(F541)


[error] 171-171: f-string is missing placeholders

(F541)


[error] 172-172: f-string is missing placeholders

(F541)


[error] 173-173: f-string is missing placeholders

(F541)


[error] 174-174: f-string is missing placeholders

(F541)


[error] 175-175: f-string is missing placeholders

(F541)


[error] 177-177: expected 2 blank lines, found 1

(E302)


[error] 197-197: local variable 'files' is assigned to but never used

(F841)


[error] 204-204: expected 2 blank lines after class or function definition, found 1

(E305)

🪛 Pylint (3.3.7)
tests/model_explainability/lm_eval/test_lm_eval_timing.py

[refactor] 15-15: Too few public methods (1/2)

(R0903)

tests/model_explainability/lm_eval/utils/parse_configmap.py

[refactor] 14-14: Too many local variables (20/15)

(R0914)


[refactor] 16-106: Too many nested blocks (6/5)

(R1702)


[refactor] 16-106: Too many nested blocks (7/5)

(R1702)


[refactor] 14-14: Too many statements (55/50)

(R0915)

🪛 Ruff (0.11.9)
tests/model_explainability/lm_eval/utils/parse_configmap.py

197-197: Local variable files is assigned to but never used

Remove assignment to unused variable files

(F841)

Comment on lines +36 to +47
def save_pod_logs(pod: Pod, task_name: str, namespace: str):
    """
    Save pod logs to a file
    """
    output_dir = Path(__file__).parent.parent / "lm_eval" / "task_lists" / "lmeval_task_logs" / namespace
    output_dir.mkdir(parents=True, exist_ok=True)
    log_file = output_dir / f"{task_name}_logs.txt"
    logs = pod.get_logs()
    with open(log_file, "w") as f:
        f.write(logs)
    LOGGER.info(f"Saved logs for task {task_name} (pod: {pod.name}) to {log_file}")
    return str(log_file)

🛠️ Refactor suggestion

Add error handling for pod log retrieval.

The function should handle potential failures when calling pod.get_logs(), as this operation can fail due to network issues, pod status, or permissions.

 def save_pod_logs(pod: Pod, task_name: str, namespace: str):
     """
     Save pod logs to a file
     """
     output_dir = Path(__file__).parent.parent / "lm_eval" / "task_lists" / "lmeval_task_logs" / namespace
     output_dir.mkdir(parents=True, exist_ok=True)
     log_file = output_dir / f"{task_name}_logs.txt"
-    logs = pod.get_logs()
+    try:
+        logs = pod.get_logs()
+    except Exception as e:
+        LOGGER.error(f"Failed to retrieve logs for pod {pod.name}: {e}")
+        logs = f"Error retrieving logs: {e}"
     with open(log_file, "w") as f:
         f.write(logs)
     LOGGER.info(f"Saved logs for task {task_name} (pod: {pod.name}) to {log_file}")
     return str(log_file)
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils.py around lines 36 to 47, the
function save_pod_logs calls pod.get_logs() without error handling, which can
fail due to network or permission issues. Wrap the pod.get_logs() call in a
try-except block to catch exceptions, log an error message if retrieval fails,
and handle the failure gracefully, such as returning None or an empty string
instead of crashing.

Comment on lines +103 to +149
def generate_test_code(tasks):
    """Generate pytest code for the timing benchmark"""
    test_code = """# Timing benchmark test for LMEval systematic testing
import pytest
import time
from kubernetes.dynamic import DynamicClient
from ocp_resources.pod import Pod
from tests.model_explainability.lm_eval.utils import verify_lmevaljob_running, save_pod_logs
from utilities.constants import Timeout
from simple_logger.logger import get_logger

LOGGER = get_logger(name=__name__)

@pytest.mark.gpu
@pytest.mark.parametrize(
    "model_namespace, task_name",
    [
"""
    for task in tasks:
        test_code += f'''
        pytest.param(
            {{"name": "test-lmeval-timing-{task.replace("_", "-")}"}},
            "{task}",
            id="{task}"
        ),'''
    test_code += '''
    ],
    indirect=["model_namespace"],
)
def test_lmeval_timing_benchmark(admin_client: DynamicClient, model_namespace, task_name: str, lmevaljob_vllm_emulator: LMEvalJob, lmevaljob_vllm_emulator_pod: Pod):
    """Timing benchmark test - measures execution time for representative tasks"""
    start_time = time.time()
    lmevaljob_vllm_emulator.update(task_list={"taskNames": [task_name], "sample_size": "1%"})
    verify_lmevaljob_running(client=admin_client, lmevaljob=lmevaljob_vllm_emulator)
    execution_time = time.time() - start_time
    log_file = save_pod_logs(
        pod=lmevaljob_vllm_emulator_pod,
        task_name=task_name,
        namespace=model_namespace.name,
    )
    LOGGER.info(f"Task {task_name} took {execution_time:.2f} seconds with 1% sampling. Logs saved to {log_file}")
'''
    output_dir = Path(__file__).parent.parent
    with open(output_dir / "test_lm_eval_timing.py", "w") as f:
        f.write(test_code)
    print(f" 🧪 Test code: {output_dir / 'test_lm_eval_timing.py'}")


🛠️ Refactor suggestion

Refactor test code generation for better maintainability.

The string concatenation approach for generating test code is hard to maintain. Consider using a template engine or at least break it into smaller, more readable chunks.

Consider using a template-based approach:

 def generate_test_code(tasks):
     """Generate pytest code for the timing benchmark"""
-    test_code = """# Timing benchmark test for LMEval systematic testing
-import pytest
-import time
-from kubernetes.dynamic import DynamicClient
-from ocp_resources.pod import Pod
-from tests.model_explainability.lm_eval.utils import verify_lmevaljob_running, save_pod_logs
-from utilities.constants import Timeout
-from simple_logger.logger import get_logger
-
-LOGGER = get_logger(name=__name__)
-
-@pytest.mark.gpu
-@pytest.mark.parametrize(
-    "model_namespace, task_name",
-    [
-"""
+    # Use template approach or break into logical sections
+    header = _generate_test_header()
+    parametrize_section = _generate_parametrize_section(tasks)
+    test_function = _generate_test_function()
+    test_code = header + parametrize_section + test_function

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Flake8 (7.2.0)

[error] 132-132: expected 2 blank lines, found 1

(E302)


[error] 137-137: f-string is missing placeholders

(F541)


[error] 144-144: f-string is missing placeholders

(F541)


[error] 145-145: f-string is missing placeholders

(F541)


[error] 147-147: f-string is missing placeholders

(F541)


[error] 148-148: f-string is missing placeholders

(F541)

🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/select_benchmark_tasks.py around
lines 103 to 149, the test code generation uses extensive string concatenation
which reduces maintainability and readability. Refactor this by using a template
engine like Jinja2 or Python's built-in string.Template to separate the static
test code structure from the dynamic task data. Alternatively, break the code
into smaller string chunks and join them cleanly instead of appending to a
single string, improving clarity and ease of updates.
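A minimal sketch of the string.Template approach the review suggests. The pytest.param shape mirrors the generated test above, but the template and helper names here are illustrative, not the PR's code:

```python
from string import Template

# Illustrative template for one parametrize entry; $task and $dashed are
# substituted per task. Not the PR's actual implementation.
PARAM_TMPL = Template(
    '    pytest.param({"name": "test-lmeval-timing-$dashed"}, "$task", id="$task"),'
)

def render_params(tasks: list[str]) -> str:
    """Render the parametrize entries for a list of task names."""
    return "\n".join(
        PARAM_TMPL.substitute(task=t, dashed=t.replace("_", "-")) for t in tasks
    )

rendered = render_params(["arc_easy", "hellaswag"])
print(rendered)
```

Keeping the static skeleton in templates and only the per-task entries in a loop makes the generator diff-friendly and easier to update than one long concatenated string.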

Comment on lines +15 to +103
def parse_configmap(
    github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
):
    """Parse the ConfigMap from GitHub to extract LMEval tasks"""
    try:
        # Fetch YAML content from GitHub
        response = requests.get(github_url)
        response.raise_for_status()  # Raise exception for bad status codes
        configmap = yaml.safe_load(response.text)

        print(f"✅ Successfully loaded ConfigMap from: {github_url}")

        # Print structure to understand the format
        print("\n📋 ConfigMap Structure:")
        print(f" Kind: {configmap.get('kind')}")
        print(f" Metadata name: {configmap.get('metadata', {}).get('name')}")

        data_keys = list(configmap.get("data", {}).keys())
        print(f" Data keys: {data_keys}")

        # Extract task information
        tasks = []
        task_details = []
        data = configmap.get("data", {})

        print(f"\n🔍 Found data keys: {list(data.keys())}")

        if "tasks" in data:
            tasks_json = data["tasks"]
            print(f" 📝 Tasks data type: {type(tasks_json)}")
            print(f" 📝 Tasks data preview: {tasks_json[:300]}...")

            try:
                # Parse the JSON string
                parsed_data = json.loads(tasks_json)
                print(" ✅ Successfully parsed JSON")
                print(f" 📊 Top-level keys: {list(parsed_data.keys())}")

                # Extract tasks from "lm-evaluation-harness" array
                if "lm-evaluation-harness" in parsed_data:
                    lm_eval_tasks = parsed_data["lm-evaluation-harness"]
                    print(f" 📋 Found {len(lm_eval_tasks)} lm-evaluation-harness tasks")

                    for task_obj in lm_eval_tasks:
                        if isinstance(task_obj, dict) and "name" in task_obj:
                            task_name = task_obj["name"]
                            task_desc = task_obj.get("description", "No description")
                            tasks.append(task_name)
                            task_details.append({"name": task_name, "description": task_desc})

                # Check for other task categories
                for key, value in parsed_data.items():
                    if key != "lm-evaluation-harness" and isinstance(value, list):
                        print(f" 📋 Found additional category '{key}' with {len(value)} tasks")
                        for task_obj in value:
                            if isinstance(task_obj, dict) and "name" in task_obj:
                                task_name = task_obj["name"]
                                task_desc = task_obj.get("description", "No description")
                                tasks.append(task_name)
                                task_details.append({"name": task_name, "description": task_desc, "category": key})

            except json.JSONDecodeError as e:
                print(f" ❌ JSON parsing failed: {e}")
                # Fallback to regex extraction
                name_matches = re.findall(r'"name":\s*"([^"]+)"', tasks_json)
                tasks.extend(name_matches)
                print(f" 🔄 Fallback: extracted {len(name_matches)} task names using regex")

        # Clean and deduplicate tasks
        clean_tasks = [task.strip() for task in tasks if isinstance(task, str) and task.strip()]
        unique_tasks = sorted(set(clean_tasks))

        print(f"\n✅ Extracted {len(unique_tasks)} unique tasks from ConfigMap")

        # Show categories if found
        categories = {detail["category"] for detail in task_details if "category" in detail}
        if categories:
            print(f" 📂 Task categories found: {', '.join(categories)}")

        return unique_tasks, task_details, configmap

    except requests.exceptions.RequestException as e:
        print(f"❌ Failed to fetch ConfigMap from GitHub: {e}")
        print(f"💡 Ensure the GitHub URL is accessible: {github_url}")
        return None, None, None
    except yaml.YAMLError as e:
        print(f"❌ Error parsing YAML: {e}")
        return None, None, None


🛠️ Refactor suggestion

Refactor function to reduce complexity.

The parse_configmap function is too complex with excessive nesting and multiple responsibilities. Consider breaking it into smaller functions.

Consider refactoring into separate functions:

def fetch_configmap_data(github_url: str) -> dict:
    """Fetch and parse YAML from GitHub"""
    # URL validation and fetching logic

def extract_tasks_from_json(tasks_json: str) -> tuple:
    """Extract tasks from JSON string"""
    # JSON parsing and task extraction logic

def parse_configmap(github_url: str = "...") -> tuple:
    """Main orchestration function"""
    configmap = fetch_configmap_data(github_url)
    tasks, task_details = extract_tasks_from_json(configmap["data"]["tasks"])
    return clean_and_deduplicate_tasks(tasks), task_details, configmap
🧰 Tools
🪛 Flake8 (7.2.0)

[error] 47-47: f-string is missing placeholders

(F541)

🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/parse_configmap.py from lines 15 to
103, the parse_configmap function is overly complex with deep nesting and
multiple responsibilities. Refactor by splitting it into smaller functions: one
to fetch and parse the YAML configmap from GitHub (fetch_configmap_data),
another to extract tasks from the JSON string in the configmap data
(extract_tasks_from_json), and a main orchestration function (parse_configmap)
that calls these helpers and returns cleaned, deduplicated tasks along with task
details and the configmap. This will improve readability and maintainability.

Comment on lines +15 to +23
def parse_configmap(
    github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
):
    """Parse the ConfigMap from GitHub to extract LMEval tasks"""
    try:
        # Fetch YAML content from GitHub
        response = requests.get(github_url)
        response.raise_for_status()  # Raise exception for bad status codes
        configmap = yaml.safe_load(response.text)

⚠️ Potential issue

Add URL validation and security considerations.

Fetching data from external URLs poses security risks. Consider adding URL validation and timeout settings.

 def parse_configmap(
     github_url: str = "https://raw.githubusercontent.com/ruivieira/trustyai-service-operator/RHOAIENG-26883/config/configmaps/trustyai-lmeval-tasks.yaml",
 ):
     """Parse the ConfigMap from GitHub to extract LMEval tasks"""
     try:
+        # Validate URL is from expected domain
+        if not github_url.startswith("https://raw.githubusercontent.com/"):
+            raise ValueError(f"URL must be from GitHub raw content: {github_url}")
+        
         # Fetch YAML content from GitHub
-        response = requests.get(github_url)
+        response = requests.get(github_url, timeout=30)
         response.raise_for_status()  # Raise exception for bad status codes
🤖 Prompt for AI Agents
In tests/model_explainability/lm_eval/utils/parse_configmap.py around lines 15
to 23, the function fetches data from an external URL without validating the URL
or setting a timeout, which can cause security and reliability issues. Add
validation to ensure the github_url is a valid and expected URL format, and
include a timeout parameter in the requests.get call to prevent hanging
requests. This will improve security and robustness when fetching the ConfigMap.

@adolfo-ab adolfo-ab added the wip label Jun 26, 2025
@artemsa223 artemsa223 marked this pull request as draft June 26, 2025 17:42
@adolfo-ab adolfo-ab closed this Jul 8, 2025
