Skip to content

Conversation

@AWarno
Copy link
Contributor

@AWarno AWarno commented Dec 18, 2025

  1. Fix the Slurm health URL.
  2. Add NIM examples that require these health endpoints.

Summary by CodeRabbit

  • New Features

    • Added example configurations for deploying NIM evaluators locally and on SLURM clusters, including pre-configured evaluation tasks (ifeval, gsm8k) and environment setup.
  • Refactor

    • Improved health check endpoint configuration for consistent resolution across different deployment scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Anna Warno <[email protected]>
@AWarno AWarno requested review from a team as code owners December 18, 2025 13:29
@AWarno AWarno changed the title Awarno/fix health url fix(health-url): fix health url Dec 18, 2025
NGC_API_KEY: ${oc.env:NGC_API_KEY} # Required for NIM container authentication
mounts:
deployment:
/path/to/nim/cache: /opt/nim/.cache # Mount NIM cache directory
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add info to instruction above that this path needs to be set

- _self_

# SLURM execution configuration
execution:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we use NIM cache for slurm?

@AWarno
Copy link
Contributor Author

AWarno commented Dec 18, 2025

@CodeRabbit review

@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link

coderabbitai bot commented Dec 18, 2025

Walkthrough

Two new example configurations for Nim-based evaluator deployment (local and SLURM) are added, alongside updates to the core Nim deployment configuration and SLURM executor to standardize health check endpoint paths from /health to /v1/health/ready.

Changes

Cohort / File(s) Summary
Example Configurations
packages/nemo-evaluator-launcher/examples/local_nim.yaml, packages/nemo-evaluator-launcher/examples/slurm_nim.yaml
New YAML example files demonstrating local and SLURM-based Nim evaluator deployment with configurations for execution, deployment, and evaluation tasks (ifeval, gsm8k).
Deployment Configuration
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/nim.yaml
Added command field (/opt/nim/start_server.sh) and updated health endpoint from /health to /v1/health/ready.
Executor Logic
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py
Updated health check path resolution across multiple methods to consistently use cfg.deployment.endpoints.health instead of cfg.deployment.health_check_path.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Review of two new example YAML files for correctness and consistency with existing patterns
  • Verify health endpoint path changes align across deployment configuration and executor logic
  • Confirm proxy configuration health check path defaults properly fall back to endpoints.health

Poem

A pair of new configs hops in today,
Local and SLURM show the Nim-powered way,
Health checks now point to /v1/health/ready,
The executor's aligned, consistent and steady,
Evaluations leap forward, they're healthy and keen! 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'fix(health-url): fix health url' is vague and repetitive, using non-descriptive language that doesn't clearly convey the specific changes made. Improve the title to be more specific and descriptive, such as 'fix(health-url): update health check endpoints in deployment and executor' to better reflect the actual changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch awarno/fix-health-url

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 359b97d and 1b68ecf.

📒 Files selected for processing (4)
  • packages/nemo-evaluator-launcher/examples/local_nim.yaml (1 hunks)
  • packages/nemo-evaluator-launcher/examples/slurm_nim.yaml (1 hunks)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/nim.yaml (1 hunks)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-10T21:34:56.965Z
Learnt from: marta-sd
Repo: NVIDIA-NeMo/Evaluator PR: 523
File: packages/nemo-evaluator-launcher/examples/local_reasoning.yaml:55-55
Timestamp: 2025-12-10T21:34:56.965Z
Learning: When editing YAML configuration in the Nemo Evaluator launcher examples, align mapping.toml section names with the actual container/image naming convention used by simple-evals (hyphenated, e.g., 'simple-evals'). This ensures consistency between the container naming and config sections. Apply this guideline to all YAML files under packages/nemo-evaluator-launcher/examples that define mapping.toml sections, and adjust section keys to match the container/framework naming convention used by the published image.

Applied to files:

  • packages/nemo-evaluator-launcher/examples/slurm_nim.yaml
  • packages/nemo-evaluator-launcher/examples/local_nim.yaml
🔇 Additional comments (7)
packages/nemo-evaluator-launcher/examples/local_nim.yaml (1)

16-27: LGTM! Clear usage instructions and warnings.

The usage instructions are well-structured and include appropriate warnings about test-only configurations. The step-by-step guide will help users configure the example correctly.

</review_comment_end>

packages/nemo-evaluator-launcher/examples/slurm_nim.yaml (1)

16-30: LGTM! Comprehensive usage instructions.

The usage instructions effectively guide users through the configuration process, including CLI argument examples for quick testing. The warnings about full evaluations are appropriate and clear.

</review_comment_end>

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (3)

646-646: LGTM! Consistent health endpoint resolution.

The health check path resolution has been properly updated to use cfg.deployment.endpoints.get("health", "/health"), which aligns with the standardized endpoint configuration in nim.yaml.

</review_comment_end>


1266-1270: LGTM! Proper fallback chain for HAProxy health checks.

The health check path resolution correctly prioritizes the proxy-specific configuration first, then falls back to the deployment endpoints configuration. The comment clearly explains the fallback behavior.

</review_comment_end>


1305-1305: LGTM! Consistent endpoint usage in HAProxy config generation.

The health check path consistently uses cfg.deployment.endpoints.get("health", "/health"), maintaining the same pattern across all health check resolution points in the file.

</review_comment_end>

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/nim.yaml (2)

29-29: Health endpoint /v1/health/ready is correct for NIM containers.

The /v1/health/ready endpoint is the standard health check endpoint for NVIDIA NIM, and this configuration aligns with official documentation across all NIM variants.


21-24: Clarify the command override contradiction.

Line 21 adds an explicit command: /opt/nim/start_server.sh, but the comment on line 23 states "NIM containers use default entrypoint - no custom command needed." In Kubernetes, specifying a command field overrides the container's default entrypoint, making these statements contradictory. Either the command field is required and the comment should be updated, or the command field is unnecessary and should be removed to use the container's default entrypoint.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants