@AdamRajfer AdamRajfer commented Dec 16, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced model selection logic for vLLM with improved fallback configuration options.
  • Chores

    • Added deprecation notice for checkpoint configuration path; users should migrate to the model_or_checkpoint approach for future compatibility.


@AdamRajfer AdamRajfer requested review from a team as code owners December 16, 2025 16:29
copy-pr-bot bot commented Dec 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment on lines +605 to +609
warnings.warn(
"cfg.deployment.checkpoint_path will be deprecated in future versions. "
"Please use cfg.deployment.model_or_checkpoint instead and set it to "
"the path to the checkpoint inside the container. Remember to add the checkpoint "
"to the mounts list in the execution.mounts.deployment section as well.",

#
type: vllm
image: vllm/vllm-openai:latest
checkpoint_path: ???

Comment: This is a common argument now for all deployments, so we should probably preserve unification.

@marta-sd
Contributor

@CodeRabbit review

coderabbitai bot commented Dec 16, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot commented Dec 16, 2025

Walkthrough

The changes introduce a deprecation pathway for checkpoint configuration in vLLM deployments. The vllm.yaml file now uses a fallback selection logic (attempting model_or_checkpoint before falling back to hf_model_handle), while the Slurm executor emits a deprecation warning when the legacy checkpoint_path is used, encouraging migration to model_or_checkpoint with /checkpoint mounting.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration Updates: packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml | Modified command line argument resolution to use nested oc.select for model selection, prioritizing deployment.model_or_checkpoint with fallback to deployment.hf_model_handle |
| Deprecation Warnings: packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py | Added DeprecationWarning when cfg.deployment.checkpoint_path is provided, guiding users to migrate to cfg.deployment.model_or_checkpoint with /checkpoint mounting |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Configuration update follows a straightforward fallback pattern with clear intent
  • Deprecation warning is a standard, isolated addition with no behavioral changes to mounting logic
  • Both changes are localized and consistent in their migration guidance pattern

Poem

🐰 A checkpoint path once stood so tall,
Now whispers softly down the hall—
"Migrate to newer ways," it sighs,
With fallback logic, safe and wise,
The path evolves, yet still complies! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a model_or_checkpoint parameter to the vLLM deployment configuration, which is reflected in both the vllm.yaml and executor.py modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (1)

605-612: Clarify the deprecation message with concrete migration example.

The deprecation warning describes the migration path, but could be clearer about the exact steps. Consider adding a concrete example:

Apply this diff to improve the deprecation message:

         warnings.warn(
-            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
-            "Please use cfg.deployment.model_or_checkpoint instead and set it to "
-            "the path to the checkpoint inside the container. Remember to add the checkpoint "
-            "to the mounts list in the execution.mounts.deployment section as well.",
+            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
+            "Migration: (1) Set cfg.deployment.model_or_checkpoint='/checkpoint', "
+            "(2) Add mount in cfg.execution.mounts.deployment: {'/host/checkpoint/path': '/checkpoint:ro'}. "
+            f"Current checkpoint_path '{checkpoint_path}' is being mounted to /checkpoint for backward compatibility.",
             category=DeprecationWarning,
             stacklevel=2,
         )

Based on past review comment at lines 605-609, consider adding tests for this deprecation warning.
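
A test for the warning could look roughly like the sketch below. It uses only the standard library, and `legacy_checkpoint_path` is a hypothetical stand-in for the executor code path (the real function name and message are not shown in this thread), so treat it as an outline under those assumptions rather than the launcher's actual test.

```python
import warnings
from typing import Any, Mapping, Optional


def legacy_checkpoint_path(deployment: Mapping[str, Any]) -> Optional[str]:
    """Hypothetical helper: return the legacy path and warn if it is set."""
    checkpoint_path = deployment.get("checkpoint_path")
    if checkpoint_path:
        warnings.warn(
            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
            "Please use cfg.deployment.model_or_checkpoint instead.",
            category=DeprecationWarning,
            stacklevel=2,
        )
    return checkpoint_path


def test_warns_when_checkpoint_path_is_set() -> None:
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, regardless of the interpreter's default filters.
        warnings.simplefilter("always")
        assert legacy_checkpoint_path({"checkpoint_path": "/data/ckpt"}) == "/data/ckpt"
    assert len(caught) == 1
    assert issubclass(caught[0].category, DeprecationWarning)
    assert "model_or_checkpoint" in str(caught[0].message)


def test_silent_without_checkpoint_path() -> None:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        assert legacy_checkpoint_path({"type": "vllm"}) is None
    assert caught == []
```

The function names follow the pytest naming convention, but the bodies rely only on `warnings.catch_warnings`, so they can also be called directly.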

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03e8eec and 1bd0d6e.

📒 Files selected for processing (2)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml (1 hunks)
  • packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (1 hunks)
🔇 Additional comments (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml (1)

32-32: Nested oc.select resolver syntax is valid.

The code uses a supported OmegaConf 2.1+ pattern where nested resolvers can chain fallback logic. The syntax correctly implements the three-level fallback (model_or_checkpoint → hf_model_handle → /checkpoint) and matches documented patterns.
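
The fallback order can be illustrated in plain Python. This is a minimal stand-in and not OmegaConf itself: `select` and `resolve_model_arg` are hypothetical helpers that mimic `${oc.select:key,default}` under the assumption that the default is used only when the key is absent.

```python
from typing import Any, Mapping


def select(cfg: Mapping[str, Any], dotted_key: str, default: Any) -> Any:
    """Hypothetical stand-in for oc.select: walk a dotted key, else return default."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return default
        node = node[part]
    return node


def resolve_model_arg(cfg: Mapping[str, Any]) -> str:
    """Mirror the nested selection from vllm.yaml.

    Order: deployment.model_or_checkpoint, then deployment.hf_model_handle,
    then the literal /checkpoint.
    """
    inner = select(cfg, "deployment.hf_model_handle", "/checkpoint")
    return select(cfg, "deployment.model_or_checkpoint", inner)


if __name__ == "__main__":
    # Only the legacy handle is set, so the inner fallback is used.
    print(resolve_model_arg({"deployment": {"hf_model_handle": "org/model"}}))
```

The inner default is evaluated first and passed as the outer default, which is the same shape as nesting one `oc.select` inside another.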

@marta-sd marta-sd left a comment


I like the idea of better separating deployment from execution, but your PR introduces several inconsistencies. Let's take a step back and make sure we don't break other things when addressing this problem.

health: /health

- command: vllm serve ${oc.select:deployment.hf_model_handle,/checkpoint}
+ command: vllm serve ${oc.select:deployment.model_or_checkpoint,${oc.select:deployment.hf_model_handle,/checkpoint}}

There are several problems with this change:

  • I don't see the new field anywhere in the default config. How are users supposed to know which parameter they need to specify?
  • SGLang has identical logic for checkpoint/model handle selection, and you haven't updated it.
  • What about trt-llm and nim? We still have the checkpoint_path parameter there, and the executor code is shared by all deployment methods. To make things worse, with the change you propose for the Slurm executor, users will get a deprecation warning with no available migration path.
  • Why do we still have hf_model_handle here? Shouldn't model_or_checkpoint fully replace it?
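
One way to avoid warning users who have no migration path is to gate the warning on the deployment type. The sketch below is only an illustration: `TYPES_WITH_MODEL_OR_CHECKPOINT` and `maybe_warn_checkpoint_path` are hypothetical names, and the set contents assume that vllm is the only deployment type accepting model_or_checkpoint at this point.

```python
import warnings
from typing import Any, Mapping

# Assumption: only these deployment types accept model_or_checkpoint so far.
TYPES_WITH_MODEL_OR_CHECKPOINT = frozenset({"vllm"})


def maybe_warn_checkpoint_path(deployment: Mapping[str, Any]) -> bool:
    """Warn about checkpoint_path only when a replacement field exists.

    Returns True if a DeprecationWarning was emitted.
    """
    if not deployment.get("checkpoint_path"):
        return False
    if deployment.get("type") not in TYPES_WITH_MODEL_OR_CHECKPOINT:
        # No migration target for this deployment type yet, so stay silent.
        return False
    warnings.warn(
        "cfg.deployment.checkpoint_path will be deprecated in future versions. "
        "Please use cfg.deployment.model_or_checkpoint instead.",
        category=DeprecationWarning,
        stacklevel=2,
    )
    return True
```

With this shape, extending the deprecation to another deployment type only requires adding it to the set once its config supports the new field.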

if cfg.deployment.type != "none":
    if checkpoint_path := cfg.deployment.get("checkpoint_path"):
        warnings.warn(
            "cfg.deployment.checkpoint_path will be deprecated in future versions. "

This change is very problematic: you're deprecating a deployment parameter for only one executor. This breaks deployment-execution separation. Also, what's the reason for that? Why do we want to treat slurm differently? The mounts logic for local is identical, I don't see any reasons to diverge.
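
To keep the executors aligned, the checkpoint handling could live in one executor-agnostic helper that both call. The following is a rough sketch under stated assumptions: `deployment_checkpoint_mounts`, `slurm_mounts`, and `local_mounts` are hypothetical names, and the host-path to container-path dictionary is an illustrative format, not the launcher's actual mount representation.

```python
from typing import Any, Dict, Mapping


def deployment_checkpoint_mounts(deployment: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical shared helper: map the host checkpoint path to /checkpoint."""
    if deployment.get("type") == "none":
        return {}
    checkpoint_path = deployment.get("checkpoint_path")
    return {checkpoint_path: "/checkpoint"} if checkpoint_path else {}


def slurm_mounts(cfg: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical Slurm-side entry point; delegates to the shared helper."""
    return deployment_checkpoint_mounts(cfg["deployment"])


def local_mounts(cfg: Mapping[str, Any]) -> Dict[str, str]:
    """Hypothetical local-side entry point; delegates to the shared helper."""
    return deployment_checkpoint_mounts(cfg["deployment"])
```

Any deprecation logic placed inside the shared helper would then apply to every executor in the same way.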
