feat(vllm): add model_or_checkpoint param to vllm deployment #566

AdamRajfer · 2025-12-16T16:29:24Z

Summary by CodeRabbit

New Features
- Enhanced model selection logic for vLLM with improved fallback configuration options.
Chores
- Added deprecation notice for checkpoint configuration path; users should migrate to the model_or_checkpoint approach for future compatibility.


Signed-off-by: Adam Rajfer <[email protected]>

copy-pr-bot · 2025-12-16T16:29:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

AWarno · 2025-12-16T16:41:00Z

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py

+            warnings.warn(
+                "cfg.deployment.checkpoint_path will be deprecated in future versions. "
+                "Please use cfg.deployment.model_or_checkpoint instead and set it to "
+                "the path to the checkpoint inside the container. Remember to add the checkpoint "
+                "to the mounts list in the execution.mounts.deployment section as well.",


We usually add this to the tests, e.g. https://github.com/NVIDIA-NeMo/Evaluator/actions/runs/20020951800/job/57407581252#step:3:4163
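A test along these lines could cover the warning. This is a minimal sketch using only the standard library: `warn_if_legacy_checkpoint_path` is a hypothetical stand-in for the executor branch in the diff above, not the launcher's real function, and the message is shortened for illustration.

```python
import warnings


def warn_if_legacy_checkpoint_path(deployment_cfg: dict) -> None:
    """Hypothetical stand-in for the executor branch that emits the warning."""
    if deployment_cfg.get("checkpoint_path"):
        warnings.warn(
            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
            "Please use cfg.deployment.model_or_checkpoint instead.",
            category=DeprecationWarning,
            stacklevel=2,
        )


def collect_deprecation_warnings(deployment_cfg: dict) -> list:
    """Run the helper and return the DeprecationWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # DeprecationWarning is filtered out by default in many contexts,
        # so force every warning to be recorded.
        warnings.simplefilter("always")
        warn_if_legacy_checkpoint_path(deployment_cfg)
    return [w for w in caught if issubclass(w.category, DeprecationWarning)]


def test_legacy_checkpoint_path_warns():
    caught = collect_deprecation_warnings({"type": "vllm", "checkpoint_path": "/data/ckpt"})
    assert len(caught) == 1
    assert "model_or_checkpoint" in str(caught[0].message)


def test_new_field_does_not_warn():
    caught = collect_deprecation_warnings({"type": "vllm", "model_or_checkpoint": "/checkpoint"})
    assert caught == []
```

In a pytest suite the same check is usually written with `pytest.warns(DeprecationWarning)`; the stdlib form is used here only to keep the sketch dependency-free.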

AWarno · 2025-12-16T16:52:27Z

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml

 #
 type: vllm
 image: vllm/vllm-openai:latest
-checkpoint_path: ???


Comment: checkpoint_path is now a common argument for all deployments, so we should probably keep it unified.

marta-sd · 2025-12-16T17:59:34Z

@CodeRabbit review

coderabbitai · 2025-12-16T17:59:40Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai · 2025-12-16T18:03:16Z

Walkthrough

The changes introduce a deprecation pathway for checkpoint configuration in vLLM deployments. The vllm.yaml file now uses a fallback selection logic (attempting model_or_checkpoint before falling back to hf_model_handle), while the Slurm executor emits a deprecation warning when the legacy checkpoint_path is used, encouraging migration to model_or_checkpoint with /checkpoint mounting.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration Updates `packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml` | Modified command line argument resolution to use nested `oc.select` for model selection, prioritizing `deployment.model_or_checkpoint` with fallback to `deployment.hf_model_handle` |
| Deprecation Warnings `packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py` | Added `DeprecationWarning` when `cfg.deployment.checkpoint_path` is provided, guiding users to migrate to `cfg.deployment.model_or_checkpoint` with `/checkpoint` mounting |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

- Configuration update follows a straightforward fallback pattern with clear intent
- Deprecation warning is a standard, isolated addition with no behavioral changes to mounting logic
- Both changes are localized and consistent in their migration guidance pattern

Poem

🐰 A checkpoint path once stood so tall,
Now whispers softly down the hall—
"Migrate to newer ways," it sighs,
With fallback logic, safe and wise,
The path evolves, yet still complies! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a model_or_checkpoint parameter to the vLLM deployment configuration, which is reflected in both the vllm.yaml and executor.py modifications. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (1)

605-612: Clarify the deprecation message with concrete migration example.

The deprecation warning describes the migration path, but could be clearer about the exact steps. Consider adding a concrete example:

Apply this diff to improve the deprecation message:

         warnings.warn(
-            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
-            "Please use cfg.deployment.model_or_checkpoint instead and set it to "
-            "the path to the checkpoint inside the container. Remember to add the checkpoint "
-            "to the mounts list in the execution.mounts.deployment section as well.",
+            "cfg.deployment.checkpoint_path will be deprecated in future versions. "
+            "Migration: (1) Set cfg.deployment.model_or_checkpoint='/checkpoint', "
+            "(2) Add mount in cfg.execution.mounts.deployment: {'/host/checkpoint/path': '/checkpoint:ro'}. "
+            f"Current checkpoint_path '{checkpoint_path}' is being mounted to /checkpoint for backward compatibility.",
             category=DeprecationWarning,
             stacklevel=2,
         )

Based on past review comment at lines 605-609, consider adding tests for this deprecation warning.
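The two migration steps named in the suggested message can be sketched as a small config rewrite. This assumes the config is a plain nested dict and that mounts are a host-path to container-path mapping, as in the suggested message; `migrate_legacy_checkpoint` is a hypothetical helper, not part of the launcher.

```python
from copy import deepcopy


def migrate_legacy_checkpoint(cfg: dict) -> dict:
    """Return a copy of cfg with deployment.checkpoint_path rewritten to the new layout.

    Assumes cfg is shaped like
    {"deployment": {...}, "execution": {"mounts": {"deployment": {...}}}}.
    """
    new_cfg = deepcopy(cfg)
    deployment = new_cfg.setdefault("deployment", {})
    host_path = deployment.pop("checkpoint_path", None)
    if host_path is None:
        return new_cfg  # nothing to migrate

    # Step 1: point the deployment at the in-container path.
    deployment["model_or_checkpoint"] = "/checkpoint"

    # Step 2: mount the host checkpoint read-only at that path.
    mounts = new_cfg.setdefault("execution", {}).setdefault("mounts", {})
    mounts.setdefault("deployment", {})[host_path] = "/checkpoint:ro"
    return new_cfg


legacy = {"deployment": {"type": "vllm", "checkpoint_path": "/host/checkpoint/path"}}
migrated = migrate_legacy_checkpoint(legacy)
print(migrated)
```

The input is left untouched, so the legacy and migrated configs can be compared side by side.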

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03e8eec and 1bd0d6e.

📒 Files selected for processing (2)

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml (1 hunks)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (1 hunks)

🔇 Additional comments (1)

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml (1)

32-32: Nested oc.select resolver syntax is valid.

The code uses a supported OmegaConf 2.1+ pattern where nested resolvers can chain fallback logic. The syntax correctly implements the three-level fallback (model_or_checkpoint → hf_model_handle → /checkpoint) and matches documented patterns.
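The selection order described here can be illustrated in plain Python. This is an analogue of the nested `oc.select` expression, not OmegaConf itself: the `select` helper below is hypothetical and treats only absent keys as missing, whereas OmegaConf's handling of `None` or `???` values may differ.

```python
from collections.abc import Mapping
from typing import Any

_MISSING = object()


def select(cfg: Mapping, dotted_key: str, default: Any = _MISSING) -> Any:
    """Look up a dotted key in a nested mapping; return default if any part is absent."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            if default is _MISSING:
                raise KeyError(dotted_key)
            return default
        node = node[part]
    return node


def resolve_served_model(cfg: Mapping) -> Any:
    """Mirror the fallback chain: model_or_checkpoint, then hf_model_handle, then /checkpoint."""
    # The inner lookup is evaluated first, just as the inner interpolation
    # supplies the default for the outer one.
    inner = select(cfg, "deployment.hf_model_handle", "/checkpoint")
    return select(cfg, "deployment.model_or_checkpoint", inner)
```

With both fields set, `model_or_checkpoint` wins; with only `hf_model_handle` set, the handle is used; with neither, the result is `/checkpoint`.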

marta-sd

I like the idea of better separating deployment from execution, but your PR introduces several inconsistencies. Let's take a step back and make sure we don't break other things when addressing this problem.

marta-sd · 2025-12-18T07:34:08Z

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml

  health: /health

-command: vllm serve ${oc.select:deployment.hf_model_handle,/checkpoint}
+command: vllm serve ${oc.select:deployment.model_or_checkpoint,${oc.select:deployment.hf_model_handle,/checkpoint}}


There are several problems with this change:

- I don't see the new field anywhere in the default config. How are users supposed to know which parameter they need to specify?
- SGLang has identical logic for checkpoint/model handle selection, and you haven't updated it.
- What about trt-llm and nim? We still have the checkpoint_path parameter there, and the code in the executor is shared by all deployment methods. To make things worse, with the change you propose for the Slurm executor, users will get a deprecation warning with no available path to update.
- Why do we still have hf_model_handle here? Shouldn't model_or_checkpoint fully replace it?

marta-sd · 2025-12-18T07:37:20Z

packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py

    if cfg.deployment.type != "none":
        if checkpoint_path := cfg.deployment.get("checkpoint_path"):
+            warnings.warn(
+                "cfg.deployment.checkpoint_path will be deprecated in future versions. "


This change is very problematic: you're deprecating a deployment parameter for only one executor, which breaks deployment-execution separation. Also, what's the reason for that? Why do we want to treat Slurm differently? The mounts logic for the local executor is identical, so I don't see any reason to diverge.
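One possible shape for an executor-agnostic check is sketched below. It is not what the PR does: every name is hypothetical, and the set of deployment types that accept model_or_checkpoint is an assumption based on this PR touching only vllm.yaml.

```python
import warnings

# Assumption for illustration: deployment types that already accept
# model_or_checkpoint. In this PR only vllm.yaml was updated.
TYPES_WITH_MODEL_OR_CHECKPOINT = frozenset({"vllm"})


def check_legacy_checkpoint_path(deployment_cfg: dict) -> bool:
    """Warn if checkpoint_path is set and a migration path exists.

    Returns True when a warning was emitted.
    """
    if not deployment_cfg.get("checkpoint_path"):
        return False
    if deployment_cfg.get("type") not in TYPES_WITH_MODEL_OR_CHECKPOINT:
        # No replacement field for this deployment type yet, so stay silent.
        return False
    warnings.warn(
        "deployment.checkpoint_path will be deprecated in future versions; "
        "use deployment.model_or_checkpoint instead.",
        category=DeprecationWarning,
        stacklevel=3,  # skip this helper and the executor hook that called it
    )
    return True


def slurm_prepare(deployment_cfg: dict) -> bool:
    """Hypothetical Slurm executor hook."""
    return check_legacy_checkpoint_path(deployment_cfg)


def local_prepare(deployment_cfg: dict) -> bool:
    """Hypothetical local executor hook; shares the same check."""
    return check_legacy_checkpoint_path(deployment_cfg)
```

Keeping the check in one deployment-level helper means both executors behave the same, and deployment types without a replacement field do not produce a warning that users cannot act on.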

Add model_or_checkpoint param to vllm deployment

1bd0d6e

Signed-off-by: Adam Rajfer <[email protected]>

AdamRajfer requested review from a team as code owners December 16, 2025 16:29

github-actions bot added the nemo-evaluator-launcher label Dec 16, 2025

AWarno reviewed Dec 16, 2025


coderabbitai bot reviewed Dec 16, 2025


marta-sd requested changes Dec 18, 2025
