-
Notifications
You must be signed in to change notification settings - Fork 19
feat(vllm): add model_or_checkpoint param to vllm deployment #566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Adam Rajfer <[email protected]>
| warnings.warn( | ||
| "cfg.deployment.checkpoint_path will be deprecated in future versions. " | ||
| "Please use cfg.deployment.model_or_checkpoint instead and set it to " | ||
| "the path to the checkpoint inside the container. Remember to add the checkpoint " | ||
| "to the mounts list in the execution.mounts.deployment section as well.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we usually are adding it to tests, e.g. https://github.com/NVIDIA-NeMo/Evaluator/actions/runs/20020951800/job/57407581252#step:3:4163
| # | ||
| type: vllm | ||
| image: vllm/vllm-openai:latest | ||
| checkpoint_path: ??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment: This is a common argument now for all deployments, so we should probably preserve unification.
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
WalkthroughThe changes introduce a deprecation pathway for checkpoint configuration in vLLM deployments. The vllm.yaml file now uses a fallback selection logic (attempting Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py (1)
605-612: Clarify the deprecation message with concrete migration example.The deprecation warning describes the migration path, but could be clearer about the exact steps. Consider adding a concrete example:
Apply this diff to improve the deprecation message:
warnings.warn( - "cfg.deployment.checkpoint_path will be deprecated in future versions. " - "Please use cfg.deployment.model_or_checkpoint instead and set it to " - "the path to the checkpoint inside the container. Remember to add the checkpoint " - "to the mounts list in the execution.mounts.deployment section as well.", + "cfg.deployment.checkpoint_path will be deprecated in future versions. " + "Migration: (1) Set cfg.deployment.model_or_checkpoint='/checkpoint', " + "(2) Add mount in cfg.execution.mounts.deployment: {'/host/checkpoint/path': '/checkpoint:ro'}. " + f"Current checkpoint_path '{checkpoint_path}' is being mounted to /checkpoint for backward compatibility.", category=DeprecationWarning, stacklevel=2, )Based on past review comment at lines 605-609, consider adding tests for this deprecation warning.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml(1 hunks)packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/executors/slurm/executor.py(1 hunks)
🔇 Additional comments (1)
packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/vllm.yaml (1)
32-32: Nestedoc.selectresolver syntax is valid.The code uses a supported OmegaConf 2.1+ pattern where nested resolvers can chain fallback logic. The syntax correctly implements the three-level fallback (model_or_checkpoint → hf_model_handle → /checkpoint) and matches documented patterns.
marta-sd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of better separating deployment from execution, but your PR introduces several inconsistencies. Let's take a step back and make sure we don't break other things when addressing this problem.
| health: /health | ||
|
|
||
| command: vllm serve ${oc.select:deployment.hf_model_handle,/checkpoint} | ||
| command: vllm serve ${oc.select:deployment.model_or_checkpoint,${oc.select:deployment.hf_model_handle,/checkpoint}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several problems with this change:
- I don't see the new field anywhere in the default config. How users are suppose to know what param they need to specify?
- SGlang has the identical logic for checkpoint/model handle selection and you haven't updated it
- what about trt-llm and nim? We still have
checkpoint_pathparameter there and the code in the executor is shared for all deployment methods. To make things worse, with the change you propose for slurm executor users will get a deprecation warning with no available path to update. - why do we still have
hf_model_handlehere? shouldn'tmodel_or_checkpointfully replace it?
| if cfg.deployment.type != "none": | ||
| if checkpoint_path := cfg.deployment.get("checkpoint_path"): | ||
| warnings.warn( | ||
| "cfg.deployment.checkpoint_path will be deprecated in future versions. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is very problematic: you're deprecating a deployment parameter for only one executor. This breaks deployment-execution separation. Also, what's the reason for that? Why do we want to treat slurm differently? The mounts logic for local is identical, I don't see any reasons to diverge.
Summary by CodeRabbit
New Features
Chores
model_or_checkpointapproach for future compatibility.✏️ Tip: You can customize this high-level summary in your review settings.