[Megatron] feat: Support compatibility enhancements of vp_stage#5580
[Megatron] feat: Support compatibility enhancements of vp_stage#5580ChibiQuest wants to merge 6 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces compatibility enhancements for vp_stage in Megatron, primarily by dynamically checking the signature of get_transformer_layer_offset. My review focuses on a critical bug and a performance issue in this implementation. The current code will raise a NameError due to a missing import and also introduces a performance bottleneck by repeatedly inspecting a function signature in a hot path. I've provided a suggestion to resolve both issues by performing the check once at module import time.
| sig = inspect.signature(get_transformer_layer_offset) | ||
|
|
||
| if "vp_stage" in sig.parameters: | ||
| offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank) | ||
| else: | ||
| offset = get_transformer_layer_offset(tf_config) | ||
|
|
There was a problem hiding this comment.
This change introduces two issues:
NameError: Theinspectmodule is used without being imported, which will cause a runtimeNameError.- Performance: Calling
inspect.signature()insideget_current_rank_layer_infois inefficient. This function is in a hot path (called per micro-batch), and the signature ofget_transformer_layer_offsetwill not change at runtime. This check should be performed only once when the module is imported.
To fix this, please add import inspect at the top of the file and cache the result of the signature check in a module-level variable.
For example, at the top of verl/utils/megatron/router_replay_utils.py:
import inspect
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset
_GET_TRANSFORMER_LAYER_OFFSET_HAS_VP_STAGE = "vp_stage" in inspect.signature(get_transformer_layer_offset).parametersThen, you can use this cached variable here.
| sig = inspect.signature(get_transformer_layer_offset) | |
| if "vp_stage" in sig.parameters: | |
| offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank) | |
| else: | |
| offset = get_transformer_layer_offset(tf_config) | |
| if _GET_TRANSFORMER_LAYER_OFFSET_HAS_VP_STAGE: | |
| offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank) | |
| else: | |
| offset = get_transformer_layer_offset(tf_config) | |
| sig = inspect.signature(get_transformer_layer_offset) | ||
| if "vp_stage" in sig.parameters: | ||
| layer_offset = get_transformer_layer_offset(config, vp_stage=vp_stage, pp_rank=pp_rank) | ||
| else: |
There was a problem hiding this comment.
pp_rank is not defined in TransformerConfig while megatron version == v0.12.1
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,fully_async,one_step_off,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.