Merged
Changes from 1 commit
1 change: 1 addition & 0 deletions examples/grpo_trainer/run_qwen3-8b_npu.sh
@@ -52,6 +52,7 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.fsdp_config.forward_prefetch=True \
     ++actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
     ++actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
+    ++actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=4096 \
     trainer.val_before_train=True \
     trainer.save_freq=5 \
     trainer.test_freq=5 \
9 changes: 8 additions & 1 deletion verl/utils/megatron/router_replay_utils.py
@@ -330,7 +330,14 @@ def get_current_rank_layer_info(tf_config, vp_rank=None):
     if vp_rank is None:
         vp_rank = 0
     num_layers_to_build = get_num_layers_to_build(tf_config, vp_stage=vp_rank)
-    offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank)
+
+    sig = inspect.signature(get_transformer_layer_offset)
+
+    if "vp_stage" in sig.parameters:
+        offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank)
+    else:
+        offset = get_transformer_layer_offset(tf_config)
+
Contributor
critical

This change introduces two issues:

  1. NameError: The inspect module is used without being imported, so the first call to get_current_rank_layer_info will raise a NameError at runtime.
  2. Performance: Calling inspect.signature() inside get_current_rank_layer_info is inefficient. This function is in a hot path (called per micro-batch), and the signature of get_transformer_layer_offset does not change at runtime, so the check only needs to run once, when the module is imported.

To fix this, please add import inspect at the top of the file and cache the result of the signature check in a module-level variable.

For example, at the top of verl/utils/megatron/router_replay_utils.py:

import inspect
from megatron.core.transformer.transformer_layer import get_transformer_layer_offset

_GET_TRANSFORMER_LAYER_OFFSET_HAS_VP_STAGE = "vp_stage" in inspect.signature(get_transformer_layer_offset).parameters

Then, you can use this cached variable here.

Suggested change
-    sig = inspect.signature(get_transformer_layer_offset)
-    if "vp_stage" in sig.parameters:
-        offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank)
-    else:
-        offset = get_transformer_layer_offset(tf_config)
+    if _GET_TRANSFORMER_LAYER_OFFSET_HAS_VP_STAGE:
+        offset = get_transformer_layer_offset(tf_config, vp_stage=vp_rank)
+    else:
+        offset = get_transformer_layer_offset(tf_config)
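
The check-once pattern can be tried in isolation, without Megatron installed. The sketch below is only an illustration: offset_old, offset_new, make_layer_offset and the config dictionary are hypothetical stand-ins, not part of verl or Megatron-Core. It uses a small factory instead of a module-level flag so that both signature variants can be exercised in one script, but the idea is the same: inspect.signature() runs once, and the per-call path only reads a cached boolean.

```python
import inspect


def offset_old(config):
    """Hypothetical stand-in for an older API without the vp_stage keyword."""
    return config["base"]


def offset_new(config, vp_stage=None):
    """Hypothetical stand-in for a newer API that accepts vp_stage."""
    return config["base"] + (vp_stage or 0) * config["layers_per_stage"]


def make_layer_offset(fn):
    """Inspect fn's signature once; return a wrapper that passes vp_stage only if supported."""
    # The signature check runs here, a single time, not on every call.
    has_vp_stage = "vp_stage" in inspect.signature(fn).parameters

    def layer_offset(config, vp_rank=0):
        if has_vp_stage:
            return fn(config, vp_stage=vp_rank)
        return fn(config)

    return layer_offset


# Illustrative values only.
config = {"base": 4, "layers_per_stage": 2}
layer_offset_old = make_layer_offset(offset_old)
layer_offset_new = make_layer_offset(offset_new)

print(layer_offset_old(config, vp_rank=1))
print(layer_offset_new(config, vp_rank=1))
```

In the real module, the same effect comes from evaluating the check at import time and storing it in _GET_TRANSFORMER_LAYER_OFFSET_HAS_VP_STAGE, as suggested above.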

     local = {}
     local["start"] = offset
     local["end"] = offset + num_layers_to_build