
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Nov 12, 2025

This PR removes implicit reliance on global parallel state within the Bridge training stack by introducing an explicit ProcessGroupCollection flow from model construction through training steps. Model providers, steps, and utilities now derive and thread pg_collection directly, improving correctness, testability, and future extensibility (e.g., multiple model instances or heterogeneous groups).

Key Changes

  • Introduce get_pg_collection(model) utility to retrieve the ProcessGroupCollection from wrapped models.
    • Added: src/megatron/bridge/training/utils/pg_utils.py
  • Update ModelProvider to construct and pass pg_collection:
    • ModelProviderMixin.provide_distributed_model(...) initializes model-parallel (if needed), retrieves pg_collection via ProcessGroupCollection.use_mpu_process_groups(), composes hooks, and passes pg_collection into model creation.
    • get_model(...) now requires keyword-only pg_collection: ProcessGroupCollection and uses it for pipeline/tensor parallel stage decisions, mixed precision wrapping, DDP/FSDP wrapping, and rank-scoped logging.
    • File: src/megatron/bridge/models/model_provider.py
  • GPT and VLM training steps now retrieve pg_collection from the model instead of using global parallel_state:
    • GPT: src/megatron/bridge/training/gpt_step.py
    • VLM: src/megatron/bridge/training/vlm_step.py (also adds robust padding/truncation for PP/no-PP shapes and includes visual_inputs handling)
  • Main training loop and step APIs thread pg_collection explicitly:
    • train.train(..., pg_collection=...) and train_step(..., pg_collection=...)
    • Uses pg_collection for reductions, DP/MP checks, PP stage gating, and CUDA graph helpers
    • File: src/megatron/bridge/training/train.py
  • Misc training utilities align to the explicit pg_collection design:
    • Batch shaping and PP/CP decisions via pg_collection.pp/pg_collection.cp
    • File: src/megatron/bridge/training/utils/batch_utils.py
  • Provider updates for GPT/T5/Gemma/Mamba to adapt to new provider/get_model signatures.
    • Files under src/megatron/bridge/models/*_provider.py

Rationale

  • Eliminates hidden coupling to global parallel_state
  • Enables clearer contracts and unit tests (injectable pg_collection)
  • Prepares for more advanced process group topologies while preserving current MPU defaults
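
The testability point can be illustrated with a toy step function. Everything below is hypothetical: `FakeGroup`, `FakePGCollection`, `is_pipeline_last_stage`, and `toy_step` are test scaffolding, not the real `gpt_step`/`vlm_step` code; the `rank`/`size` fields mimic what a `torch.distributed` process group exposes. The point is that a step gated on an injected `pg_collection` needs no distributed initialization to unit-test.

```python
# Illustrative only: why an injectable pg_collection simplifies unit tests.
from dataclasses import dataclass


@dataclass
class FakeGroup:
    rank: int  # this rank's index within the group
    size: int  # total ranks in the group


@dataclass
class FakePGCollection:
    pp: FakeGroup  # pipeline-parallel group
    cp: FakeGroup  # context-parallel group


def is_pipeline_last_stage(pg_collection) -> bool:
    # PP-stage gating derived from the injected collection rather than
    # global parallel_state, so tests can fabricate any topology.
    return pg_collection.pp.rank == pg_collection.pp.size - 1


def toy_step(batch, pg_collection):
    # In this toy, only the last pipeline stage computes a loss.
    if is_pipeline_last_stage(pg_collection):
        return sum(batch) / len(batch)
    return None


last = FakePGCollection(pp=FakeGroup(rank=3, size=4), cp=FakeGroup(0, 1))
mid = FakePGCollection(pp=FakeGroup(rank=1, size=4), cp=FakeGroup(0, 1))
assert toy_step([1.0, 2.0, 3.0], last) == 2.0  # last stage returns the loss
assert toy_step([1.0, 2.0, 3.0], mid) is None  # intermediate stage does not
```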

Breaking/Behavior Changes

  • get_model(...) now requires pg_collection (keyword-only). Typical users calling ModelProviderMixin.provide_distributed_model(...) are unaffected, as the provider constructs and supplies it.
  • train.train(...) and train_step(...) now take a pg_collection argument. Callers that invoked these directly must pass the collection (obtainable via get_pg_collection(model)).
  • GPT/VLM steps derive PP/CP roles via pg_collection rather than global state.
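
For callers migrating to the new signatures, the keyword-only contract can be sketched with a toy stand-in. The real `train_step` in `src/megatron/bridge/training/train.py` takes a fuller argument list; the reduced signature and `forward` function below are assumptions for illustration only.

```python
# Hypothetical sketch of the new keyword-only pg_collection contract.
def train_step(forward_step_func, data_iterator, model, *, pg_collection):
    """Toy stand-in: the collection must now be passed explicitly."""
    return forward_step_func(data_iterator, model, pg_collection)


def forward(data_iterator, model, pg_collection):
    # Toy forward step: just consume one item from the iterator.
    return next(data_iterator)


# Old call style (relying on implicit global state) now fails fast:
try:
    train_step(forward, iter([0.5]), model=None)
    migrated = False
except TypeError:  # missing required keyword-only argument
    migrated = True

# New call style: obtain the collection (e.g. via get_pg_collection(model))
# and pass it explicitly.
loss = train_step(forward, iter([0.5]), model=None, pg_collection=object())
assert migrated and loss == 0.5
```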

yaoyu-33 and others added 24 commits October 23, 2025 12:24
Signed-off-by: yaoyu-33 <[email protected]>
# Conflicts:
#	src/megatron/bridge/training/eval.py
#	src/megatron/bridge/training/gpt_step.py
Signed-off-by: yaoyu-33 <[email protected]>
# Conflicts:
#	src/megatron/bridge/training/tensor_inspect.py
#	src/megatron/bridge/training/train.py
#	tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py
@copy-pr-bot

copy-pr-bot bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 requested a review from ananthsub November 12, 2025 22:15
@yaoyu-33
Contributor Author

/ok to test 71f43cb

Signed-off-by: yaoyu-33 <[email protected]>
@yaoyu-33
Contributor Author

/ok to test 522acef

@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci January 1, 2026 23:25 Abandoned