
Conversation

@yaoyu-33
Contributor

@yaoyu-33 yaoyu-33 commented Nov 12, 2025

This PR removes implicit reliance on global parallel state within the Bridge training stack by introducing an explicit ProcessGroupCollection flow from model construction through training steps. Model providers, steps, and utilities now derive and thread pg_collection directly, improving correctness, testability, and future extensibility (e.g., multiple model instances or heterogeneous groups).

Key Changes

  • Introduce get_pg_collection(model) utility to retrieve the ProcessGroupCollection from wrapped models.
    • Added: src/megatron/bridge/training/utils/pg_utils.py
  • Update ModelProvider to construct and pass pg_collection:
    • ModelProviderMixin.provide_distributed_model(...) initializes model-parallel (if needed), retrieves pg_collection via ProcessGroupCollection.use_mpu_process_groups(), composes hooks, and passes pg_collection into model creation.
    • get_model(...) now requires keyword-only pg_collection: ProcessGroupCollection and uses it for pipeline/tensor parallel stage decisions, mixed precision wrapping, DDP/FSDP wrapping, and rank-scoped logging.
    • File: src/megatron/bridge/models/model_provider.py
  • GPT and VLM training steps now retrieve pg_collection from the model instead of using global parallel_state:
    • GPT: src/megatron/bridge/training/gpt_step.py
    • VLM: src/megatron/bridge/training/vlm_step.py (also adds robust padding/truncation for PP/no-PP shapes and includes visual_inputs handling)
  • Main training loop and step APIs thread pg_collection explicitly:
    • train.train(..., pg_collection=...) and train_step(..., pg_collection=...)
    • Uses pg_collection for reductions, DP/MP checks, PP stage gating, and CUDA graph helpers
    • File: src/megatron/bridge/training/train.py
  • Misc training utilities align to the explicit pg_collection design:
    • Batch shaping and PP/CP decisions via pg_collection.pp/pg_collection.cp
    • File: src/megatron/bridge/training/utils/batch_utils.py
  • Provider updates for GPT/T5/Gemma/Mamba to adapt to new provider/get_model signatures.
    • Files under src/megatron/bridge/models/*_provider.py

Rationale

  • Eliminates hidden coupling to global parallel_state
  • Enables clearer contracts and unit tests (injectable pg_collection)
  • Prepares for more advanced process group topologies while preserving current MPU defaults
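
The testability point can be illustrated with a toy step function. Everything below is hypothetical: `FakeGroup`, `FakePGCollection`, `is_pipeline_last_stage`, and `toy_step` are test scaffolding, not the real `gpt_step`/`vlm_step` code; the `rank`/`size` fields mimic what a `torch.distributed` process group exposes. The point is that a step gated on an injected `pg_collection` needs no distributed initialization to unit-test.

```python
# Illustrative only: why an injectable pg_collection simplifies unit tests.
from dataclasses import dataclass


@dataclass
class FakeGroup:
    rank: int  # this rank's index within the group
    size: int  # total ranks in the group


@dataclass
class FakePGCollection:
    pp: FakeGroup  # pipeline-parallel group
    cp: FakeGroup  # context-parallel group


def is_pipeline_last_stage(pg_collection) -> bool:
    # PP-stage gating derived from the injected collection rather than
    # global parallel_state, so tests can fabricate any topology.
    return pg_collection.pp.rank == pg_collection.pp.size - 1


def toy_step(batch, pg_collection):
    # In this toy, only the last pipeline stage computes a loss.
    if is_pipeline_last_stage(pg_collection):
        return sum(batch) / len(batch)
    return None


last = FakePGCollection(pp=FakeGroup(rank=3, size=4), cp=FakeGroup(0, 1))
mid = FakePGCollection(pp=FakeGroup(rank=1, size=4), cp=FakeGroup(0, 1))
assert toy_step([1.0, 2.0, 3.0], last) == 2.0  # last stage returns the loss
assert toy_step([1.0, 2.0, 3.0], mid) is None  # intermediate stage does not
```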

Breaking/Behavior Changes

  • get_model(...) now requires pg_collection (keyword-only). Typical users calling ModelProviderMixin.provide_distributed_model(...) are unaffected, as the provider constructs and supplies it.
  • train.train(...) and train_step(...) now take a pg_collection argument. Callers that invoked these directly must pass the collection (obtainable via get_pg_collection(model)).
  • GPT/VLM steps derive PP/CP roles via pg_collection rather than global state.
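
For callers migrating to the new signatures, the keyword-only contract can be sketched with a toy stand-in. The real `train_step` in `src/megatron/bridge/training/train.py` takes a fuller argument list; the reduced signature and `forward` function below are assumptions for illustration only.

```python
# Hypothetical sketch of the new keyword-only pg_collection contract.
def train_step(forward_step_func, data_iterator, model, *, pg_collection):
    """Toy stand-in: the collection must now be passed explicitly."""
    return forward_step_func(data_iterator, model, pg_collection)


def forward(data_iterator, model, pg_collection):
    # Toy forward step: just consume one item from the iterator.
    return next(data_iterator)


# Old call style (relying on implicit global state) now fails fast:
try:
    train_step(forward, iter([0.5]), model=None)
    migrated = False
except TypeError:  # missing required keyword-only argument
    migrated = True

# New call style: obtain the collection (e.g. via get_pg_collection(model))
# and pass it explicitly.
loss = train_step(forward, iter([0.5]), model=None, pg_collection=object())
assert migrated and loss == 0.5
```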

yaoyu-33 and others added 24 commits October 23, 2025 12:24
Signed-off-by: yaoyu-33 <[email protected]>
# Conflicts:
#	src/megatron/bridge/training/eval.py
#	src/megatron/bridge/training/gpt_step.py
Signed-off-by: yaoyu-33 <[email protected]>
# Conflicts:
#	src/megatron/bridge/training/tensor_inspect.py
#	src/megatron/bridge/training/train.py
#	tests/unit_tests/models/test_gpt_full_te_layer_autocast_spec.py
@copy-pr-bot

copy-pr-bot bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 requested a review from ananthsub November 12, 2025 22:15
@yaoyu-33
Contributor Author

/ok to test 71f43cb

Signed-off-by: yaoyu-33 <[email protected]>
@yaoyu-33
Contributor Author

/ok to test 522acef

@copy-pr-bot copy-pr-bot bot requested a deployment to nemo-ci January 1, 2026 23:25 Abandoned