@fegin fegin commented Dec 18, 2025

Stack from ghstack (oldest at bottom):

As title

[ghstack-poisoned]
fegin added a commit that referenced this pull request Dec 18, 2025
As title


ghstack-source-id: 0131fe2
Pull-Request: #2162
@meta-cla meta-cla bot added the CLA Signed label Dec 18, 2025
@wwwjn wwwjn left a comment

LGTM! Please fix the lint error.

@tianyu-l tianyu-l left a comment

debugging.md is becoming huge. We could either split it into multiple files in a docs/debugging/ folder, or add a table of contents to the single file.

- Simulates multi-GPU behavior on a single shared GPU
- Executes all collectives (all-reduce, all-gather, etc.) locally without network communication
- Maintains the same code paths as distributed training for accurate debugging
- Runs only one training step by default

Why have this as the default?
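
For context on the hunk quoted above: "executes all collectives locally without network communication" can be pictured with a small pure-Python sketch. This is only an illustration under stated assumptions, not the API of the mode under review. The names `local_all_reduce` and `local_all_gather` are hypothetical, and each simulated rank is modeled as a plain list of floats held in one process.

```python
from typing import List


def local_all_reduce(per_rank: List[List[float]]) -> List[List[float]]:
    """Hypothetical helper: sum element-wise across simulated ranks.

    Every simulated rank receives its own copy of the full sum, which
    mirrors the result of an all-reduce(sum) without any network traffic.
    """
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def local_all_gather(per_rank: List[List[float]]) -> List[List[float]]:
    """Hypothetical helper: concatenate shards in rank order.

    Every simulated rank receives its own copy of the concatenation.
    """
    gathered = [x for shard in per_rank for x in shard]
    return [list(gathered) for _ in per_rank]


if __name__ == "__main__":
    # Four simulated ranks living in a single process.
    shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(local_all_reduce(shards)[0])  # [16.0, 20.0]
    print(local_all_gather(shards)[0])
```

Because all "ranks" live in the same process, the result of each collective can be inspected directly in a debugger, which is the property the quoted bullets describe.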

- Runs only one training step by default

**When to use it:**
- Debugging distributed training logic (FSDP, TP, PP, CP, EP) with data dependencies without multi-GPU setup

Add a note that FSDP doesn't work today?
