[WIP] 2D parallelism for training #2318

joecummings · 2025-01-29T23:48:17Z

No description provided.

pytorch-bot · 2025-01-29T23:48:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2318

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 2 Cancelled Jobs

As of commit cfd31cd with merge base d4465c8:

NEW FAILURES - The following jobs have failed:

Build Docs / build_docs (3.11) (gh)
GPU tests / gpu_test (3.11, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume_from_distributed_checkpoint_multi_rank[llama3/8B_full-llama3-tune-4-1-True]
Lint / lint (3.10) (gh)
    Process completed with exit code 1.

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.10, stable) (gh)
    tests/recipes/test_full_finetune_distributed.py::TestFullFinetuneDistributedRecipe::test_training_state_on_resume_from_distributed_checkpoint_multi_rank[llama3/8B_full-llama3-tune-4-1-True]
GPU tests / gpu_test (3.9, stable) (gh)
    ##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

codecov-commenter · 2025-01-30T21:37:58Z

Codecov Report

Attention: Patch coverage is 6.66667% with 28 lines in your changes missing coverage. Please review.

Project coverage is 23.64%. Comparing base (46d8153) to head (cfd31cd).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
recipes/full_finetune_distributed.py	0.00%	15 Missing ⚠️
torchtune/training/_distributed.py	13.33%	13 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2318      +/-   ##
==========================================
- Coverage   23.87%   23.64%   -0.24%     
==========================================
  Files         359      361       +2     
  Lines       21260    21488     +228     
==========================================
+ Hits         5076     5080       +4     
- Misses      16184    16408     +224

☔ View full report in Codecov by Sentry.

SalmanMohammadi · 2025-01-31T17:04:03Z

recipes/full_finetune_distributed.py

+        # Distributed variables
+        self.world_size, self.rank = utils.get_world_size_and_rank()
+        self._is_rank_zero = self.rank == 0
+        self.parallelize_plan = config.instantiate(cfg.get("parallelize_plan"))


FYI the default parallelize plan we have didn't work when I was using it with the smaller Llama3 models with tied embeddings.
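
The thread does not say why the default plan failed. One cheap way to catch a plan/model mismatch before applying it, assuming the plan is a dict keyed by submodule name (a form that `torch.distributed.tensor.parallel.parallelize_module` accepts), is to compare the plan keys against the model's submodule names. The sketch below is illustrative only: `unmatched_plan_keys`, the plan entries, and the module names are hypothetical, and the plan values are placeholder strings rather than real `ParallelStyle` objects.

```python
from typing import Dict, Iterable, List


def unmatched_plan_keys(plan: Dict[str, str], module_names: Iterable[str]) -> List[str]:
    """Return the plan keys that do not name any submodule of the model."""
    known = set(module_names)
    return sorted(key for key in plan if key not in known)


# Hypothetical plan: keys are submodule names, values stand in for parallel styles.
plan = {
    "tok_embeddings": "rowwise",
    "output": "colwise",
    "layers.0.attn.q_proj": "colwise",
}

# Assumption for illustration: a model with tied embeddings reuses the embedding
# weight for the output projection and registers no separate "output" submodule.
tied_model_modules = ["tok_embeddings", "layers.0.attn.q_proj"]

missing = unmatched_plan_keys(plan, tied_model_modules)
print(missing)
```

With a real model, the names could come from `[name for name, _ in model.named_modules()]`. A non-empty result would suggest the plan needs a variant for that architecture; it does not by itself confirm the cause of the failure reported here.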

Initial commit

df17170

facebook-github-bot added the CLA Signed label Jan 29, 2025

joecummings added 2 commits January 29, 2025 16:10

Updates

6e8041d

World SIZEEE

cfd31cd

SalmanMohammadi reviewed Jan 31, 2025

acisseJZhong mentioned this pull request Feb 2, 2025

TP + FSDP distributed training (full finetuning) #2330

Merged

13 tasks

joecummings closed this Feb 10, 2025
