docs/source/tutorials/multinode.rst
+6 -7
@@ -6,8 +6,7 @@ Multi-node finetuning
Congratulations! After years of being "GPU poor", you've worked hard, saved your hard earned Bitcoin and graduated to the
so-called **"GPU middle class"**. In many ways, your worries of yesteryear are gone (memory efficient training, who??).
-But, new problems are on the horizon for you because multi-node is a whole new beast. Come with me as I take you
-through your new life, complete with a big backyard, new car, and of course - a nice rack of H100s.
+But new problems are on the horizon for you because multi-node can be a whole new beast.
.. grid:: 2
@@ -30,14 +29,14 @@ Advantages of multi-node training
More machines means more memory! This is cool for several reasons:
1. **Bigger models**: With more memory, you can train larger models such as `Llama3.1 405B <https://ai.meta.com/blog/meta-llama-3-1/>`_, `Deepseek-V3 <https://www.deepseek.com/>`_, and more.
-2. **Longer data**: More many tasks like writing code, it's helpful to have long context lengths; however longer context length means more memory needed for activations.
+2. **Longer data**: For many fine-tuning tasks like writing code, it's helpful to have long context lengths; however, a longer context length means more memory needed for activations.
3. **Higher quality**: With more memory, you can do full parameter updates (not LoRA) and use optimizers like `AdamW <https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html>`_ (not low-precision optimizers), both of which can potentially improve the quality of your training.
4. **Faster training**: With the ability to fit more data in memory, you can use higher batch sizes *and* turn off memory optimizations like :ref:`activation checkpointing<glossary_act_ckpt>` thereby decreasing the time it takes for training to complete.
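To make the last point concrete, here is a rough, illustrative sketch (not code from torchtune; the toy block and sizes are made up) of what "turning off" activation checkpointing means: with enough memory you run a block directly and keep its activations, instead of recomputing them in the backward pass via ``torch.utils.checkpoint``.

.. code-block:: python

    # Illustrative sketch only: activation checkpointing trades compute for memory
    # by recomputing a block's activations during backward instead of storing them.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    block = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
    x = torch.randn(4, 1024, requires_grad=True)

    # Memory-constrained: discard intermediate activations, recompute them in backward.
    out_checkpointed = checkpoint(block, x, use_reentrant=False)

    # Memory-rich (e.g. multi-node): run the block directly, keep its activations,
    # and skip the extra recomputation, which is faster per step.
    out_direct = block(x)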
.. note::
-**Low inter-node bandwidth & FSDP** We utilize `Fully Sharded Data Parallel<https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`_ to distribute models over multiple devices. In order to distribute training, FSDP runs an `all-gather <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allgather>`_ operation
-for each forward pass and an all-gather plus a `scatter-reduce <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#reducescatter>`_ operation for each backwards pass. These operations (usually) block training from continuing until completed and with a slow
+**Low inter-node bandwidth & FSDP** We utilize PyTorch's **Fully Sharded Data Parallel** to distribute models over multiple devices. In order to distribute training, FSDP runs an `all-gather <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#allgather>`_ operation
+for each forward pass and an all-gather (usually) plus a `reduce-scatter <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#reducescatter>`_ operation for each backward pass. These operations (usually) block training from continuing until completed, and with a slow
inter-node connection, training speed may be reduced. For more on this, please refer to `this Github Issue <https://github.com/pytorch/pytorch/issues/102434>`_.
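To make those collectives concrete, the following is a minimal, illustrative sketch of FSDP wrapping a toy model (the module, sizes, and hyperparameters are placeholders, not the torchtune recipe); the comments mark where the all-gather and reduce-scatter described above occur.

.. code-block:: python

    # Minimal FSDP sketch (illustrative). Launch with, e.g.:
    #   torchrun --nproc_per_node=8 fsdp_sketch.py
    import os

    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    torch.distributed.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
    model = FSDP(model)  # parameters are sharded across every rank (and node)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

    x = torch.randn(8, 4096, device="cuda")
    # Forward: each FSDP unit all-gathers its full parameters right before use.
    loss = model(x).sum()
    # Backward: parameters are (usually) all-gathered again, and gradients are
    # reduce-scattered so each rank keeps only its own gradient shard.
    loss.backward()
    optim.step()
    torch.distributed.destroy_process_group()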
Training Llama3.3 70B on 2 nodes
@@ -62,7 +61,7 @@ Now that we have a downloaded model, let's check out our example SLURM bash scri
* We utilize SLURM specific commands like number of nodes, tasks, CPUs available, etc.
* We are using `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`_ and the `full_finetune_distributed <https://github.com/pytorch/torchtune/blob/main/recipes/full_finetune_distributed.py>`_ recipe to train just like on single node
-* You should consider several cluster-specific environment variables to maximize GPU utilization
+* You can consider several cluster-specific environment variables (``NCCL_BUFFSIZE``, ``NCCL_DEBUG``, ``FI_PROVIDER``, etc.) to maximize GPU utilization, aid debugging, and more.
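As a rough illustration of what torchrun and these variables look like from inside each process (a hedged sketch, not part of the tutorial's SLURM script; ``MASTER_ADDR`` and ``LOCAL_RANK`` are the standard variables torchrun sets, and the NCCL/libfabric values in the comments are only examples), a tiny sanity check like the following can be run on every node before kicking off a long job.

.. code-block:: python

    # Sanity-check sketch: run under torchrun on each node to confirm the
    # distributed environment and the inter-node NCCL path are wired up.
    import os

    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    print(
        {
            "rank": dist.get_rank(),
            "world_size": dist.get_world_size(),
            "master_addr": os.environ.get("MASTER_ADDR"),
            "NCCL_DEBUG": os.environ.get("NCCL_DEBUG"),        # e.g. INFO for verbose logs
            "NCCL_BUFFSIZE": os.environ.get("NCCL_BUFFSIZE"),  # NCCL communication buffer size
            "FI_PROVIDER": os.environ.get("FI_PROVIDER"),      # e.g. "efa" on AWS clusters
        }
    )

    # A one-element all_reduce exercises the inter-node path end to end.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    assert t.item() == dist.get_world_size()
    dist.destroy_process_group()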
.. note::
@@ -83,7 +82,7 @@ And the output of `squeue <https://slurm.schedmd.com/squeue.html>`_ should show
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 train torchtun slurm R 0:03 2 slurm-worker-[1-2]
-Once training has completed, we can follow the :ref:`instructions here<use_model_in_wild>` in order to upload our beautiful new model to the Hugging Face Hub!
+Once training has completed, which should take roughly seven minutes in total with the default config, we can follow the :ref:`instructions here<use_model_in_wild>` to upload our beautiful new model to the Hugging Face Hub!