* Update zero.md

  Update to ZeRO tutorial to specify the use of activation checkpointing

* Update zero-offload.md

  Use activation checkpointing with ZeRO-Offload

Co-authored-by: Jeff Rasley <[email protected]>
docs/_tutorials/zero-offload.md (+2, -2)
````diff
@@ -15,17 +15,17 @@ For this tutorial, we will configure a 10 billion parameter GPT-2 model using th
 We need to make changes to the Megatron-LM launch script and to the DeepSpeed configuration json.

 ### Megatron-LM GPT-2 launch script changes
-We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model, which can be achieved by the following set of changes:
+We need to apply two changes to the launch script for the DeepSpeed Megatron-LM GPT-2 model. The first change is to configure a 10B parameter GPT-2 model with activation checkpointing enabled, which can be achieved by the following set of changes:

 ```bash
 --model-parallel-size 1 \
 --num-layers 50 \
 --hidden-size 4096 \
 --num-attention-heads 32 \
 --batch-size 10 \
---d \
 --deepspeed_config ds_zero_offload.config \
 --cpu_optimizer \
+--checkpoint-activations
 ```

 Most of the flags in the changes above should be familiar if you have stepped through the Megatron-LM [tutorial](/tutorials/megatron/), except for the **_--cpu_optimizer_**. This flag informs the model script to pass a CPU-based Adam optimizer, rather than a GPU-based one, to DeepSpeed as the client optimizer. It is very important that this flag be used when training with ZeRO-Offload to ensure correct operation of the DeepSpeed engine.
````
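The launch script above points at `ds_zero_offload.config`, but the DeepSpeed configuration json itself is not shown in this hunk. For orientation only, here is a minimal sketch of what a ZeRO-Offload style config could look like; the field values, batch size, and bucket settings are illustrative assumptions, not the file from the repository.

```bash
# Illustrative sketch only: ds_zero_offload.config is not shown in this diff,
# so the values below are assumptions, not the repository file.
# ZeRO-Offload pairs ZeRO stage 2 with offloading optimizer state to the CPU,
# which is why the launch script also passes --cpu_optimizer.
cat > ds_zero_offload.config << 'EOF'
{
  "train_micro_batch_size_per_gpu": 10,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "cpu_offload": true,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}
EOF
```

With the optimizer state held in host memory, the parameter update runs on the CPU, which is why the CPU-based Adam optimizer named in the tutorial text is required for correct operation.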
docs/_tutorials/zero.md (+3, -4)
````diff
@@ -27,7 +27,6 @@ We demonstrate the benefits of ZeRO stage 1 by showing that it enables data para
 --hidden-size 1600 \
 --num-attention-heads 16 \
 --batch-size 1 \
---d \
 --deepspeed_config ds_zero_stage_1.config \
 ```

````
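The `ds_zero_stage_1.config` referenced in the flags above is likewise not part of this diff. As a rough sketch only, a ZeRO stage 1 configuration turns on stage 1 under the `zero_optimization` key; the bucket size and batch settings below are assumptions, not the repository file.

```bash
# Illustrative sketch only: ds_zero_stage_1.config is not shown in this diff,
# so the values below are assumptions. Stage 1 partitions optimizer states
# across the data-parallel ranks.
cat > ds_zero_stage_1.config << 'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 500000000
  }
}
EOF
```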
````diff
@@ -53,16 +52,16 @@ As seen above, we set two fields in the **zero_optimization** key. Specifically
 From the nvidia-smi screenshot above we can see that that only GPUs 0--7 are being used for training the model. With ZeRO stage 1 we can further reduce the per-device memory consumption by increasing the data parallelism degree. These memory savings can be leveraged to either increase model size and/or batch size. In contrast, such benefits are not possible with data parallelism alone.

 ### Training a 10B Parameter GPT-2 model
-ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.
+ZeRO stage 2 optimizations further increases the size of models that can be trained using data parallelism. We show this training a model with 10B parameters using 32 V100 GPUs. First, we need to configure a 10B parameter model with activation checkpointing enabled. This can be done by applying the following GPT-2 model configuration changes to the DeepSpeed launch script.

 ```bash
 --model-parallel-size 1 \
 --num-layers 50 \
 --hidden-size 4096 \
 --num-attention-heads 32 \
 --batch-size 1 \
---d \
 --deepspeed_config ds_zero_stage_2.config \
+--checkpoint-activations
 ```

 Next, we need to update the DeepSpeed json configuration, as shown below, to enable ZeRO stage 2 optimizations:
@@ -80,7 +79,7 @@ Next, we need to update the DeepSpeed json configuration, as shown below, to ena
 }
 ```

-In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmenation during backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now run the launch the training run.
+In the above changes, we have set the _stage_ field to 2, and configured other optimization knobs that are available in ZeRO stage 2. For example, we have enabled _contiguous_gradients_ to reduce memory fragmenation during backward pass. A full description of these optimization knobs is available [here](/docs/config-json/#zero-optimizations-for-fp16-training). With these changes, we can now launch the training run.
````
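Only the closing brace of the stage 2 json block appears in the hunk above. As an orientation sketch only, a `ds_zero_stage_2.config` along the lines the paragraph describes (the `stage` field set to 2, `contiguous_gradients` enabled, plus related knobs) could look roughly as follows; the specific values are assumptions, not the repository file.

```bash
# Illustrative sketch only: the body of ds_zero_stage_2.config is collapsed in
# this diff, so the values below are assumptions. Stage 2 additionally
# partitions gradients, and contiguous_gradients reduces memory fragmentation
# during the backward pass, as the tutorial paragraph notes.
cat > ds_zero_stage_2.config << 'EOF'
{
  "train_micro_batch_size_per_gpu": 1,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "allgather_bucket_size": 500000000
  }
}
EOF
```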