
Commit e897c23

Dashiell Stander (dashstander) authored, with github-actions
Implement DeepSpeed Main autotuning for NeoX (#739)
Squashed commit history (per-commit sign-off trailers deduplicated):

* Add autotuning
* Add autotuning config
* Need to add it to deepspeed args
* Do not calculate derived values when autotuning
* Need to set no_ssh_check argument with slurm
* Set master_address for SLURM
* Let json be a file ending
* Write configs to json files instead of passing them in as CL arguments
* Pass in slurm_comment directly to DeepSpeed
* Move slurm_comment to deepspeed args
* Move configs out of /tmp
* Get values from ds_config when autotuning
* Pass in autotuning config properly
* Lower memory requirement in tune.sh
* Cursed hack to pass in autotuning config properly
* More sophisticated typing for autotuning config
* So much debugging; fix small bug; debug print statements
* json configs for DeepSpeed
* Only two nodes
* Needed to change up the configs
* Do not actually need to do that
* Tune 6.7B model
* New types for zero stage
* Tuning a larger model
* Always copy autotuning args from ds_config; cleaner this way, I think...
* Need to copy over train_batch_size as well
* New configs; tests
* Sync with new method of passing in autotuning configs
* Replicate on different cluster
* Use typing `List` and fix bug in decoding
* Use checkpoint_factor
* Change autotuning config name
* Add no_ssh_check config option; no_ssh_check should be a configured value
* Only pass in master_addr once
* DeepSpeed now base64 encodes ds_config; fix base64 error
* Still need to pass in megatron_fp
* Only write to file when doing autotuning
* Remove debugging configs and test scripts
* Clean up; run pre-commit hooks
* Remove duplicated einops
* Move autotuning configs into their own subdir
* Update NeoXArgs docs automatically (repeated, via github-actions)

Signed-off-by: Dashiell Stander <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Quentin Anthony <[email protected]>
1 parent 68d223c commit e897c23
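Several commits above concern how the merged ds_config travels from NeoX to DeepSpeed: it is written to a JSON file rather than passed as a command-line argument, and newer DeepSpeed base64-encodes the config string. Below is a minimal standalone sketch of that encoding round trip (hypothetical helpers, not the actual NeoX or DeepSpeed code), showing why base64 sidesteps shell-quoting problems:

import base64
import json

def encode_ds_config(ds_config: dict) -> str:
    """Serialize a DeepSpeed config dict and base64-encode it so it can
    travel as a single command-line argument without quoting issues."""
    raw = json.dumps(ds_config).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("utf-8")

def decode_ds_config(encoded: str) -> dict:
    """Inverse of encode_ds_config: recover the original config dict."""
    raw = base64.urlsafe_b64decode(encoded.encode("utf-8"))
    return json.loads(raw.decode("utf-8"))

if __name__ == "__main__":
    config = {"train_micro_batch_size_per_gpu": 1,
              "zero_optimization": {"stage": 1}}
    blob = encode_ds_config(config)
    assert decode_ds_config(blob) == config  # round trip is lossless
    print(blob)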

File tree

9 files changed: +509 −38 lines changed
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 12,
  "hidden-size": 768,
  "num-attention-heads": 12,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "no-weight-tying": true,

  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0006,
      "betas": [0.9, 0.999],
      "eps": 1.0e-8
    }
  },

  "train_micro_batch_size_per_gpu": 1,
  "data-impl": "mmap",
  "split": "949,50,1",

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.0,
  "hidden-dropout": 0.0,
  "attention-dropout": 0.0,

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train-iters": 320000,
  "lr-decay-iters": 320000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "save-interval": 10000,
  "eval-interval": 1000,
  "eval-iters": 10,

  "log-interval": 100,
  "steps_per_print": 10,
  "keep-last-n-checkpoints": 4,
  "wall_clock_breakdown": true,
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "comment": "neox",
  "autotuning": {
    "enabled": true,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  },
  "zero_optimization": {
    "stage": [0, 1, 2, 3]
  },
  "train-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
  "valid-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
  "test-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"]
}
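This config leaves `zero_optimization.stage` as the list `[0, 1, 2, 3]`, which (per the `zero_stage` typing change documented later in this diff under `configs/neox_arguments.md`) tells the autotuner to try each ZeRO stage. A minimal sketch of expanding such a list into per-stage trial configs (hypothetical helper, not DeepSpeed's actual tuner code):

import copy
from typing import Iterator

def expand_zero_stages(base_config: dict) -> Iterator[dict]:
    """Yield one concrete trial config per candidate ZeRO stage.

    "stage" may be an int, a list of ints, or "all"; a list or "all"
    means the tuner must benchmark each stage separately.
    """
    stages = base_config["zero_optimization"]["stage"]
    if stages == "all":
        stages = [0, 1, 2, 3]
    elif isinstance(stages, int):
        stages = [stages]
    for stage in stages:
        trial = copy.deepcopy(base_config)
        trial["zero_optimization"]["stage"] = stage
        yield trial

# A config with "stage": [0, 1, 2, 3] expands to four trials.
base = {"zero_optimization": {"stage": [0, 1, 2, 3]}}
assert [t["zero_optimization"]["stage"]
        for t in expand_zero_stages(base)] == [0, 1, 2, 3]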
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,
  "num-layers": 12,
  "hidden-size": 768,
  "num-attention-heads": 12,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "no-weight-tying": true,
  "scaled-upper-triang-masked-softmax-fusion": true,
  "bias-gelu-fusion": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0006,
      "betas": [0.9, 0.999],
      "eps": 1.0e-8
    }
  },
  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  },
  "train_micro_batch_size_per_gpu": 1,
  "autotuning_config": {
    "enabled": true,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  },
  "data-impl": "mmap",
  "split": "949,50,1",
  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,
  "gradient_clipping": 1.0,
  "weight-decay": 0.0,
  "hidden-dropout": 0.0,
  "attention-dropout": 0.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "train-iters": 200,
  "lr-decay-iters": 320000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "save-interval": 10000,
  "eval-interval": 1000,
  "eval-iters": 10,
  "log-interval": 100,
  "steps_per_print": 10,
  "keep-last-n-checkpoints": 4,
  "wall_clock_breakdown": true,
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "comment": "neox"
}
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 24,
  "hidden-size": 2048,
  "num-attention-heads": 16,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "no-weight-tying": true,
  "gpt_j_residual": false,
  "output_layer_parallelism": "column",
  "attention_config": [[["flash"], 24]],
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0002,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 0.00002,

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "autotuning": {
    "enabled": true,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  },
  "data-impl": "mmap",

  "checkpoint-activations": false,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train-iters": 320000,
  "lr-decay-iters": 320000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 10000,
  "eval-interval": 1000,
  "eval-iters": 10,
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "no_ssh_check": true,

  "log-interval": 10,
  "steps_per_print": 10,
  "keep-last-n-checkpoints": 1,
  "wall_clock_breakdown": true
}
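`attention_config` above uses NeoX's run-length convention: a list of `[[types...], count]` pairs that expands to one attention type per layer, so `[[["flash"], 24]]` makes all 24 layers use FlashAttention. A sketch of one plausible expansion under that reading (illustrative only, not NeoX's actual implementation):

from typing import Any, List, Sequence

def expand_attention_config(pairs: Sequence[Sequence[Any]],
                            num_layers: int) -> List[str]:
    """Expand run-length [[types, count], ...] pairs to a per-layer list.

    [[["flash"], 24]] repeats the pattern ["flash"] 24 times, giving
    24 flash-attention layers; multi-type patterns repeat in order.
    """
    layers: List[str] = []
    for types, count in pairs:
        for _ in range(count):
            layers.extend(types)
    assert len(layers) == num_layers, "expanded pattern must cover num-layers"
    return layers

print(expand_attention_config([[["flash"], 24]], 24)[:3])
# ['flash', 'flash', 'flash']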
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
{
  "pipe-parallel-size": 1,
  "model-parallel-size": 8,

  "num-layers": 32,
  "hidden-size": 4096,
  "num-attention-heads": 32,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "norm": "layernorm",
  "pos-emb": "rotary",
  "no-weight-tying": true,

  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00012,
      "betas": [0.9, 0.999],
      "eps": 1.0e-8
    }
  },

  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": [0, 1, 2, 3]
  },
  "data-impl": "mmap",
  "split": "949,50,1",

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "fp16": {
    "fp16": true,
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },

  "train-iters": 100,
  "lr-decay-iters": 320000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 10000,
  "eval-interval": 1000,
  "eval-iters": 10,
  "log-interval": 100,
  "steps_per_print": 10,
  "keep-last-n-checkpoints": 4,
  "wall_clock_breakdown": true,
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "no_ssh_check": true,
  "comment": "neox",
  "autotuning": {
    "enabled": true,
    "mp_size": 8,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  }
}
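`arg_mappings` appears in each of these configs: it maps a tuned `ds_config` key to the command-line flag the tuner should emit when feeding a chosen value back into the launch command (`mp_size: 8` here presumably mirrors `model-parallel-size: 8` so the tuner's memory estimates account for model parallelism). A minimal sketch of the mapping step, illustrative only; the real logic lives in DeepSpeed's autotuner:

from typing import Dict, List

def tuned_values_to_cli_args(tuned: Dict[str, object],
                             arg_mappings: Dict[str, str]) -> List[str]:
    """Translate tuner-chosen config values into CLI flag/value pairs.

    tuned:        e.g. {"train_micro_batch_size_per_gpu": 4}
    arg_mappings: ds_config key -> CLI flag, as in the configs above.
    """
    args: List[str] = []
    for key, value in tuned.items():
        flag = arg_mappings.get(key)
        if flag is None:
            continue  # unmapped keys stay in the JSON config only
        args += [flag, str(value)]
    return args

mappings = {
    "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
    "gradient_accumulation_steps": "--gradient_accumulation_steps",
}
print(tuned_values_to_cli_args(
    {"train_micro_batch_size_per_gpu": 4, "gradient_accumulation_steps": 8},
    mappings))
# ['--train_micro_batch_size_per_gpu', '4',
#  '--gradient_accumulation_steps', '8']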

configs/neox_arguments.md

Lines changed: 24 additions & 2 deletions
@@ -592,7 +592,7 @@ Optimizer Arguments

-- **zero_stage**: int
+- **zero_stage**: typing.Union[int, typing.List[int], typing.Literal['all']]

     Default = None

@@ -1732,6 +1732,14 @@ Args for deepspeed config

+- **autotuning**: dict
+
+    Default = None
+
+    Dictionary as described in DeepSpeed autotuning documentation: https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning
+
 ## NeoXArgsDeepspeedRunner

 Args for deepspeed runner (deepspeed.launcher.runner).

@@ -1801,7 +1809,7 @@ Args for deepspeed runner (deepspeed.launcher.runner).

-- **launcher**: str
+- **launcher**: typing.Literal['pdsh', 'openmpi', 'mvapich', 'slurm']

     Default = pdsh

@@ -1817,6 +1825,12 @@ Args for deepspeed runner (deepspeed.launcher.runner).

+- **autotuning_run**: str
+
+    Default = None
+
+    Either "tune", "run", or `None`.
+
 - **no_ssh_check**: bool

     Default = False

@@ -1831,3 +1845,11 @@ Args for deepspeed runner (deepspeed.launcher.runner).

     Adds a `--comment` to the DeepSpeed launch command. In DeeperSpeed this is passed on to the SlurmLauncher as well. Sometimes necessary for cluster rules, or so I've heard.

+- **no_ssh_check**: bool
+
+    Default = False
+
+    If `True` and running with multiple nodes, then DeepSpeed doesn't conduct a check to ensure the head node is reachable with ssh.
configs/slurm_local.json

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "vocab-file": "data/gpt2-vocab.json",
  "merge-file": "data/gpt2-merges.txt",
  "save": "checkpoints",
  "checkpoint_validation_with_forward_pass": false,
  "tensorboard-dir": "tensorboard",
  "log-dir": "logs",
  "use_wandb": true,
  "wandb_host": "https://api.wandb.ai",
  "wandb_project": "neox"
}
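`slurm_local.json` holds only cluster-local paths and logging settings; a run layers it together with a model/tuning config like the ones above. A minimal sketch of that file-layering idea, assuming keys stay disjoint across files (hypothetical helper and paths; NeoX's own loader also rejects duplicate keys and normalizes `-`/`_` in names):

import json
from pathlib import Path
from typing import Any, Dict

def load_neox_configs(*paths: str) -> Dict[str, Any]:
    """Merge several JSON config files into one flat settings dict.

    Model/tuning config plus cluster config (like slurm_local.json)
    combine into a single run configuration; overlapping keys raise.
    """
    merged: Dict[str, Any] = {}
    for path in paths:
        conf = json.loads(Path(path).read_text())
        overlap = merged.keys() & conf.keys()
        if overlap:
            raise ValueError(f"{path} redefines keys: {sorted(overlap)}")
        merged.update(conf)
    return merged

# Hypothetical usage, combining a tuning config with the SLURM-local one:
# settings = load_neox_configs("configs/autotuning/small_tune.json",
#                              "configs/slurm_local.json")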
