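I tested the llama2-70b-lora benchmark, but replaced the model with llama2-7b, on a single node with 2 RTX 4090 GPUs. The run logs run_stop with status "success" and then the process aborts with "double free or corruption (!prev)".
Running log: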
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[2024-10-15 09:46:30,947] [WARNING]
[2024-10-15 09:46:30,947] [WARNING] *****************************************
[2024-10-15 09:46:30,947] [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-15 09:46:30,947] [WARNING] *****************************************
[2024-10-15 09:46:39,862] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:39,955] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:40,024] [INFO] [] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:40,173] [INFO] [] cdb=None
[2024-10-15 09:46:40,173] [INFO] [] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:41,603] [INFO] [] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.22s/it]
Loading checkpoint shards: 50%|█████████████████████████████████████████████████████████████████████████ | 1/2 [00:07<00:07, 7.24s/it]trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.88s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
(base_model): LoraModel(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaFlashAttention2(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
(rotary_emb): LlamaRotaryEmbedding()
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLU()
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
(norm): LlamaRMSNorm()
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Parameter Offload: Total persistent parameters: 4460544 in 129 params
:::MLLOG {"namespace": "", "time_ms": 1728985613696, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": "True", "metadata": {"file": "", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "llama2_70b_lora", "metadata": {"file": "", "lineno": 97}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "", "lineno": 101}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "referece", "metadata": {"file": "", "lineno": 105}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "referece", "metadata": {"file": "", "lineno": 108}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_name", "value": "referece", "metadata": {"file": "", "lineno": 112}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_email", "value": "referece", "metadata": {"file": "", "lineno": 116}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "", "lineno": 120}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "", "lineno": 124}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 3901, "metadata": {"file": "", "lineno": 128}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 173, "metadata": {"file": "", "lineno": 132}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2234, "metadata": {"file": "", "lineno": 136}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.0, "metadata": {"file": "", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 1024, "metadata": {"file": "", "lineno": 138}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.0001, "metadata": {"file": "", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 0.3, "metadata": {"file": "", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_alpha", "value": 32, "metadata": {"file": "", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_rank", "value": 16, "metadata": {"file": "", "lineno": 143}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "init_start", "value": "", "metadata": {"file": "", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_END", "key": "init_stop", "value": "", "metadata": {"file": "", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "run_start", "value": "", "metadata": {"file": "", "lineno": 148}}
0%| | 0/1024 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/ UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/ UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
{'loss': 3.818, 'grad_norm': 1.016438921422074, 'learning_rate': 0.00039945809133573807, 'epoch': 0.01}
2%|███▉ | 24/1024 [01:36<1:06:20, 3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "", "lineno": 166, "samples_count": 48}}
{'loss': 3.3317, 'grad_norm': 1.0225008307763737, 'learning_rate': 0.0003994120140678966, 'epoch': 0.01}
{'loss': 2.7643, 'grad_norm': 0.9247777329488452, 'learning_rate': 0.0003978353019929562, 'epoch': 0.02}
{'loss': 2.2395, 'grad_norm': 0.7798605687888748, 'learning_rate': 0.0003951404260077057, 'epoch': 0.04}
7%|███████████▋ | 72/1024 [04:48<1:03:26, 4.00s/it]:::MLLOG {"namespace": "", "time_ms": 1728985902004, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 2.2395, "metadata": {"file": "", "lineno": 166, "samples_count": 144}}
{'loss': 2.0295, 'grad_norm': 1.0274388936605494, 'learning_rate': 0.00039500506901339887, 'epoch': 0.04}
{'loss': 2.067, 'grad_norm': 0.686157303906154, 'learning_rate': 0.0003913880671464418, 'epoch': 0.05}
{'eval_loss': 1.5954195261001587, 'eval_runtime': 136.6648, 'eval_samples_per_second': 1.266, 'eval_steps_per_second': 0.637, 'epoch': 0.05}
{'loss': 1.884, 'grad_norm': 0.6294034907656876, 'learning_rate': 0.0003865985597669478, 'epoch': 0.06}
12%|███████████████████▍ | 120/1024 [10:16<1:00:27, 4.01s/it]:::MLLOG {"namespace": "", "time_ms": 1728986230616, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.884, "metadata": {"file": "", "lineno": 166, "samples_count": 240}}
{'loss': 1.8423, 'grad_norm': 0.5743935530993334, 'learning_rate': 0.00038637685311633367, 'epoch': 0.06}
{'loss': 1.8052, 'grad_norm': 0.6059068728905395, 'learning_rate': 0.0003807978586246887, 'epoch': 0.07}
{'eval_loss': 1.4383825063705444, 'eval_runtime': 136.9896, 'eval_samples_per_second': 1.263, 'eval_steps_per_second': 0.635, 'epoch': 0.07}
14%|███████████████████████▋ | 144/1024 [14:10<58:46, 4.01s/it:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "block_stop", "value": "", "metadata": {"file": "", "lineno": 174, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 179, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_START", "key": "block_start", "value": "", "metadata": {"file": "", "lineno": 184, "samples_count": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 195, "samples_count": 288, "status": "success"}}
{'loss': 1.8023, 'grad_norm': 0.8252179367975583, 'learning_rate': 0.0003805346636474518, 'epoch': 0.07}
{'train_runtime': 854.0888, 'train_samples_per_second': 2.398, 'train_steps_per_second': 1.199, 'train_loss': 2.429245588697236, 'epoch': 0.07}
14%|███████████████████████▌ | 145/1024 [14:14<1:26:17, 5.89s/it]
double free or corruption (!prev)
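
To see the order of events leading up to the crash, the :::MLLOG records can be pulled out of the log with a short script. This is only a sketch, not part of the benchmark code: the names parse_mllog_line and summarize are made up for illustration, and it assumes each record is a single JSON object that follows the :::MLLOG marker on one line (possibly after a progress-bar prefix), as in the log above.

```python
import json

MARKER = ":::MLLOG"


def parse_mllog_line(line):
    """Return the JSON payload of an MLLOG line as a dict, or None if absent."""
    idx = line.find(MARKER)
    if idx == -1:
        return None
    # Everything after the marker is expected to be one JSON object.
    payload = line[idx + len(MARKER):].strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None


def summarize(lines):
    """Collect key, event type, value and samples_count for every MLLOG record."""
    events = []
    for line in lines:
        record = parse_mllog_line(line)
        if record is None:
            continue
        events.append({
            "key": record["key"],
            "event_type": record["event_type"],
            "value": record["value"],
            "samples_count": record.get("metadata", {}).get("samples_count"),
        })
    return events


# A few lines copied from the log above, including one with a progress-bar prefix.
SAMPLE = [
    '2%|███▉ | 24/1024 [01:36<1:06:20, 3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "", "lineno": 166, "samples_count": 48}}',
    ':::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 179, "samples_count": 288}}',
    ':::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 195, "samples_count": 288, "status": "success"}}',
    "double free or corruption (!prev)",
]

if __name__ == "__main__":
    for event in summarize(SAMPLE):
        print(event["samples_count"], event["event_type"], event["key"], event["value"])
```

Run against the full log, this makes it easy to confirm that run_stop is recorded at samples_count 288 before the abort message appears, and that the logged eval_accuracy value is identical to the last eval_loss.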