
double free or corruption (!prev) #770




I tested the llama2-70b-lora benchmark, but replaced the model with llama2-7b, on a single node with 2 RTX 4090 GPUs.
Running log:

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[2024-10-15 09:46:30,947] [WARNING]
[2024-10-15 09:46:30,947] [WARNING] *****************************************
[2024-10-15 09:46:30,947] [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-15 09:46:30,947] [WARNING] *****************************************
[2024-10-15 09:46:39,862] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:39,955] [INFO] [] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:40,024] [INFO] [] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `'cuda')`.
[2024-10-15 09:46:40,173] [INFO] [] cdb=None
[2024-10-15 09:46:40,173] [INFO] [] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `'cuda')`.
[2024-10-15 09:46:41,603] [INFO] [] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.22s/it]
Loading checkpoint shards:  50%|█████████████████████████████████████████████████████████████████████████                                                                         | 1/2 [00:07<00:07,  7.24s/it]trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.88s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (o_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              (rotary_emb): LlamaRotaryEmbedding()
            (mlp): LlamaMLP(
              (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
              (act_fn): SiLU()
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
        (norm): LlamaRMSNorm()
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Parameter Offload: Total persistent parameters: 4460544 in 129 params
:::MLLOG {"namespace": "", "time_ms": 1728985613696, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": "True", "metadata": {"file": "", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "llama2_70b_lora", "metadata": {"file": "", "lineno": 97}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "", "lineno": 101}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "referece", "metadata": {"file": "", "lineno": 105}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "referece", "metadata": {"file": "", "lineno": 108}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_name", "value": "referece", "metadata": {"file": "", "lineno": 112}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_email", "value": "referece", "metadata": {"file": "", "lineno": 116}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "", "lineno": 120}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "", "lineno": 124}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 3901, "metadata": {"file": "", "lineno": 128}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 173, "metadata": {"file": "", "lineno": 132}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2234, "metadata": {"file": "", "lineno": 136}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.0, "metadata": {"file": "", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 1024, "metadata": {"file": "", "lineno": 138}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.0001, "metadata": {"file": "", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 0.3, "metadata": {"file": "", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_alpha", "value": 32, "metadata": {"file": "", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_rank", "value": 16, "metadata": {"file": "", "lineno": 143}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "init_start", "value": "", "metadata": {"file": "", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_END", "key": "init_stop", "value": "", "metadata": {"file": "", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "run_start", "value": "", "metadata": {"file": "", "lineno": 148}}
  0%|                                                                                                                                                                                  | 0/1024 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/ UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/ UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
{'loss': 3.818, 'grad_norm': 1.016438921422074, 'learning_rate': 0.00039945809133573807, 'epoch': 0.01}
  2%|███▉                                                                                                                                                                   | 24/1024 [01:36<1:06:20,  3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "", "lineno": 166, "samples_count": 48}}
{'loss': 3.3317, 'grad_norm': 1.0225008307763737, 'learning_rate': 0.0003994120140678966, 'epoch': 0.01}
{'loss': 2.7643, 'grad_norm': 0.9247777329488452, 'learning_rate': 0.0003978353019929562, 'epoch': 0.02}
{'loss': 2.2395, 'grad_norm': 0.7798605687888748, 'learning_rate': 0.0003951404260077057, 'epoch': 0.04}
  7%|███████████▋                                                                                                                                                           | 72/1024 [04:48<1:03:26,  4.00s/it]:::MLLOG {"namespace": "", "time_ms": 1728985902004, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 2.2395, "metadata": {"file": "", "lineno": 166, "samples_count": 144}}
{'loss': 2.0295, 'grad_norm': 1.0274388936605494, 'learning_rate': 0.00039500506901339887, 'epoch': 0.04}
{'loss': 2.067, 'grad_norm': 0.686157303906154, 'learning_rate': 0.0003913880671464418, 'epoch': 0.05}
{'eval_loss': 1.5954195261001587, 'eval_runtime': 136.6648, 'eval_samples_per_second': 1.266, 'eval_steps_per_second': 0.637, 'epoch': 0.05}
{'loss': 1.884, 'grad_norm': 0.6294034907656876, 'learning_rate': 0.0003865985597669478, 'epoch': 0.06}
 12%|███████████████████▍                                                                                                                                                  | 120/1024 [10:16<1:00:27,  4.01s/it]:::MLLOG {"namespace": "", "time_ms": 1728986230616, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.884, "metadata": {"file": "", "lineno": 166, "samples_count": 240}}
{'loss': 1.8423, 'grad_norm': 0.5743935530993334, 'learning_rate': 0.00038637685311633367, 'epoch': 0.06}
{'loss': 1.8052, 'grad_norm': 0.6059068728905395, 'learning_rate': 0.0003807978586246887, 'epoch': 0.07}
{'eval_loss': 1.4383825063705444, 'eval_runtime': 136.9896, 'eval_samples_per_second': 1.263, 'eval_steps_per_second': 0.635, 'epoch': 0.07}
 14%|███████████████████████▋                                                                                                                                                | 144/1024 [14:10<58:46,  4.01s/it:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "block_stop", "value": "", "metadata": {"file": "", "lineno": 174, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 179, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_START", "key": "block_start", "value": "", "metadata": {"file": "", "lineno": 184, "samples_count": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "", "lineno": 195, "samples_count": 288, "status": "success"}}
{'loss': 1.8023, 'grad_norm': 0.8252179367975583, 'learning_rate': 0.0003805346636474518, 'epoch': 0.07}
{'train_runtime': 854.0888, 'train_samples_per_second': 2.398, 'train_steps_per_second': 1.199, 'train_loss': 2.429245588697236, 'epoch': 0.07}
 14%|███████████████████████▌                                                                                                                                              | 145/1024 [14:14<1:26:17,  5.89s/it]
double free or corruption (!prev)
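
In the log above, `run_stop` is reported with `"status": "success"` at step 144 and the abort only appears afterwards, which suggests (but does not prove) that the crash happens while the process is finishing up rather than during a training step. The message itself comes from glibc, which then aborts the process with SIGABRT. One possible way to narrow this down is to have Python print its own stack when that signal arrives. The sketch below assumes the training entry point is a Python script that can be edited; `enable_crash_tracebacks` is a hypothetical helper name, not part of the benchmark code.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Hypothetical helper: call once at the very top of the training script.

    Asks the interpreter to dump the Python stack of every thread when the
    process receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL. glibc raises
    SIGABRT after printing "double free or corruption", so the dump shows
    which Python frame was active at that moment.
    """
    target = stream if stream is not None else sys.stderr
    # The target must be a real file with a file descriptor (fileno()).
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Usage (at the top of the training script, before any model code runs):
#     enable_crash_tracebacks()
```

The same effect is available without editing the script by setting the environment variable `PYTHONFAULTHANDLER=1` before launching. The dump only shows Python frames, so it indicates which call was in progress, not which native library corrupted the heap.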

