Description
Hello,
I am testing the llama2-70b-lora benchmark, but with the model replaced by llama2-7b, on a single node with two RTX 4090 GPUs. Roughly, the swap amounts to the setup sketched below.
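A minimal sketch of the model/LoRA setup using the Hugging Face `transformers`/`peft` APIs; the checkpoint name and the `target_modules` choice are inferred from the log further down and may differ from what the reference scripts actually wire up:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load Llama-2-7B in place of the 70B reference checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",           # assumption: stock HF checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# LoRA settings matching the MLLOG output below (lora_rank=16, lora_alpha=32,
# Dropout p=0.1). Wrapping only o_proj reproduces the printed module tree and
# the 4,194,304 trainable parameters (32 layers x 2 x 4096 x 16).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
```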
Running log:
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING]
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:39,862] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:39,955] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:40,024] [INFO] [comm.py:637:init_distributed] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:40,173] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-15 09:46:40,173] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:41,603] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.22s/it]
Loading checkpoint shards: 50%|█████████████████████████████████████████████████████████████████████████ | 1/2 [00:07<00:07, 7.24s/it]trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.88s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
PeftModelForCausalLM(
(base_model): LoraModel(
(model): LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaFlashAttention2(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): lora.Linear(
(base_layer): Linear(in_features=4096, out_features=4096, bias=False)
(lora_dropout): ModuleDict(
(default): Dropout(p=0.1, inplace=False)
)
(lora_A): ModuleDict(
(default): Linear(in_features=4096, out_features=16, bias=False)
)
(lora_B): ModuleDict(
(default): Linear(in_features=16, out_features=4096, bias=False)
)
(lora_embedding_A): ParameterDict()
(lora_embedding_B): ParameterDict()
)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
)
)
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Parameter Offload: Total persistent parameters: 4460544 in 129 params
:::MLLOG {"namespace": "", "time_ms": 1728985613696, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": "True", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "llama2_70b_lora", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 97}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 101}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 105}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 108}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_name", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 112}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_email", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 116}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 120}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 124}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 3901, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 128}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 173, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 132}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2234, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 136}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.0, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 1024, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 138}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.0001, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 0.3, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_alpha", "value": 32, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_rank", "value": 16, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 143}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "init_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_END", "key": "init_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "run_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 148}}
0%| | 0/1024 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
{'loss': 3.818, 'grad_norm': 1.016438921422074, 'learning_rate': 0.00039945809133573807, 'epoch': 0.01}
2%|███▉ | 24/1024 [01:36<1:06:20, 3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 48}}
{'loss': 3.3317, 'grad_norm': 1.0225008307763737, 'learning_rate': 0.0003994120140678966, 'epoch': 0.01}
{'loss': 2.7643, 'grad_norm': 0.9247777329488452, 'learning_rate': 0.0003978353019929562, 'epoch': 0.02}
{'loss': 2.2395, 'grad_norm': 0.7798605687888748, 'learning_rate': 0.0003951404260077057, 'epoch': 0.04}
7%|███████████▋ | 72/1024 [04:48<1:03:26, 4.00s/it]:::MLLOG {"namespace": "", "time_ms": 1728985902004, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 2.2395, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 144}}
{'loss': 2.0295, 'grad_norm': 1.0274388936605494, 'learning_rate': 0.00039500506901339887, 'epoch': 0.04}
{'loss': 2.067, 'grad_norm': 0.686157303906154, 'learning_rate': 0.0003913880671464418, 'epoch': 0.05}
{'eval_loss': 1.5954195261001587, 'eval_runtime': 136.6648, 'eval_samples_per_second': 1.266, 'eval_steps_per_second': 0.637, 'epoch': 0.05}
{'loss': 1.884, 'grad_norm': 0.6294034907656876, 'learning_rate': 0.0003865985597669478, 'epoch': 0.06}
12%|███████████████████▍ | 120/1024 [10:16<1:00:27, 4.01s/it]:::MLLOG {"namespace": "", "time_ms": 1728986230616, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.884, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 240}}
{'loss': 1.8423, 'grad_norm': 0.5743935530993334, 'learning_rate': 0.00038637685311633367, 'epoch': 0.06}
{'loss': 1.8052, 'grad_norm': 0.6059068728905395, 'learning_rate': 0.0003807978586246887, 'epoch': 0.07}
{'eval_loss': 1.4383825063705444, 'eval_runtime': 136.9896, 'eval_samples_per_second': 1.263, 'eval_steps_per_second': 0.635, 'epoch': 0.07}
14%|███████████████████████▋ | 144/1024 [14:10<58:46, 4.01s/it:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "block_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 174, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 179, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_START", "key": "block_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 184, "samples_count": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 195, "samples_count": 288, "status": "success"}}
{'loss': 1.8023, 'grad_norm': 0.8252179367975583, 'learning_rate': 0.0003805346636474518, 'epoch': 0.07}
{'train_runtime': 854.0888, 'train_samples_per_second': 2.398, 'train_steps_per_second': 1.199, 'train_loss': 2.429245588697236, 'epoch': 0.07}
14%|███████████████████████▌ | 145/1024 [14:14<1:26:17, 5.89s/it]
double free or corruption (!prev)

The training itself reaches `run_stop` with `"status": "success"` (eval loss 1.438 after 144 steps), but the process then aborts with the `double free or corruption (!prev)` above during shutdown. Has anyone else hit this teardown crash when running the benchmark on RTX 4090s?
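Since the abort happens only after `run_stop`, my guess is that it occurs while NCCL/DeepSpeed tears down the process group at interpreter exit. One workaround I am considering is destroying the process group explicitly at the end of the script (a sketch only, under that assumption; not a confirmed fix):

```python
import torch.distributed as dist

# Hypothetical cleanup at the very end of the training script. Destroying the
# process group explicitly, instead of relying on destructors at interpreter
# exit, sometimes avoids aborts during NCCL teardown.
if dist.is_initialized():
    dist.barrier()                # wait for all ranks before teardown
    dist.destroy_process_group()  # explicit NCCL shutdown
```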