double free or corruption (!prev) #770

Open
@ltm920716

Hello,
I am testing the llama2-70b-lora benchmark, but with the model replaced by llama2-7b, on a node with 2 RTX 4090 GPUs. The run ends with "double free or corruption (!prev)".
Running log:

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING]
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-10-15 09:46:30,947] torch.distributed.run: [WARNING] *****************************************
[2024-10-15 09:46:39,862] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:39,955] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-10-15 09:46:40,024] [INFO] [comm.py:637:init_distributed] cdb=None
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:40,173] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-10-15 09:46:40,173] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
[2024-10-15 09:46:41,603] [INFO] [partition_parameters.py:343:__exit__] finished initializing model - num_params = 291, num_elems = 6.74B
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.22s/it]
Loading checkpoint shards:  50%|█████████████████████████████████████████████████████████████████████████                                                                         | 1/2 [00:07<00:07,  7.24s/it]trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.88s/it]
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaFlashAttention2(
              (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
              (o_proj): lora.Linear(
                (base_layer): Linear(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
              (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
              (act_fn): SiLU()
            )
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
          )
        )
        (norm): LlamaRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
Parameter Offload: Total persistent parameters: 4460544 in 129 params
:::MLLOG {"namespace": "", "time_ms": 1728985613696, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": "True", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "llama2_70b_lora", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 97}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 101}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 105}}
:::MLLOG {"namespace": "", "time_ms": 1728985613697, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 108}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_name", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 112}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_poc_email", "value": "referece", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 116}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 120}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 2, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 124}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 3901, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 128}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 173, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 132}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "seed", "value": 2234, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 136}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 0.0, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_training_steps", "value": 1024, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 138}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_adamw_weight_decay", "value": 0.0001, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_gradient_clip_norm", "value": 0.3, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.0004, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_alpha", "value": 32, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "lora_rank", "value": 16, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 143}}
:::MLLOG {"namespace": "", "time_ms": 1728985613698, "event_type": "POINT_IN_TIME", "key": "gradient_accumulation_steps", "value": 1, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "init_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_END", "key": "init_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 147}}
:::MLLOG {"namespace": "", "time_ms": 1728985613699, "event_type": "INTERVAL_START", "key": "run_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 148}}
  0%|                                                                                                                                                                                  | 0/1024 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:428: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 3.818, 'grad_norm': 1.016438921422074, 'learning_rate': 0.00039945809133573807, 'epoch': 0.01}
  2%|███▉                                                                                                                                                                   | 24/1024 [01:36<1:06:20,  3.98s/it]:::MLLOG {"namespace": "", "time_ms": 1728985710476, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 3.818, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 48}}
{'loss': 3.3317, 'grad_norm': 1.0225008307763737, 'learning_rate': 0.0003994120140678966, 'epoch': 0.01}
{'loss': 2.7643, 'grad_norm': 0.9247777329488452, 'learning_rate': 0.0003978353019929562, 'epoch': 0.02}
{'loss': 2.2395, 'grad_norm': 0.7798605687888748, 'learning_rate': 0.0003951404260077057, 'epoch': 0.04}
  7%|███████████▋                                                                                                                                                           | 72/1024 [04:48<1:03:26,  4.00s/it]:::MLLOG {"namespace": "", "time_ms": 1728985902004, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 2.2395, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 144}}
{'loss': 2.0295, 'grad_norm': 1.0274388936605494, 'learning_rate': 0.00039500506901339887, 'epoch': 0.04}
{'loss': 2.067, 'grad_norm': 0.686157303906154, 'learning_rate': 0.0003913880671464418, 'epoch': 0.05}
{'eval_loss': 1.5954195261001587, 'eval_runtime': 136.6648, 'eval_samples_per_second': 1.266, 'eval_steps_per_second': 0.637, 'epoch': 0.05}
{'loss': 1.884, 'grad_norm': 0.6294034907656876, 'learning_rate': 0.0003865985597669478, 'epoch': 0.06}
 12%|███████████████████▍                                                                                                                                                  | 120/1024 [10:16<1:00:27,  4.01s/it]:::MLLOG {"namespace": "", "time_ms": 1728986230616, "event_type": "POINT_IN_TIME", "key": "train_loss", "value": 1.884, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 166, "samples_count": 240}}
{'loss': 1.8423, 'grad_norm': 0.5743935530993334, 'learning_rate': 0.00038637685311633367, 'epoch': 0.06}
{'loss': 1.8052, 'grad_norm': 0.6059068728905395, 'learning_rate': 0.0003807978586246887, 'epoch': 0.07}
{'eval_loss': 1.4383825063705444, 'eval_runtime': 136.9896, 'eval_samples_per_second': 1.263, 'eval_steps_per_second': 0.635, 'epoch': 0.07}
 14%|███████████████████████▋                                                                                                                                                | 144/1024 [14:10<58:46,  4.01s/it:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "block_stop", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 174, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "POINT_IN_TIME", "key": "eval_accuracy", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 179, "samples_count": 288}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_START", "key": "block_start", "value": "", "metadata": {"file": "mlperf_logging_utils.py", "lineno": 184, "samples_count": 144}}
:::MLLOG {"namespace": "", "time_ms": 1728986463780, "event_type": "INTERVAL_END", "key": "run_stop", "value": 1.4383825063705444, "metadata": {"file": "mlperf_logging_utils.py", "lineno": 195, "samples_count": 288, "status": "success"}}
{'loss': 1.8023, 'grad_norm': 0.8252179367975583, 'learning_rate': 0.0003805346636474518, 'epoch': 0.07}
{'train_runtime': 854.0888, 'train_samples_per_second': 2.398, 'train_steps_per_second': 1.199, 'train_loss': 2.429245588697236, 'epoch': 0.07}
 14%|███████████████████████▌                                                                                                                                              | 145/1024 [14:14<1:26:17,  5.89s/it]
double free or corruption (!prev)
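
As a sanity check on the figures in the log above (a small sketch, not part of the benchmark code), the reported trainable parameter count and sample counts can be reproduced from the printed model structure. It assumes LoRA is applied only to o_proj in each of the 32 decoder layers, as the module dump shows, and that samples_count equals the step number times global_batch_size. The variable names are illustrative only.

```python
# Values copied from the log above.
hidden_size = 4096          # in/out features of o_proj
lora_rank = 16              # "lora_rank" in the MLLOG output
num_layers = 32             # "(0-31): 32 x LlamaDecoderLayer"
all_params = 6_742_609_920  # "all params" as printed by PEFT
global_batch_size = 2       # "global_batch_size" in the MLLOG output

# Each adapted layer gets lora_A (hidden -> rank) and lora_B (rank -> hidden), no bias.
per_layer = hidden_size * lora_rank + lora_rank * hidden_size
trainable = per_layer * num_layers
trainable_pct = 100 * trainable / all_params

# Assumed relation between optimizer steps and logged samples_count.
samples_at_step_144 = 144 * global_batch_size

print(trainable)
print(trainable_pct)
print(samples_at_step_144)
```

These values agree with the log (4,194,304 trainable params, about 0.0622% trainable, samples_count 288 at step 144), so the model and LoRA setup look consistent up to the point where "run_stop" is logged with status "success". The "double free or corruption (!prev)" message is printed only after the final train_runtime summary.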