
Does this fine-tuning code not work on a single A6000 GPU for LLaMA-2-7B with LoRA?  #359

@01choco

Description

Hi, I am trying to use your RLHF code to fine-tune and apply reinforcement learning to LLaMA. However, I keep getting a CUDA out-of-memory error while fine-tuning the LLaMA-2-7B model on a single A6000 GPU, even though I use the PEFT LoRA method.

I applied these changes to get rid of the CUDA OOM, but the error still occurs (see the sketch after this list):

  1. smaller batch size (1)
  2. smaller max length of token and sequence
  3. PEFT
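
For reference, this is roughly how I understand the PEFT LoRA part in isolation. It is a plain Hugging Face Transformers + PEFT sketch, not this repo's actual training code, and the checkpoint name is just illustrative:

  # Standalone sketch of the memory-saving knobs: fp16 weights, gradient
  # checkpointing, and LoRA adapters via PEFT. Not the repo's training loop.
  import torch
  from transformers import AutoModelForCausalLM
  from peft import LoraConfig, get_peft_model

  model = AutoModelForCausalLM.from_pretrained(
      "meta-llama/Llama-2-7b-hf",        # illustrative checkpoint name
      torch_dtype=torch.float16,         # fp16 weights: ~13 GB instead of ~26 GB
  )
  model.gradient_checkpointing_enable()  # trade compute for activation memory

  lora_config = LoraConfig(
      r=8, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM"
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()     # only the LoRA adapters are trainable

With batch size 1 and a shorter max sequence length on top of this, I expected the 7B model to fit in 48 GB.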

Can I fine-tune LLaMA-2-7B on an A6000 GPU? Has anyone succeeded in fine-tuning LLaMA on a single GPU? I just want to know whether I'm doing something wrong or whether it's fundamentally impossible to fine-tune this model on a single A6000.
And does anyone know how to get rid of the CUDA error in this situation?
Here is my config.yaml file!

  model: "llama-7B"
  model_folder: "./llama/llama-2-7b"
  tokenizer_path: "./llama/tokenizer.model"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  # freeze model embeddings during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  # only for llama
  use_fairscale: True
  # max sequence length for the actor (i.e. prompt + completion); it depends on
  # the model used.
  max_sequence_length: 1024
  # max tokens generated by the actor (completion only)
  max_tokens: 1024
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for the template or as a safety margin
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.1
  batch_size: 2
  # number of iterations between prints
  iteration_per_print: 1
  lr: 0.000009
  epochs: 1
  # number of backpropagation steps between checkpoint saves
  checkpoint_steps: 5000
  # number of checkpoints to keep while removing the older ones
  # (keeps the memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 5
  # here specify the name of the actor checkpoint from which to resume
  # actor training. If null, load the last one.
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: True
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False
  # use_peft - the PEFT parameters can be modified in peft_config.yaml
  peft_enable: True
  peft_config_path: "./artifacts/config/peft_config.yaml"
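
Since deepspeed_enable is True, here is the kind of ds_config.json I am experimenting with. This is only a sketch of ZeRO stage 2 with optimizer offload to CPU (written out from Python for convenience); the values are my own guesses, not the settings the repo ships with:

  # Sketch of a ds_config.json: ZeRO stage 2 with the Adam optimizer states
  # offloaded to CPU memory. All values here are guesses.
  import json

  ds_config = {
      "train_micro_batch_size_per_gpu": 1,
      "gradient_accumulation_steps": 8,
      "fp16": {"enabled": True},
      "zero_optimization": {
          "stage": 2,
          "offload_optimizer": {"device": "cpu"},
      },
  }

  with open("./artifacts/config/ds_config.json", "w") as f:
      json.dump(ds_config, f, indent=2)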

And here is my peft_config.yaml file:

  inference_mode: False
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
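
And this is the back-of-the-envelope memory math I am using to reason about whether this should fit on a 48 GB A6000 (rough estimates only, ignoring activations and fragmentation):

  # Rough memory arithmetic for LLaMA-2-7B on a 48 GB A6000 (estimates only).
  params = 7e9
  GB = 1e9

  # Full fine-tuning in mixed precision: fp16 weights + fp16 grads + fp32 master
  # weights + Adam m and v states, roughly 16 bytes per parameter.
  full_ft = params * 16 / GB
  # LoRA: the 7B base weights stay frozen in fp16; adapter weights and their
  # optimizer states are tiny, so activations become the dominant extra cost.
  lora_base = params * 2 / GB

  print(f"full fine-tuning: ~{full_ft:.0f} GB")    # ~112 GB, cannot fit on 48 GB
  print(f"LoRA frozen base: ~{lora_base:.0f} GB")  # ~14 GB, should leave headroom

If this math is roughly right, the frozen base model plus LoRA should leave plenty of headroom, which is why I suspect I am doing something wrong elsewhere.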

Thank you for reading!
