Bug when I run on a single GPU #1694

Open
@kailashg26

Description

Command: tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device
Output:

INFO:torchtune.utils._logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  checkpoint_files:
  - model-00001-of-00004.safetensors
  - model-00002-of-00004.safetensors
  - model-00003-of-00004.safetensors
  - model-00004-of-00004.safetensors
  model_type: LLAMA3
  output_dir: /tmp/Meta-Llama-3.1-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 64
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /tmp/lora_finetune_output
model:
  _component_: torchtune.models.llama3_1.lora_llama3_1_8b
  apply_lora_to_mlp: false
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  lora_dropout: 0.0
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /tmp/lora_finetune_output
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /tmp/lora_finetune_output/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 5
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
save_adapter_weights_only: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  max_seq_len: null
  path: /tmp/Meta-Llama-3.1-8B-Instruct/original/tokenizer.model

DEBUG:torchtune.utils._logging:Setting manual seed to local seed 3188944798. Local seed is seed + rank = 3188944798 + 0
Writing logs to /tmp/lora_finetune_output/log_1727379753.txt
Traceback (most recent call last):
  _File "/home/kailash/miniconda3/envs/llm/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 185, in _run_cmd
    self._run_single_device(args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/_cli/run.py", line 94, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 739, in <module>
    sys.exit(recipe_main())
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_parse.py", line 99, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 733, in recipe_main
    recipe.setup(cfg=cfg)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 215, in setup
    checkpoint_dict = self.load_checkpoint(cfg_checkpointer=cfg.checkpointer)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 148, in load_checkpoint
    self._checkpointer = config.instantiate(
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 106, in instantiate
    return _instantiate_node(OmegaConf.to_object(config), *args)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 31, in _instantiate_node
    return _create_component(_component_, args, kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 348, in __init__
    self._checkpoint_paths = self._validate_hf_checkpoint_files(checkpoint_files)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_checkpointer.py", line 389, in _validate_hf_checkpoint_files
    checkpoint_path = get_path(self._checkpoint_dir, f)
  File "/home/kailash/miniconda3/envs/llm/lib/python3.9/site-packages/torchtune/training/checkpointing/_utils.py", line 95, in get_path
    raise ValueError(f"No file with name: {filename} found in {input_dir}.")
ValueError: No file with name: model-00001-of-00004.safetensors found in /tmp/Meta-Llama-3.1-8B-Instruct.

Can anyone help me with this?
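For anyone hitting the same error: the checkpointer validates that every entry in checkpoint_files exists under checkpoint_dir before loading anything, so this ValueError means the four safetensors shards are not actually present in /tmp/Meta-Llama-3.1-8B-Instruct/. A quick sanity check, plus a re-download sketch (assuming the model simply wasn't downloaded to that directory, and that your Hugging Face token has access to the gated repo; the token placeholder is illustrative):

# List the checkpoint directory; the model-0000X-of-00004.safetensors shards should be here.
ls /tmp/Meta-Llama-3.1-8B-Instruct/

# If they are missing, re-download the model with torchtune's download command.
tune download meta-llama/Meta-Llama-3.1-8B-Instruct \
  --output-dir /tmp/Meta-Llama-3.1-8B-Instruct \
  --ignore-patterns "original/consolidated.00.pth" \
  --hf-token <YOUR_HF_TOKEN>

If the files are present but under a different directory, updating checkpointer.checkpoint_dir in the config to point at that directory should also resolve the error.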
