Error while full finetuning Llama 3.2 Vision #2930

@dongmuzeng

Description

I am running this with the nightlies and a custom `torchtune.datasets.custom_sft_dataset` (https://github.com/meta-llama/llama-cookbook/blob/ft-fw-2/src/finetune_pipeline/finetuning/custom_sft_dataset.py), and I get the error `padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'`.
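One way to check whether this is just a version skew between the installed recipe and `torchtune.data` is to print the signature of the collate function named in the config; a minimal sketch:

```python
# Sketch: print the signature of the collate function named in the config to
# see whether the installed torchtune build accepts a `cp_degree` keyword.
import inspect
from torchtune.data import padded_collate_tiled_images_and_mask

print(inspect.signature(padded_collate_tiled_images_and_mask))
# If `cp_degree` does not appear here, the installed torchtune.data predates
# the recipe code that forwards it to collate_fn.
```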


Logs

tune run --nproc_per_node 4 full_finetune_distributed --config finetune_config.yaml
Running with torchrun...
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] 
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 1
batch_size_val: 1
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00005'
  model_type: LLAMA3_VISION
  output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
  recipe_checkpoint: null
clip_grad_norm: 1.0
collate_fn: torchtune.data.padded_collate_tiled_images_and_mask
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[95%:]
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: true
epochs: 5
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/finetune/llama3_vision_finetune/logs
model:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
optimizer:
  _component_: torch.optim.AdamW
  betas:
  - 0.9
  - 0.999
  fused: false
  lr: 2.0e-05
  weight_decay: 0.01
optimizer_in_bwd: false
output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 20
save_every_n_steps: 20
seed: 42
shuffle: true
tensor_parallel_dim: 1
tensor_parallel_plan:
  _component_: torchtune.models.llama3_2.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  image_size: 560
  max_seq_len: null
  path: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model

Writing logs to /mnt/disks/data/finetune/llama3_vision_finetune/logs/log_1759253125.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Instantiating model and loading checkpoint took 5.75 secs
Memory stats after model init:
	GPU peak memory active: 5.89 GiB
	GPU peak memory alloc: 5.89 GiB
	GPU peak memory reserved: 5.95 GiB
Optimizer is initialized.
Loss is initialized.
No learning rate scheduler configured. Using constant learning rate.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
0|0:   0%|                                                                                                                                             | 0/226 [00:00<?, ?it/s][rank3]: Traceback (most recent call last):
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1169, in <module>
[rank3]:     sys.exit(recipe_main())
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]:     sys.exit(recipe_main(conf))
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1164, in recipe_main
[rank3]:     recipe.train()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 987, in train
[rank3]:     batch = next(dataloader_iter)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 450, in __next__
[rank3]:     return super().__next__()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
[rank3]:     data = self._next_data()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 491, in _next_data
[rank3]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank3]:     return self.collate_fn(data)
[rank3]: TypeError: padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'
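
A wrapper collate that swallows the extra keyword might sidestep the error, assuming the recipe simply forwards `cp_degree` to the configured `collate_fn`; the module and function names below are placeholders from my project, not part of torchtune:

```python
# finetune_pipeline/finetuning/collate_compat.py  (placeholder module)
# Sketch of a compatibility wrapper: accept and ignore `cp_degree`, then
# delegate everything else to the torchtune vision collate function.
from torchtune.data import padded_collate_tiled_images_and_mask


def padded_collate_tiled_images_and_mask_compat(batch, *, cp_degree=1, **kwargs):
    # `cp_degree` is dropped here because the installed collate function does
    # not take it; all other keyword arguments pass through unchanged.
    return padded_collate_tiled_images_and_mask(batch, **kwargs)
```

Pointing `collate_fn:` in the YAML at that wrapper instead of `torchtune.data.padded_collate_tiled_images_and_mask` would be a stopgap, but I would prefer a proper fix.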

Any idea how to fix this?
