Error while full finetuning Llama 3.2 Vision #2930

@dongmuzeng

Description

I am running this with the nightlies and a custom `torchtune.datasets.custom_sft_dataset` (https://github.com/meta-llama/llama-cookbook/blob/ft-fw-2/src/finetune_pipeline/finetuning/custom_sft_dataset.py), and I get the error `padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'`.
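One way to check whether this is just a version skew between the installed recipe and `torchtune.data` is to print the signature of the collate function named in the config; a minimal sketch:

```python
# Sketch: print the signature of the collate function named in the config to
# see whether the installed torchtune build accepts a `cp_degree` keyword.
import inspect
from torchtune.data import padded_collate_tiled_images_and_mask

print(inspect.signature(padded_collate_tiled_images_and_mask))
# If `cp_degree` does not appear here, the installed torchtune.data predates
# the recipe code that forwards it to collate_fn.
```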


Logs

tune run --nproc_per_node 4 full_finetune_distributed --config finetune_config.yaml
Running with torchrun...
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] 
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
Running FullFinetuneRecipeDistributed with resolved config:

batch_size: 1
batch_size_val: 1
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00005'
  model_type: LLAMA3_VISION
  output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
  recipe_checkpoint: null
clip_grad_norm: 1.0
collate_fn: torchtune.data.padded_collate_tiled_images_and_mask
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[95%:]
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: true
epochs: 5
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/finetune/llama3_vision_finetune/logs
model:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
optimizer:
  _component_: torch.optim.AdamW
  betas:
  - 0.9
  - 0.999
  fused: false
  lr: 2.0e-05
  weight_decay: 0.01
optimizer_in_bwd: false
output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 20
save_every_n_steps: 20
seed: 42
shuffle: true
tensor_parallel_dim: 1
tensor_parallel_plan:
  _component_: torchtune.models.llama3_2.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  image_size: 560
  max_seq_len: null
  path: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model

Writing logs to /mnt/disks/data/finetune/llama3_vision_finetune/logs/log_1759253125.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Instantiating model and loading checkpoint took 5.75 secs
Memory stats after model init:
	GPU peak memory active: 5.89 GiB
	GPU peak memory alloc: 5.89 GiB
	GPU peak memory reserved: 5.95 GiB
Optimizer is initialized.
Loss is initialized.
No learning rate scheduler configured. Using constant learning rate.
 Profiling disabled.
 Profiler config after instantiation: {'enabled': False}
0|0:   0%|                                                                                                                                             | 0/226 [00:00<?, ?it/s][rank3]: Traceback (most recent call last):
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1169, in <module>
[rank3]:     sys.exit(recipe_main())
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]:     sys.exit(recipe_main(conf))
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1164, in recipe_main
[rank3]:     recipe.train()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 987, in train
[rank3]:     batch = next(dataloader_iter)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 450, in __next__
[rank3]:     return super().__next__()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
[rank3]:     data = self._next_data()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 491, in _next_data
[rank3]:     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank3]:     return self.collate_fn(data)
[rank3]: TypeError: padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'
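
A wrapper collate that swallows the extra keyword might sidestep the error, assuming the recipe simply forwards `cp_degree` to the configured `collate_fn`; the module and function names below are placeholders from my project, not part of torchtune:

```python
# finetune_pipeline/finetuning/collate_compat.py  (placeholder module)
# Sketch of a compatibility wrapper: accept and ignore `cp_degree`, then
# delegate everything else to the torchtune vision collate function.
from torchtune.data import padded_collate_tiled_images_and_mask


def padded_collate_tiled_images_and_mask_compat(batch, *, cp_degree=1, **kwargs):
    # `cp_degree` is dropped here because the installed collate function does
    # not take it; all other keyword arguments pass through unchanged.
    return padded_collate_tiled_images_and_mask(batch, **kwargs)
```

Pointing `collate_fn:` in the YAML at that wrapper instead of `torchtune.data.padded_collate_tiled_images_and_mask` would be a stopgap, but I would prefer a proper fix.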

Any idea how to fix this?
