I am running this with the nightlies and a custom torchtune.datasets.custom_sft_dataset (https://github.com/meta-llama/llama-cookbook/blob/ft-fw-2/src/finetune_pipeline/finetuning/custom_sft_dataset.py), and I get a "padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'" error.
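For reference, a quick way to confirm the mismatch in the environment the run uses is to print the signature of the configured collate (small sanity-check snippet, nothing torchtune-specific beyond the import):

```python
# Print the parameters the installed collate actually accepts; the TypeError below
# indicates that cp_degree is not among them on this build.
import inspect
from torchtune.data import padded_collate_tiled_images_and_mask

print(inspect.signature(padded_collate_tiled_images_and_mask))
```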
Logs
tune run --nproc_per_node 4 full_finetune_distributed --config finetune_config.yaml
Running with torchrun...
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810]
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0930 17:25:21.127000 15690 site-packages/torch/distributed/run.py:810] *****************************************
Running FullFinetuneRecipeDistributed with resolved config:
batch_size: 1
batch_size_val: 1
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct
  checkpoint_files:
    filename_format: model-{}-of-{}.safetensors
    max_filename: '00005'
  model_type: LLAMA3_VISION
  output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
  recipe_checkpoint: null
clip_grad_norm: 1.0
collate_fn: torchtune.data.padded_collate_tiled_images_and_mask
compile: false
data_parallel_replicate_dim: 1
data_parallel_shard_dim: -1
dataset:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[:95%]
  train_on_input: false
dataset_val:
  _component_: torchtune.datasets.custom_sft_dataset
  dataset_path: /mnt/disks/data/finetune_data.jsonl
  source: json
  split: train[95%:]
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: true
epochs: 5
gradient_accumulation_steps: 1
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torch.nn.CrossEntropyLoss
  ignore_index: -100
  reduction: mean
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /mnt/disks/data/finetune/llama3_vision_finetune/logs
model:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_11b
optimizer:
  _component_: torch.optim.AdamW
  betas:
  - 0.9
  - 0.999
  fused: false
  lr: 2.0e-05
  weight_decay: 0.01
optimizer_in_bwd: false
output_dir: /mnt/disks/data/finetune/llama3_vision_finetune
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: false
resume_from_checkpoint: false
run_val_every_n_steps: 20
save_every_n_steps: 20
seed: 42
shuffle: true
tensor_parallel_dim: 1
tensor_parallel_plan:
  _component_: torchtune.models.llama3_2.decoder_only_tp_plan
tokenizer:
  _component_: torchtune.models.llama3_2_vision.llama3_2_vision_transform
  image_size: 560
  max_seq_len: null
  path: /mnt/disks/data/Llama-3.2-11B-Vision-Instruct/original/tokenizer.model
Writing logs to /mnt/disks/data/finetune/llama3_vision_finetune/logs/log_1759253125.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Instantiating model and loading checkpoint took 5.75 secs
Memory stats after model init:
GPU peak memory active: 5.89 GiB
GPU peak memory alloc: 5.89 GiB
GPU peak memory reserved: 5.95 GiB
Optimizer is initialized.
Loss is initialized.
No learning rate scheduler configured. Using constant learning rate.
Profiling disabled.
Profiler config after instantiation: {'enabled': False}
0|0: 0%| | 0/226 [00:00<?, ?it/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1169, in <module>
[rank3]:     sys.exit(recipe_main())
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]:     sys.exit(recipe_main(conf))
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 1164, in recipe_main
[rank3]:     recipe.train()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/recipes/full_finetune_distributed.py", line 987, in train
[rank3]:     batch = next(dataloader_iter)
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 450, in __next__
[rank3]:     return super().__next__()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
[rank3]:     data = self._next_data()
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torchdata/stateful_dataloader/stateful_dataloader.py", line 491, in _next_data
[rank3]:     data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank3]:   File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank3]:     return self.collate_fn(data)
[rank3]: TypeError: padded_collate_tiled_images_and_mask() got an unexpected keyword argument 'cp_degree'
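My guess at the mechanism (an illustration, not the actual recipe code): the nightly recipe appears to bind a cp_degree keyword onto whatever collate_fn is configured, e.g. via functools.partial, so the call fails as soon as the configured collate does not accept that argument:

```python
# Minimal illustration of the failure mode. stock_collate is a stand-in for
# padded_collate_tiled_images_and_mask; cp_degree=1 is a made-up value.
from functools import partial

def stock_collate(batch, padding_idx=0, ignore_idx=-100):
    return batch

collate = partial(stock_collate, cp_degree=1)
collate([{"tokens": [1, 2, 3]}])
# TypeError: stock_collate() got an unexpected keyword argument 'cp_degree'
```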
Any idea how to fix this?
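In the meantime, the workaround I am considering (untested sketch; the module name and wrapper function are placeholders, and it assumes the recipe resolves collate_fn as a dotted import path and simply forwards extra kwargs such as cp_degree) is a thin wrapper that drops any kwargs the installed collate does not accept:

```python
# collate_compat.py -- placeholder module; put it somewhere importable (e.g. on PYTHONPATH).
import inspect
from torchtune.data import padded_collate_tiled_images_and_mask

# Parameter names the installed collate actually accepts.
_ALLOWED = set(inspect.signature(padded_collate_tiled_images_and_mask).parameters)

def padded_collate_compat(batch, **kwargs):
    # Drop anything the installed collate does not know about (e.g. cp_degree on this build).
    kwargs = {k: v for k, v in kwargs.items() if k in _ALLOWED}
    return padded_collate_tiled_images_and_mask(batch, **kwargs)
```

and then pointing the config at it with collate_fn: collate_compat.padded_collate_compat (again assuming collate_fn is resolved like any other dotted component path).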