Hi, when I use `config/4B_full` I run into the following bug. Here is my config:
```yaml
batch_size: 12
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /zeno_data/models/Qwen3-4B
  checkpoint_files:
  - model-00001-of-00003.safetensors
  - model-00002-of-00003.safetensors
  - model-00003-of-00003.safetensors
  model_type: QWEN2
  output_dir: /zeno_data/torchtune/intent_rewrite
  recipe_checkpoint: null
clip_grad_norm: null
compile: true
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: messages
  conversation_style: openai
  data_files: intent_rewrite_1011_test.json
  packed: true
  source: json
  split: train
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 5
gradient_accumulation_steps: 8
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 200
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /zeno_data/torchtune/intent_rewrite/logs
model:
  _component_: torchtune.models.qwen3.qwen3_4b_instruct
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 5.0e-06
optimizer_in_bwd: false
output_dir: /zeno_data/torchtune/intent_rewrite
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /zeno_data/torchtune/intent_rewrite/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  max_seq_len: 3072
  merges_file: /zeno_data/models/Qwen3-4B/merges.txt
  path: /zeno_data/models/Qwen3-4B/vocab.json
```

The run then fails while packing the dataset:

```
Writing logs to /zeno_data/torchtune/intent_rewrite/logs/log_1760193224.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Compiling model layers with torch.compile. Expect a relatively slower first step.
Instantiating model and loading checkpoint took 2.95 secs
Memory stats after model init:
GPU peak memory active: 1.02 GiB
GPU peak memory alloc: 1.02 GiB
GPU peak memory reserved: 1.10 GiB
Optimizer is initialized.
Compiling loss with torch.compile...
Loss is initialized.
Packing dataset: 80%|██████████████████████████████████████████████████████████████▊ | 805/1000 [00:05<00:01, 156.02it/s]
[rank4]: Traceback (most recent call last):
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1169, in <module>
[rank4]: sys.exit(recipe_main())
[rank4]: File "/zeno_data/torchtune/torchtune/config/_parse.py", line 99, in wrapper
[rank4]: sys.exit(recipe_main(conf))
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1163, in recipe_main
[rank4]: recipe.setup(cfg=cfg)
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 492, in setup
[rank4]: self._lr_scheduler = self._setup_lr_scheduler(
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 535, in _setup_lr_scheduler
[rank4]: lr_scheduler = config.instantiate(
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 163, in instantiate
[rank4]: return _instantiate_node(
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 62, in _instantiate_node
[rank4]: return _create_component(component, args, kwargs)
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 24, in _create_component
[rank4]: return component(*args, **kwargs)
[rank4]: File "/zeno_data/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
[rank4]: return LambdaLR(optimizer, lr_lambda, last_epoch)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 318, in init
[rank4]: super().init(optimizer, last_epoch)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 140, in init
[rank4]: patch_track_step_called(self.optimizer)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 138, in patch_track_step_called
[rank4]: opt.step = wrap_step(opt.step) # type: ignore[method-assign]
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 127, in wrap_step
[rank4]: func = step_fn.__func__
[rank4]: AttributeError: 'function' object has no attribute '__func__'. Did you mean: '__doc__'?
```

When I set `compile: false`, training runs successfully. I want to compile the model to make training faster, so could you please look into this bug?
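
In case it helps with triage: from the traceback, `LambdaLR.__init__` fails inside `patch_track_step_called` because `opt.step` is a plain function at that point, and plain functions (unlike bound methods) have no `__func__`. Below is a minimal sketch that reproduces the same `AttributeError` for me outside of torchtune by replacing `opt.step` with a plain function by hand. This is only my guess at the failure mode; I have not verified that the compile path actually rewraps the optimizer step this way:

```python
import torch

# Guess at the failure mode: something (presumably the compile path) replaces
# optimizer.step with a plain function before the LR scheduler is built.
model = torch.nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=5.0e-06)

original_step = opt.step  # bound method -> has __func__

def plain_step(*args, **kwargs):  # plain function -> has no __func__
    return original_step(*args, **kwargs)

opt.step = plain_step

# LambdaLR.__init__ -> patch_track_step_called -> wrap_step -> step_fn.__func__
# raises: AttributeError: 'function' object has no attribute '__func__'
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: 1.0)
```

With `compile: false` the same config packs and trains normally, so the scheduler itself seems fine; the problem only shows up once `opt.step` is no longer a bound method.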