lr_scheduler bug

Hi, when I use config/4B_full, I encounter the following error:

> batch_size: 12
> checkpointer:
>   _component_: torchtune.training.FullModelHFCheckpointer
>   checkpoint_dir: /zeno_data/models/Qwen3-4B
>   checkpoint_files:
>   - model-00001-of-00003.safetensors
>   - model-00002-of-00003.safetensors
>   - model-00003-of-00003.safetensors
>   model_type: QWEN2
>   output_dir: /zeno_data/torchtune/intent_rewrite
>   recipe_checkpoint: null
> clip_grad_norm: null
> compile: true
> dataset:
>   _component_: torchtune.datasets.chat_dataset
>   conversation_column: messages
>   conversation_style: openai
>   data_files: intent_rewrite_1011_test.json
>   packed: true
>   source: json
>   split: train
>   train_on_input: false
> device: cuda
> dtype: bf16
> enable_activation_checkpointing: true
> enable_activation_offloading: false
> epochs: 5
> gradient_accumulation_steps: 8
> log_every_n_steps: 1
> log_level: INFO
> log_peak_memory_stats: true
> loss:
>   _component_: torchtune.modules.loss.LinearCrossEntropyLoss
> lr_scheduler:
>   _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
>   num_warmup_steps: 200
> max_steps_per_epoch: null
> metric_logger:
>   _component_: torchtune.training.metric_logging.DiskLogger
>   log_dir: /zeno_data/torchtune/intent_rewrite/logs
> model:
>   _component_: torchtune.models.qwen3.qwen3_4b_instruct
> optimizer:
>   _component_: torch.optim.AdamW
>   fused: true
>   lr: 5.0e-06
> optimizer_in_bwd: false
> output_dir: /zeno_data/torchtune/intent_rewrite
> profiler:
>   _component_: torchtune.training.setup_torch_profiler
>   active_steps: 2
>   cpu: true
>   cuda: true
>   enabled: false
>   num_cycles: 1
>   output_dir: /zeno_data/torchtune/intent_rewrite/profiling_outputs
>   profile_memory: false
>   record_shapes: true
>   wait_steps: 5
>   warmup_steps: 3
>   with_flops: false
>   with_stack: false
> resume_from_checkpoint: false
> seed: null
> shuffle: true
> tokenizer:
>   _component_: torchtune.models.qwen3.qwen3_tokenizer
>   max_seq_len: 3072
>   merges_file: /zeno_data/models/Qwen3-4B/merges.txt
>   path: /zeno_data/models/Qwen3-4B/vocab.json
> 
> Writing logs to /zeno_data/torchtune/intent_rewrite/logs/log_1760193224.txt
> Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
> Compiling model layers with torch.compile. Expect a relatively slower first step.
> Instantiating model and loading checkpoint took 2.95 secs
> Memory stats after model init:
>         GPU peak memory active: 1.02 GiB
>         GPU peak memory alloc: 1.02 GiB
>         GPU peak memory reserved: 1.10 GiB
> Optimizer is initialized.
> Compiling loss with torch.compile...
> Loss is initialized.
> Packing dataset:  80%|██████████████████████████████████████████████████████████████▊               | 805/1000 [00:05<00:01, 156.02it/s]
> [rank4]: Traceback (most recent call last):
> [rank4]:   File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1169, in <module>
> [rank4]:     sys.exit(recipe_main())
> [rank4]:   File "/zeno_data/torchtune/torchtune/config/_parse.py", line 99, in wrapper
> [rank4]:     sys.exit(recipe_main(conf))
> [rank4]:   File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1163, in recipe_main
> [rank4]:     recipe.setup(cfg=cfg)
> [rank4]:   File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 492, in setup
> [rank4]:     self._lr_scheduler = self._setup_lr_scheduler(
> [rank4]:   File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 535, in _setup_lr_scheduler
> [rank4]:     lr_scheduler = config.instantiate(
> [rank4]:   File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 163, in instantiate
> [rank4]:     return _instantiate_node(
> [rank4]:   File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 62, in _instantiate_node
> [rank4]:     return _create_component(_component_, args, kwargs)
> [rank4]:   File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 24, in _create_component
> [rank4]:     return _component_(*args, **kwargs)
> [rank4]:   File "/zeno_data/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
> [rank4]:     return LambdaLR(optimizer, lr_lambda, last_epoch)
> [rank4]:   File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 318, in __init__
> [rank4]:     super().__init__(optimizer, last_epoch)
> [rank4]:   File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 140, in __init__
> [rank4]:     patch_track_step_called(self.optimizer)
> [rank4]:   File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 138, in patch_track_step_called
> [rank4]:     opt.step = wrap_step(opt.step)  # type: ignore[method-assign]
> [rank4]:   File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 127, in wrap_step
> [rank4]:     func = step_fn.__func__
> [rank4]: AttributeError: 'function' object has no attribute '__func__'. Did you mean: '__doc__'?


When I set compile to false, training runs successfully. However, I want to compile the model to speed up training, so could you please look into this bug?
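
For reference, here is a minimal sketch of what the last frames of the traceback seem to show: `wrap_step` reads `step_fn.__func__`, which only exists on bound methods, and the error message says `opt.step` was a plain function by the time the scheduler was built. My assumption (not verified against the recipe code) is that with compile enabled something replaces `optimizer.step` with a wrapped function before `_setup_lr_scheduler` runs. `ToyOptimizer`, `compiled_like_step`, and the local `wrap_step` below are hypothetical stand-ins written in plain Python; they are not torch or torchtune APIs.

```python
import types


class ToyOptimizer:
    """Hypothetical stand-in for an optimizer; not a torch class."""

    def step(self):
        return "stepped"


def wrap_step(step_fn):
    # Same attribute access as the failing frame in the traceback:
    # it assumes step_fn is a bound method.
    return step_fn.__func__


opt = ToyOptimizer()

# A normal bound method exposes __func__, so this works.
bound_ok = wrap_step(opt.step) is ToyOptimizer.step

# Assumption: the step was swapped for a plain function wrapper
# (which is what I suspect happens when compile is on).
original_step = opt.step


def compiled_like_step(*args, **kwargs):
    return original_step(*args, **kwargs)


opt.step = compiled_like_step  # instance attribute, so no method binding

try:
    wrap_step(opt.step)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# One possible workaround idea (untested in the real recipe): rebind the
# wrapper as a method so that __func__ exists again.
opt.step = types.MethodType(
    lambda self, *a, **k: compiled_like_step(*a, **k), opt
)
rebound_ok = hasattr(opt.step, "__func__")
result = opt.step()
```

If this diagnosis is right, building the scheduler before the step is replaced, or keeping the replaced step a bound method as sketched above, might avoid the error. I have not tested either in the actual recipe.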

lr_scheduler bug #2931
