lr_scheduler bug #2931

@kingbackyang

Hi, when I use config/4B_full, I encounter the following bug:

batch_size: 12
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /zeno_data/models/Qwen3-4B
  checkpoint_files:
  - model-00001-of-00003.safetensors
  - model-00002-of-00003.safetensors
  - model-00003-of-00003.safetensors
  model_type: QWEN2
  output_dir: /zeno_data/torchtune/intent_rewrite
  recipe_checkpoint: null
clip_grad_norm: null
compile: true
dataset:
  _component_: torchtune.datasets.chat_dataset
  conversation_column: messages
  conversation_style: openai
  data_files: intent_rewrite_1011_test.json
  packed: true
  source: json
  split: train
  train_on_input: false
device: cuda
dtype: bf16
enable_activation_checkpointing: true
enable_activation_offloading: false
epochs: 5
gradient_accumulation_steps: 8
log_every_n_steps: 1
log_level: INFO
log_peak_memory_stats: true
loss:
  _component_: torchtune.modules.loss.LinearCrossEntropyLoss
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 200
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: /zeno_data/torchtune/intent_rewrite/logs
model:
  _component_: torchtune.models.qwen3.qwen3_4b_instruct
optimizer:
  _component_: torch.optim.AdamW
  fused: true
  lr: 5.0e-06
optimizer_in_bwd: false
output_dir: /zeno_data/torchtune/intent_rewrite
profiler:
  _component_: torchtune.training.setup_torch_profiler
  active_steps: 2
  cpu: true
  cuda: true
  enabled: false
  num_cycles: 1
  output_dir: /zeno_data/torchtune/intent_rewrite/profiling_outputs
  profile_memory: false
  record_shapes: true
  wait_steps: 5
  warmup_steps: 3
  with_flops: false
  with_stack: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.qwen3.qwen3_tokenizer
  max_seq_len: 3072
  merges_file: /zeno_data/models/Qwen3-4B/merges.txt
  path: /zeno_data/models/Qwen3-4B/vocab.json

Writing logs to /zeno_data/torchtune/intent_rewrite/logs/log_1760193224.txt
Distributed training is enabled. Instantiating model and loading checkpoint on Rank 0 ...
Compiling model layers with torch.compile. Expect a relatively slower first step.
Instantiating model and loading checkpoint took 2.95 secs
Memory stats after model init:
GPU peak memory active: 1.02 GiB
GPU peak memory alloc: 1.02 GiB
GPU peak memory reserved: 1.10 GiB
Optimizer is initialized.
Compiling loss with torch.compile...
Loss is initialized.
Packing dataset: 80%|██████████████████████████████████████████████████████████████▊ | 805/1000 [00:05<00:01, 156.02it/s]
[rank4]: Traceback (most recent call last):
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1169, in <module>
[rank4]: sys.exit(recipe_main())
[rank4]: File "/zeno_data/torchtune/torchtune/config/_parse.py", line 99, in wrapper
[rank4]: sys.exit(recipe_main(conf))
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 1163, in recipe_main
[rank4]: recipe.setup(cfg=cfg)
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 492, in setup
[rank4]: self._lr_scheduler = self._setup_lr_scheduler(
[rank4]: File "/zeno_data/torchtune/recipes/full_finetune_distributed.py", line 535, in _setup_lr_scheduler
[rank4]: lr_scheduler = config.instantiate(
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 163, in instantiate
[rank4]: return _instantiate_node(
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 62, in _instantiate_node
[rank4]: return _create_component(component, args, kwargs)
[rank4]: File "/zeno_data/torchtune/torchtune/config/_instantiate.py", line 24, in _create_component
[rank4]: return component(*args, **kwargs)
[rank4]: File "/zeno_data/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
[rank4]: return LambdaLR(optimizer, lr_lambda, last_epoch)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 318, in __init__
[rank4]: super().__init__(optimizer, last_epoch)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 140, in __init__
[rank4]: patch_track_step_called(self.optimizer)
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 138, in patch_track_step_called
[rank4]: opt.step = wrap_step(opt.step) # type: ignore[method-assign]
[rank4]: File "/root/miniforge3/envs/tune/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 127, in wrap_step
[rank4]: func = step_fn.__func__
[rank4]: AttributeError: 'function' object has no attribute '__func__'. Did you mean: '__doc__'?

When I set compile to false, it runs successfully. I want to compile the model to make training faster, so could you please resolve this bug?
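
For reference, here is a minimal sketch in plain Python (no torch) of the kind of failure the traceback points at. The failing line in `wrap_step` reads an attribute that only bound methods carry (`__func__` in Python). The sketch assumes, without confirmation from the recipe code, that with compile enabled the optimizer's `step` has already been replaced by a plain function by the time the scheduler is built. `ToyOptimizer`, `wrap_like_scheduler`, and `replace_with_plain_function` are hypothetical names used only for illustration.

```python
class ToyOptimizer:
    """Stand-in for an optimizer; this is not a torch class."""

    def step(self):
        return "stepped"


def wrap_like_scheduler(step_fn):
    # Mimics the attribute access seen in the traceback: a bound method
    # exposes __func__, which points back at the underlying function.
    return step_fn.__func__


def replace_with_plain_function(bound_step):
    # Hypothetical stand-in for any wrapper (for example a compiler)
    # that hands back an ordinary function instead of a bound method.
    def wrapped(*args, **kwargs):
        return bound_step(*args, **kwargs)

    return wrapped


opt = ToyOptimizer()

# Case 1: untouched optimizer, step is a bound method, the lookup works.
bound_ok = wrap_like_scheduler(opt.step) is ToyOptimizer.step

# Case 2: step replaced by a plain function stored on the instance.
# Instance attributes are not bound, so __func__ no longer exists.
opt.step = replace_with_plain_function(opt.step)
try:
    wrap_like_scheduler(opt.step)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(bound_ok)
print(error_message)
```

If this assumption holds, it would also be consistent with the run succeeding when compile is false, since `step` would then stay a regular bound method.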
