System Info
- transformers version: 4.57.1
 - Platform: macOS-26.0.1-arm64-arm-64bit
 - Python version: 3.12.0
 - Huggingface_hub version: 0.35.3
 - Safetensors version: 0.5.3
 - Accelerate version: 1.7.0
 - Accelerate config: not found
 - DeepSpeed version: not installed
 - PyTorch version (accelerator?): 2.6.0 (NA)
 - Tensorflow version (GPU?): not installed (NA)
 - Flax version (CPU?/GPU?/TPU?): not installed (NA)
 - Jax version: not installed
 - JaxLib version: not installed
 - Using distributed or parallel set-up in script?: No
 
Who can help?
No response
Information
- The official example scripts
 - My own modified scripts
 
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
 - My own task or dataset (give details below)
 
Reproduction
Run the following code:
import torch
from datasets import Dataset
from torch import nn
from transformers import Trainer, TrainingArguments
# Minimal model whose forward returns a dict with a "loss" key
class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)
    def forward(self, a, return_loss=True):
        output = self.linear(a)
        return {"loss": output.sum()}
data = torch.tensor([[i, i] for i in range(10)], dtype=torch.float32)  # [[0., 0.], [1., 1.], [2., 2.], ...]
dataset = Dataset.from_dict({"a": data}).to_iterable_dataset()  # finite iterable dataset
args = TrainingArguments(output_dir=".", per_device_train_batch_size=1, max_steps=20, logging_steps=1)
trainer = Trainer(model=MyModule(), args=args, train_dataset=dataset)
trainer.train()

{'loss': 0.9867, 'grad_norm': 1.4142135381698608, 'learning_rate': 5e-05, 'epoch': 0.05}
{'loss': 1.3851, 'grad_norm': 2.4494898319244385, 'learning_rate': 4.75e-05, 'epoch': 0.1}
{'loss': 1.7833, 'grad_norm': 4.242640495300293, 'learning_rate': 4.5e-05, 'epoch': 0.15}
{'loss': 2.1812, 'grad_norm': 6.164413928985596, 'learning_rate': 4.25e-05, 'epoch': 0.2}
{'loss': 2.5788, 'grad_norm': 8.124038696289062, 'learning_rate': 4e-05, 'epoch': 0.25}
{'loss': 2.9761, 'grad_norm': 10.099504470825195, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.3}
{'loss': 3.3731, 'grad_norm': 12.083045959472656, 'learning_rate': 3.5e-05, 'epoch': 0.35}
{'loss': 3.7699, 'grad_norm': 14.071247100830078, 'learning_rate': 3.2500000000000004e-05, 'epoch': 0.4}
{'loss': 4.1665, 'grad_norm': 16.0623779296875, 'learning_rate': 3e-05, 'epoch': 0.45}
{'loss': 4.563, 'grad_norm': 18.055469512939453, 'learning_rate': 2.7500000000000004e-05, 'epoch': 0.5}
{'loss': 0.9861, 'grad_norm': 1.4142135381698608, 'learning_rate': 2.5e-05, 'epoch': 1.05}
{'loss': 1.3833, 'grad_norm': 2.4494898319244385, 'learning_rate': 2.25e-05, 'epoch': 1.1}
{'loss': 1.7803, 'grad_norm': 4.242640495300293, 'learning_rate': 2e-05, 'epoch': 1.15}
{'loss': 2.1772, 'grad_norm': 6.164413928985596, 'learning_rate': 1.75e-05, 'epoch': 1.2}
{'loss': 2.574, 'grad_norm': 8.124038696289062, 'learning_rate': 1.5e-05, 'epoch': 1.25}
{'loss': 2.9707, 'grad_norm': 10.099504470825195, 'learning_rate': 1.25e-05, 'epoch': 1.3}
{'loss': 3.3673, 'grad_norm': 12.083045959472656, 'learning_rate': 1e-05, 'epoch': 1.35}
{'loss': 3.764, 'grad_norm': 14.071247100830078, 'learning_rate': 7.5e-06, 'epoch': 1.4}
{'loss': 4.1606, 'grad_norm': 16.0623779296875, 'learning_rate': 5e-06, 'epoch': 1.45}
{'loss': 4.5572, 'grad_norm': 18.055469512939453, 'learning_rate': 2.5e-06, 'epoch': 1.5}
{'train_runtime': 0.2074, 'train_samples_per_second': 96.438, 'train_steps_per_second': 96.438, 'train_loss': 2.774213859438896, 'epoch': 1.5}
In my understanding, epoch is computed in transformers/src/transformers/trainer.py (line 2555 at commit 1f0b490):

    self.state.epoch = epoch + (step + 1) / steps_in_epoch

and steps_in_epoch is initialised from args.max_steps in transformers/src/transformers/trainer.py (line 2402 at commit 1f0b490):

    else args.max_steps * args.gradient_accumulation_steps

This branch is taken when the train dataloader has no __len__, like the example above. As a result, each step advances the epoch by 1 / max_steps (0.05 here) rather than by 1 / (actual steps per epoch) (0.1 here), which is why the logged epoch only reaches 0.5 after the first full pass over the dataset.
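To make the mismatch concrete, here is a small sketch of that arithmetic (a toy illustration with my own variable names, not the Trainer's internals):

# Illustration of the epoch arithmetic, assuming steps_in_epoch falls back to
# max_steps * gradient_accumulation_steps when the dataloader has no __len__.
max_steps = 20
gradient_accumulation_steps = 1
samples_per_epoch = 10   # the iterable dataset above yields 10 examples
batch_size = 1

steps_in_epoch = max_steps * gradient_accumulation_steps   # 20 (current behaviour)
actual_steps_in_epoch = samples_per_epoch // batch_size    # 10 (what the data provides)

for step in range(3):
    logged = (step + 1) / steps_in_epoch           # 0.05, 0.10, 0.15, ...
    expected = (step + 1) / actual_steps_in_epoch  # 0.10, 0.20, 0.30, ...
    print(f"step {step + 1}: logged epoch {logged:.2f}, expected epoch {expected:.2f}")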
Expected behavior
{'loss': 0.9867, 'grad_norm': 1.4142135381698608, 'learning_rate': 5e-05, 'epoch': 0.1}
{'loss': 1.3851, 'grad_norm': 2.4494898319244385, 'learning_rate': 4.75e-05, 'epoch': 0.2}
{'loss': 1.7833, 'grad_norm': 4.242640495300293, 'learning_rate': 4.5e-05, 'epoch': 0.3}
{'loss': 2.1812, 'grad_norm': 6.164413928985596, 'learning_rate': 4.25e-05, 'epoch': 0.4}
{'loss': 2.5788, 'grad_norm': 8.124038696289062, 'learning_rate': 4e-05, 'epoch': 0.5}
{'loss': 2.9761, 'grad_norm': 10.099504470825195, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.6}
{'loss': 3.3731, 'grad_norm': 12.083045959472656, 'learning_rate': 3.5e-05, 'epoch': 0.7}
{'loss': 3.7699, 'grad_norm': 14.071247100830078, 'learning_rate': 3.2500000000000004e-05, 'epoch': 0.8}
{'loss': 4.1665, 'grad_norm': 16.0623779296875, 'learning_rate': 3e-05, 'epoch': 0.9}
{'loss': 4.563, 'grad_norm': 18.055469512939453, 'learning_rate': 2.7500000000000004e-05, 'epoch': 1.0}
{'loss': 0.9861, 'grad_norm': 1.4142135381698608, 'learning_rate': 2.5e-05, 'epoch': 1.1}
{'loss': 1.3833, 'grad_norm': 2.4494898319244385, 'learning_rate': 2.25e-05, 'epoch': 1.2}
{'loss': 1.7803, 'grad_norm': 4.242640495300293, 'learning_rate': 2e-05, 'epoch': 1.3}
{'loss': 2.1772, 'grad_norm': 6.164413928985596, 'learning_rate': 1.75e-05, 'epoch': 1.4}
{'loss': 2.574, 'grad_norm': 8.124038696289062, 'learning_rate': 1.5e-05, 'epoch': 1.5}
{'loss': 2.9707, 'grad_norm': 10.099504470825195, 'learning_rate': 1.25e-05, 'epoch': 1.6}
{'loss': 3.3673, 'grad_norm': 12.083045959472656, 'learning_rate': 1e-05, 'epoch': 1.7}
{'loss': 3.764, 'grad_norm': 14.071247100830078, 'learning_rate': 7.5e-06, 'epoch': 1.8}
{'loss': 4.1606, 'grad_norm': 16.0623779296875, 'learning_rate': 5e-06, 'epoch': 1.9}
{'loss': 4.5572, 'grad_norm': 18.055469512939453, 'learning_rate': 2.5e-06, 'epoch': 2.0}
{'train_runtime': 0.2074, 'train_samples_per_second': 96.438, 'train_steps_per_second': 96.438, 'train_loss': 2.774213859438896, 'epoch': 2.0}