epoch in the log message uses a wrong denominator under some conditions #41913

@nzw0301

Description


System Info

  • transformers version: 4.57.1
  • Platform: macOS-26.0.1-arm64-arm-64bit
  • Python version: 3.12.0
  • Huggingface_hub version: 0.35.3
  • Safetensors version: 0.5.3
  • Accelerate version: 1.7.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.6.0 (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the following code:

import torch
from datasets import Dataset
from torch import nn
from transformers import Trainer, TrainingArguments

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 2)

    def forward(self, a, return_loss=True):
        output = self.linear(a)
        return {"loss": output.sum()}

data = torch.tensor([[i, i] for i in range(10)], dtype=torch.float32)  # [[0., 0.], [1., 1.], [2., 2.], ...]
dataset = Dataset.from_dict({"a": data}).to_iterable_dataset()  # finite iterable dataset, no __len__

# 10 samples with batch size 1 -> one epoch is 10 steps, so max_steps=20 is exactly 2 epochs
args = TrainingArguments(output_dir=".", per_device_train_batch_size=1, max_steps=20, logging_steps=1)
trainer = Trainer(model=MyModule(), args=args, train_dataset=dataset)
trainer.train()

Observed logs (the epoch value advances by 0.05 per step and ends at 1.5 instead of 2.0):

{'loss': 0.9867, 'grad_norm': 1.4142135381698608, 'learning_rate': 5e-05, 'epoch': 0.05}
{'loss': 1.3851, 'grad_norm': 2.4494898319244385, 'learning_rate': 4.75e-05, 'epoch': 0.1}
{'loss': 1.7833, 'grad_norm': 4.242640495300293, 'learning_rate': 4.5e-05, 'epoch': 0.15}
{'loss': 2.1812, 'grad_norm': 6.164413928985596, 'learning_rate': 4.25e-05, 'epoch': 0.2}
{'loss': 2.5788, 'grad_norm': 8.124038696289062, 'learning_rate': 4e-05, 'epoch': 0.25}
{'loss': 2.9761, 'grad_norm': 10.099504470825195, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.3}
{'loss': 3.3731, 'grad_norm': 12.083045959472656, 'learning_rate': 3.5e-05, 'epoch': 0.35}
{'loss': 3.7699, 'grad_norm': 14.071247100830078, 'learning_rate': 3.2500000000000004e-05, 'epoch': 0.4}
{'loss': 4.1665, 'grad_norm': 16.0623779296875, 'learning_rate': 3e-05, 'epoch': 0.45}
{'loss': 4.563, 'grad_norm': 18.055469512939453, 'learning_rate': 2.7500000000000004e-05, 'epoch': 0.5}
{'loss': 0.9861, 'grad_norm': 1.4142135381698608, 'learning_rate': 2.5e-05, 'epoch': 1.05}
{'loss': 1.3833, 'grad_norm': 2.4494898319244385, 'learning_rate': 2.25e-05, 'epoch': 1.1}
{'loss': 1.7803, 'grad_norm': 4.242640495300293, 'learning_rate': 2e-05, 'epoch': 1.15}
{'loss': 2.1772, 'grad_norm': 6.164413928985596, 'learning_rate': 1.75e-05, 'epoch': 1.2}
{'loss': 2.574, 'grad_norm': 8.124038696289062, 'learning_rate': 1.5e-05, 'epoch': 1.25}
{'loss': 2.9707, 'grad_norm': 10.099504470825195, 'learning_rate': 1.25e-05, 'epoch': 1.3}
{'loss': 3.3673, 'grad_norm': 12.083045959472656, 'learning_rate': 1e-05, 'epoch': 1.35}
{'loss': 3.764, 'grad_norm': 14.071247100830078, 'learning_rate': 7.5e-06, 'epoch': 1.4}
{'loss': 4.1606, 'grad_norm': 16.0623779296875, 'learning_rate': 5e-06, 'epoch': 1.45}
{'loss': 4.5572, 'grad_norm': 18.055469512939453, 'learning_rate': 2.5e-06, 'epoch': 1.5}
{'train_runtime': 0.2074, 'train_samples_per_second': 96.438, 'train_steps_per_second': 96.438, 'train_loss': 2.774213859438896, 'epoch': 1.5}

In my understanding, the epoch is computed in Trainer's inner training loop as

self.state.epoch = epoch + (step + 1) / steps_in_epoch

and when the dataset has no __len__, as in the example above, the denominator steps_in_epoch is initialised from args.max_steps:

else args.max_steps * args.gradient_accumulation_steps

so the per-epoch fraction is divided by the total number of training steps rather than by the number of steps in a single epoch.
Expected behavior

With 10 samples and per_device_train_batch_size=1, one epoch is 10 steps, so the epoch should advance by 0.1 per step and reach 2.0 after 20 steps:

{'loss': 0.9867, 'grad_norm': 1.4142135381698608, 'learning_rate': 5e-05, 'epoch': 0.1}
{'loss': 1.3851, 'grad_norm': 2.4494898319244385, 'learning_rate': 4.75e-05, 'epoch': 0.2}
{'loss': 1.7833, 'grad_norm': 4.242640495300293, 'learning_rate': 4.5e-05, 'epoch': 0.3}
{'loss': 2.1812, 'grad_norm': 6.164413928985596, 'learning_rate': 4.25e-05, 'epoch': 0.4}
{'loss': 2.5788, 'grad_norm': 8.124038696289062, 'learning_rate': 4e-05, 'epoch': 0.5}
{'loss': 2.9761, 'grad_norm': 10.099504470825195, 'learning_rate': 3.7500000000000003e-05, 'epoch': 0.6}
{'loss': 3.3731, 'grad_norm': 12.083045959472656, 'learning_rate': 3.5e-05, 'epoch': 0.7}
{'loss': 3.7699, 'grad_norm': 14.071247100830078, 'learning_rate': 3.2500000000000004e-05, 'epoch': 0.8}
{'loss': 4.1665, 'grad_norm': 16.0623779296875, 'learning_rate': 3e-05, 'epoch': 0.9}
{'loss': 4.563, 'grad_norm': 18.055469512939453, 'learning_rate': 2.7500000000000004e-05, 'epoch': 1.0}
{'loss': 0.9861, 'grad_norm': 1.4142135381698608, 'learning_rate': 2.5e-05, 'epoch': 1.1}
{'loss': 1.3833, 'grad_norm': 2.4494898319244385, 'learning_rate': 2.25e-05, 'epoch': 1.2}
{'loss': 1.7803, 'grad_norm': 4.242640495300293, 'learning_rate': 2e-05, 'epoch': 1.3}
{'loss': 2.1772, 'grad_norm': 6.164413928985596, 'learning_rate': 1.75e-05, 'epoch': 1.4}
{'loss': 2.574, 'grad_norm': 8.124038696289062, 'learning_rate': 1.5e-05, 'epoch': 1.5}
{'loss': 2.9707, 'grad_norm': 10.099504470825195, 'learning_rate': 1.25e-05, 'epoch': 1.6}
{'loss': 3.3673, 'grad_norm': 12.083045959472656, 'learning_rate': 1e-05, 'epoch': 1.7}
{'loss': 3.764, 'grad_norm': 14.071247100830078, 'learning_rate': 7.5e-06, 'epoch': 1.8}
{'loss': 4.1606, 'grad_norm': 16.0623779296875, 'learning_rate': 5e-06, 'epoch': 1.9}
{'loss': 4.5572, 'grad_norm': 18.055469512939453, 'learning_rate': 2.5e-06, 'epoch': 2.0}
{'train_runtime': 0.2074, 'train_samples_per_second': 96.438, 'train_steps_per_second': 96.438, 'train_loss': 2.774213859438896, 'epoch': 2.0}
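For what it is worth, the wrong denominator seems specific to the no-__len__ path: if the same data is passed as a map-style Dataset (which keeps __len__), steps_in_epoch should be taken from the dataloader length rather than from args.max_steps. A minimal workaround sketch, reusing the definitions from the reproduction above (not an official recommendation, just what the quoted code suggests):

# Same setup as the reproduction, but without .to_iterable_dataset(),
# so the dataset keeps __len__ and one epoch is known to be 10 steps.
sized_dataset = Dataset.from_dict({"a": data})
trainer = Trainer(model=MyModule(), args=args, train_dataset=sized_dataset)
trainer.train()  # epoch should now advance by 0.1 per step, ending at 2.0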
