Loss metrics dramatically change after resuming from checkpoint

When I resume training from a checkpoint, the loss consistently diverges from the original run instead of continuing along the same curve.

Green is the original run; red and blue are runs resumed from different checkpoints.

On 16 GPUs.
![Image](https://github.com/user-attachments/assets/03987593-35da-4944-9b57-85d6e362f5fd)

![Image](https://github.com/user-attachments/assets/e29dd0c0-7db4-4883-96bb-0ec494953aab)

On 32 GPUs.
![Image](https://github.com/user-attachments/assets/a83c51bb-1b9a-467e-bd6f-e19fad7a78b1)

All runs used the 8b config with these override flags:
```
--metrics.enable_wandb --training.compile --training.seq_len 8192 \
--training.batch_size 1 --training.tensor_parallel_degree 1 \
--training.data_parallel_shard_degree -1 --training.data_parallel_replicate_degree 1 \
--training.steps 5000 --training.warmup_steps 200 --activation_checkpoint.mode selective \
--checkpoint.interval 500 --training.seed 42
```
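
To put a number on the gap visible in the plots, the loss at steps shared by the original and a resumed run could be compared directly. This is only a minimal sketch, assuming the per-step losses have been exported (for example from wandb) into plain dicts keyed by step. The helper `max_loss_gap` and the sample values are hypothetical and are not taken from the runs above.

```python
def max_loss_gap(original, resumed):
    """Return (step, gap) for the largest absolute loss difference on shared steps."""
    shared = sorted(set(original) & set(resumed))
    if not shared:
        raise ValueError("runs share no steps")
    # Pick the shared step where the two curves disagree the most.
    worst = max(shared, key=lambda s: abs(original[s] - resumed[s]))
    return worst, abs(original[worst] - resumed[worst])


# Made-up numbers for illustration only.
original = {500: 4.0, 501: 3.9, 502: 3.8}
resumed = {501: 4.5, 502: 4.1}  # a resume from the step-500 checkpoint

step, gap = max_loss_gap(original, resumed)
print(f"largest gap {gap:.3f} at step {step}")
```

With a fixed seed, a correct resume would be expected to keep this gap small, so a large value right after the checkpoint step would point at the resume path rather than at ordinary run-to-run noise.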

Loss metrics dramatically change after resuming from checkpoint #809