Loss metrics dramatically change after resuming from checkpoint #809

Open
@darkmirage

Description

I am consistently seeing the loss change dramatically, for the worse, when resuming from a checkpoint.

Green is the original run; red and blue are resumes from different checkpoints.

On 16 GPUs: [two images: loss curves]

On 32 GPUs: [image: loss curves]
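To put a number on the jump visible in the curves, one option is to compare the logged loss of the original run and a resumed run at the first step both of them report. The sketch below is a rough illustration only: `resume_gap` is a hypothetical helper, and the loss values in the example are made up, not taken from these runs.

```python
def resume_gap(original, resumed):
    """Compare two runs at the first step that both of them logged.

    Each argument is a list of (step, loss) pairs. Returns a tuple
    (step, original_loss, resumed_loss, difference), or None if the
    runs share no logged step.
    """
    original_by_step = dict(original)
    for step, loss in sorted(resumed):
        if step in original_by_step:
            base = original_by_step[step]
            return step, base, loss, loss - base
    return None


# Hypothetical values for illustration only (not from the runs in this issue).
original_run = [(500, 4.0), (510, 3.5), (520, 3.25)]
resumed_run = [(510, 5.0), (520, 4.75)]

print(resume_gap(original_run, resumed_run))
```

A positive difference at the first shared step means the resumed run starts out worse than the original run was at the same step.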

Runs were done on the 8b config with these overriding flags:

--metrics.enable_wandb --training.compile --training.seq_len 8192 \
--training.batch_size 1 --training.tensor_parallel_degree 1 \
--training.data_parallel_shard_degree -1 --training.data_parallel_replicate_degree 1 \
--training.steps 5000 --training.warmup_steps 200 --activation_checkpoint.mode selective \
--checkpoint.interval 500 --training.seed 42
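For reference, here is a small sketch of what these flags imply for tokens processed per optimizer step on the two cluster sizes. It rests on two assumptions that are not stated above: that `--training.batch_size` is the per-rank (local) batch size, and that a `data_parallel_shard_degree` of -1 means "use every rank not claimed by the other parallelism degrees". The function name `tokens_per_step` is hypothetical.

```python
def tokens_per_step(num_gpus, batch_size=1, seq_len=8192,
                    tp_degree=1, dp_replicate=1, dp_shard=-1):
    """Estimate the tokens consumed per training step across all ranks."""
    # Assumption: -1 resolves to all ranks left over after the other degrees.
    if dp_shard == -1:
        dp_shard = num_gpus // (tp_degree * dp_replicate)
    # Assumption: batch_size is the local batch size on each data-parallel rank.
    data_parallel_ranks = dp_shard * dp_replicate
    return batch_size * data_parallel_ranks * seq_len


for num_gpus in (16, 32):
    print(num_gpus, tokens_per_step(num_gpus))
# 16 131072
# 32 262144
```

Under these assumptions the 32-GPU runs see twice as many tokens per step as the 16-GPU runs, so the two sets of curves are not directly comparable step for step.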

Labels: bug
