Loss metrics dramatically change after resuming from checkpoint #809

Closed
@darkmirage

I am consistently getting bad loss values when resuming from a checkpoint.

Green is the original run; red and blue are runs resumed from different checkpoints.

On 16 GPUs.
Image

Image

On 32 GPUs.
Image
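
One way to quantify the gap in the plots above is to compare the logged loss of the original and resumed runs at the steps they share. The following is a minimal sketch, not part of the training code: `first_divergence` is a hypothetical helper, and the loss values in the usage lines are placeholders rather than numbers from these runs.

```python
import math


def first_divergence(original, resumed, rel_tol=1e-3):
    """Return the first shared step whose losses differ by more than rel_tol.

    Both arguments map step number -> logged loss. Returns None when every
    shared step agrees within the tolerance, or when no steps are shared.
    """
    for step in sorted(original.keys() & resumed.keys()):
        if not math.isclose(original[step], resumed[step], rel_tol=rel_tol):
            return step
    return None


# Placeholder values for illustration only, not taken from the runs above.
original = {500: 4.10, 510: 4.08, 520: 4.05}
resumed = {510: 4.08, 520: 4.60}
print(first_divergence(original, resumed))
```

If the first divergent step is the first step logged after the resume, that suggests some state was not restored, rather than the runs drifting apart gradually.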

Runs used the 8B config with these override flags:

--metrics.enable_wandb --training.compile --training.seq_len 8192 \
--training.batch_size 1 --training.tensor_parallel_degree 1 \
--training.data_parallel_shard_degree -1 --training.data_parallel_replicate_degree 1 \
--training.steps 5000 --training.warmup_steps 200 --activation_checkpoint.mode selective \
--checkpoint.interval 500 --training.seed 42
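
For scale, the flags above imply the following data-parallel layout and tokens per optimizer step. This is a rough sketch under two assumptions: that `--training.batch_size` is the per-rank (local) batch size, and that a shard degree of `-1` expands to all ranks not claimed by the other parallelism dimensions. Pipeline parallelism is ignored, and the helper names `resolve_shard_degree` and `tokens_per_step` are made up for illustration.

```python
def resolve_shard_degree(world_size, dp_shard, dp_replicate, tp):
    """Expand a shard degree of -1 to the ranks left over after the other dimensions."""
    other = dp_replicate * tp
    if dp_shard == -1:
        if world_size % other != 0:
            raise ValueError("world size is not divisible by the other degrees")
        return world_size // other
    if dp_shard * other != world_size:
        raise ValueError("parallel degrees do not multiply to the world size")
    return dp_shard


def tokens_per_step(world_size, seq_len=8192, local_batch_size=1,
                    dp_shard=-1, dp_replicate=1, tp=1):
    """Tokens consumed per optimizer step across all data-parallel ranks."""
    shard = resolve_shard_degree(world_size, dp_shard, dp_replicate, tp)
    dp_ranks = shard * dp_replicate  # ranks that each see a distinct batch
    return dp_ranks * local_batch_size * seq_len


for gpus in (16, 32):
    print(gpus, tokens_per_step(gpus))
```

Under these assumptions the 16-GPU and 32-GPU runs consume 131,072 and 262,144 tokens per step respectively, so their loss curves are not directly comparable step for step.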

Labels: bug, enhancement, module: checkpoint, release blocking
