Loss metrics dramatically change after resuming from checkpoint #809
Description
I consistently get bad loss values after resuming from a checkpoint: the loss changes dramatically compared with the original run.
In the plot, green is the original run; red and blue are runs resumed from different checkpoints.
Runs were done on the 8b config with these overriding flags:
--metrics.enable_wandb --training.compile --training.seq_len 8192 \
--training.batch_size 1 --training.tensor_parallel_degree 1 \
--training.data_parallel_shard_degree -1 --training.data_parallel_replicate_degree 1 \
--training.steps 5000 --training.warmup_steps 200 --activation_checkpoint.mode selective \
--checkpoint.interval 500 --training.seed 42
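
For anyone trying to reproduce this, here is a minimal sketch of how the divergence between the original and a resumed run could be quantified. It assumes the per-step losses have already been exported (for example from wandb) into plain dicts keyed by step. The function name `first_divergence` and the 5% relative tolerance are illustrative choices, not part of the training code.

```python
from typing import Dict, Optional, Tuple


def first_divergence(
    original: Dict[int, float],
    resumed: Dict[int, float],
    rel_tol: float = 0.05,  # assumed threshold; tune to the run's noise level
) -> Optional[Tuple[int, float, float]]:
    """Return (step, original_loss, resumed_loss) for the first step logged
    in both runs whose losses differ by more than rel_tol, or None if every
    shared step agrees within the tolerance."""
    # Only steps present in both runs can be compared.
    for step in sorted(original.keys() & resumed.keys()):
        a, b = original[step], resumed[step]
        # Relative difference, guarded against division by zero.
        if abs(a - b) > rel_tol * max(abs(a), abs(b), 1e-12):
            return step, a, b
    return None
```

If the first divergent step is the very first step after the checkpoint, that would point at the restored state itself rather than at slow drift during training.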