[QUESTION] resume from checkpoint and increase the data Parallelism size #2829

@bugm

Description

Hello,

Suppose I am training a model with global_batch_size = 16 and micro_batch_size = 1 on 8 GPUs (dp=8), save a checkpoint at iteration 50, and stop.
Then I resume training from that checkpoint on 16 GPUs (dp=16).
Is this resumed run completely equivalent to continuing the original training past iteration 50 (considering the RNG state or anything else)?
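
To make the scenario concrete, here is a minimal sketch of the batch arithmetic I have in mind (my own illustration, not Megatron-LM code; `grad_accum_steps` is a hypothetical helper): with a fixed global batch size, doubling dp halves the number of micro-batches each data-parallel rank accumulates per iteration.

```python
# Sketch of the batch arithmetic in the question (illustrative only,
# not Megatron-LM's actual implementation).

def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    """Micro-batches each data-parallel rank runs per training iteration."""
    samples_per_step = micro_batch_size * dp_size
    assert global_batch_size % samples_per_step == 0, "global batch must split evenly"
    return global_batch_size // samples_per_step

# Original run: 8 GPUs (dp=8) -> each rank accumulates over 2 micro-batches.
print(grad_accum_steps(16, 1, 8))   # 2

# Resumed run: 16 GPUs (dp=16) -> each rank runs 1 micro-batch per iteration.
print(grad_accum_steps(16, 1, 16))  # 1
```

So the number of samples consumed per iteration stays the same, but how they are sharded across ranks changes, which is why I am unsure whether per-rank state such as the RNG makes the two runs diverge.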
