Makes logic in optimizers use 1-based steps. #1118
Resolves inconsistent definitions of "step" between SpmdTrainer and optimizers.
Background
SpmdTrainer.step is used for summaries and checkpoints, and counts steps starting from 1 (the first training step is step 1).
In optimizers, the use of steps is inconsistent. Most use 0-based steps, where the first training step uses the schedule value computed from count=0:

- scale_by_schedule (used for learning rate schedules)
- add_decayed_weights
- param_ema (different from ema)
- skip_and_clip_by_global_norm uses 0-based steps to compute the drop_norm threshold
- decay_bias_correction assumes a 0-based step

Resolution
After discussion, we decided to change the optimizer logic to be consistent with SpmdTrainer.step. Specifically, we change

- decay_bias_correction
- adafactor_decay_rate
- scale_by_schedule
- add_decayed_weights
- param_ema
- skip_and_clip_by_global_norm

to assume that the first step is 1.

This will make optimizer steps consistent with summary steps, e.g., the summary at step N will reflect the learning rate schedule computed for step N.
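As a concrete example of why bias correction wants a 1-based step, here is a sketch assuming the common Adam-style correction factor 1 - decay**t, with t the 1-based step count (the actual decay_bias_correction implementation may differ): at t=1 the factor is well defined, whereas evaluating the same formula at t=0 yields 0, which would divide the moment estimate by zero.

```python
def decay_bias_correction(decay: float, step: int) -> float:
    """Adam-style bias correction sketch, assuming a 1-based step."""
    return 1.0 - decay ** step

beta = 0.9

# With a 1-based step, the first step yields a well-defined correction factor.
assert abs(decay_bias_correction(beta, 1) - 0.1) < 1e-12

# Evaluating the same formula with a 0-based first step (step=0) yields 0,
# which would be a division by zero when normalizing the moment estimate.
assert decay_bias_correction(beta, 0) == 0.0
```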