Makes logic in optimizers use 1-based steps. #1118
Resolves inconsistent definitions of "step" between SpmdTrainer and optimizers.
Background
SpmdTrainer.step is used for summaries and checkpoints, and counts steps starting from 1 (the first training step is step 1).
In optimizers, the use of steps is inconsistent. Most use 0-based steps, where the first training step uses the schedule value computed from count=0:

- scale_by_schedule (used for learning rate schedules)
- add_decayed_weights
- param_ema (different from ema)
- skip_and_clip_by_global_norm uses 0-based steps to compute the drop_norm threshold
- decay_bias_correction assumes a 0-based step

Resolution
After discussion, we decided to change the optimizer logic to be consistent with SpmdTrainer.step. Specifically, we change

- decay_bias_correction
- adafactor_decay_rate
- scale_by_schedule
- add_decayed_weights
- param_ema
- skip_and_clip_by_global_norm

to assume that the first step is 1.

This will make optimizer steps consistent with summary steps, e.g., the summary at step N will reflect the learning rate schedule computed for step N.
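As a concrete example of why bias correction wants a 1-based step, here is a sketch assuming the common Adam-style correction factor 1 - decay**t, with t the 1-based step count (the actual decay_bias_correction implementation may differ): at t=1 the factor is well defined, whereas evaluating the same formula at t=0 yields 0, which would divide the moment estimate by zero.

```python
def decay_bias_correction(decay: float, step: int) -> float:
    """Adam-style bias correction sketch, assuming a 1-based step."""
    return 1.0 - decay ** step

beta = 0.9

# With a 1-based step, the first step yields a well-defined correction factor.
assert abs(decay_bias_correction(beta, 1) - 0.1) < 1e-12

# Evaluating the same formula with a 0-based first step (step=0) yields 0,
# which would be a division by zero when normalizing the moment estimate.
assert decay_bias_correction(beta, 0) == 0.0
```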