
Calculate epoch timings #324

Merged
sgreenbury merged 9 commits into main from calculate-cosine-epochs on Apr 16, 2026

@sgreenbury
Contributor

This pull request adds a new workflow command that times training epochs and recommends max_epochs for cosine learning rate schedules, along with related documentation and support code. The goal is to help users set trainer.max_epochs so that the cosine schedule completes within a given wall-clock budget while leaving a safety margin. The implementation supports both CLI and SLURM execution and includes detailed user guidance.

The most important changes include:

New Feature: Epoch Timing and Schedule Recommendation

  • Adds the time-epochs command to the CLI (src/autocast/scripts/workflow/cli.py, src/autocast/scripts/workflow/commands.py), which runs a short training job to measure per-epoch duration and computes the recommended trainer.max_epochs for a cosine half-period schedule within a user-specified wall-clock budget.
  • Supports both local and SLURM execution, and allows re-computation from a saved timing checkpoint for reproducibility and batch job workflows.
  • Provides a dry-run mode to preview generated commands and recommendations without running any training.
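The recommendation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in commands.py: the function name `recommend_max_epochs`, its parameters, and the interpretation of the margin as a fraction of the budget held back as slack are all assumptions.

```python
import math


def recommend_max_epochs(
    epoch_seconds: float, budget_seconds: float, margin: float = 0.1
) -> int:
    """Whole number of epochs that fits in the budget after reserving a margin.

    Hypothetical helper: assumes `margin` is the fraction of the wall-clock
    budget kept in reserve, so only (1 - margin) of the budget is usable.
    """
    usable_seconds = budget_seconds * (1.0 - margin)
    # Round down so the schedule finishes inside the usable time;
    # clamp to at least one epoch so the result is always a valid setting.
    return max(1, math.floor(usable_seconds / epoch_seconds))


# Example: 24 h budget, 10% margin, 300 s measured per epoch.
print(recommend_max_epochs(epoch_seconds=300.0, budget_seconds=24 * 3600, margin=0.1))
```

Rounding down matters here: the cosine schedule only reaches its minimum at the final epoch, so an estimate that overshoots the budget would leave the schedule incomplete.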

Documentation

  • Adds a comprehensive section to docs/SCRIPTS_AND_CONFIGS.md explaining how to use the new timing feature, including example commands, SLURM integration, margin selection, and the interaction between max_epochs and max_time.

Training Script Support

  • Ensures the TrainingTimerCallback is always attached in autoencoder training so that per-epoch timing data is available in checkpoints.
  • Refactors checkpoint path resolution in the training script to support the new timing workflow and checkpoint naming conventions.

These changes make it straightforward to calculate the number of epochs for a cosine schedule that fits a compute budget, improving reproducibility and making fuller use of the budget.

- Remove time-epochs and compute_cosine_epochs scripts
- Integrate time-epochs functionality into CLI and commands

The timer previously measured only the training loop (on_train_epoch_start
to on_train_epoch_end), excluding validation. This caused the time-epochs
command to underestimate per-epoch duration and over-predict max_epochs,
risking max_time cutting training short before the cosine schedule
completes.

Now measures from one on_train_epoch_start to the next (with the final
epoch closed by on_train_end), capturing training batches, validation
batches, and any inter-epoch overhead.
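The start-to-start measurement described in this commit message can be sketched as below. This is an illustrative stand-in, not the project's TrainingTimerCallback: the class name `EpochTimer`, the `epoch_durations` attribute, and the injectable `clock` are assumptions, and the class is written in plain Python rather than as a Lightning callback so that it runs without dependencies. Only the hook names on_train_epoch_start and on_train_end come from the text above.

```python
import time
from typing import Callable, List, Optional


class EpochTimer:
    """Records start-to-start epoch durations (hypothetical sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._epoch_start: Optional[float] = None
        self.epoch_durations: List[float] = []

    def _close_epoch(self, now: float) -> None:
        # Close the epoch that is currently open, if any.
        if self._epoch_start is not None:
            self.epoch_durations.append(now - self._epoch_start)

    def on_train_epoch_start(self) -> None:
        # The start of epoch N+1 doubles as the end of epoch N, so validation
        # and inter-epoch overhead are counted in epoch N's duration.
        now = self._clock()
        self._close_epoch(now)
        self._epoch_start = now

    def on_train_end(self) -> None:
        # The final epoch has no following start, so close it here.
        self._close_epoch(self._clock())
        self._epoch_start = None
```

Passing the clock in as an argument is only a convenience for testing the sketch with fixed timestamps.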
@sgreenbury force-pushed the calculate-cosine-epochs branch from 78922a9 to 9a87a02 on April 16, 2026 16:10

Correct max_time output to DD:HH:MM:SS and add validation for
budget, margin, and num_epochs. Update docs example to match.
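As a rough illustration of this fix, the sketch below formats a duration in seconds as DD:HH:MM:SS and rejects out-of-range inputs. The function names `format_max_time` and `validate_timing_inputs`, and the specific accepted ranges (positive budget, margin in [0, 1), at least one epoch), are assumptions; the commit message does not state the actual rules.

```python
def format_max_time(total_seconds: float) -> str:
    """Format a duration as DD:HH:MM:SS, truncating fractional seconds."""
    seconds = int(total_seconds)
    days, remainder = divmod(seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{secs:02d}"


def validate_timing_inputs(budget_seconds: float, margin: float, num_epochs: int) -> None:
    """Raise ValueError for inputs outside the (assumed) valid ranges."""
    if budget_seconds <= 0:
        raise ValueError("budget must be positive")
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must satisfy 0 <= margin < 1")
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")


validate_timing_inputs(budget_seconds=24 * 3600, margin=0.1, num_epochs=3)
print(format_max_time(24 * 3600))
```

Validating before formatting keeps a negative or zero budget from producing a misleading max_time string.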
@sgreenbury merged commit 9af9ed5 into main on Apr 16, 2026
3 checks passed
@sgreenbury deleted the calculate-cosine-epochs branch on April 16, 2026 20:18