
Commit 9af9ed5

Merge pull request #324 from alan-turing-institute/calculate-cosine-epochs
Calculate epoch timings
2 parents 8bbd68e + 18b4a52 commit 9af9ed5

4 files changed

Lines changed: 575 additions & 18 deletions


docs/SCRIPTS_AND_CONFIGS.md

Lines changed: 134 additions & 0 deletions
@@ -141,6 +141,140 @@ For launching many prewritten runs from a manifest list:
bash scripts/launch_from_manifest.sh run_manifests/example_runs.txt
```

### Timing epochs and computing `max_epochs` for cosine schedules

When using the `adamw_half` optimizer (half-period cosine LR schedule), the
learning rate decays from its initial value to zero over exactly
`trainer.max_epochs` epochs. If training is cut short by `trainer.max_time`
before all epochs complete, the schedule will not have reached zero.

The `time-epochs` subcommand solves this by running a short timing run (a few
epochs), measuring per-epoch wall-clock duration, and computing the
`max_epochs` that fits within a given budget:

```bash
# Time 3 EPD epochs (default) and compute max_epochs for a 24h budget
uv run autocast time-epochs datamodule=advection_diffusion_multichannel

# Time an autoencoder run
uv run autocast time-epochs --kind ae datamodule=reaction_diffusion

# Time a processor run
uv run autocast time-epochs --kind processor datamodule=reaction_diffusion

# Custom: 5 timing epochs, 12h budget, 2% safety margin
uv run autocast time-epochs -n 5 -b 12 -m 0.02 \
    datamodule=shallow_water2d

# With experiment overrides
uv run autocast time-epochs experiment=epd_crps_vit_large_ps4_64

# Dry-run to inspect the generated command
uv run autocast time-epochs --dry-run datamodule=reaction_diffusion
```

`--kind` selects the training type to time: `ae`, `epd` (default), or
`processor`. Use the same kind you intend to train so that the per-epoch
measurement reflects the actual model and data pipeline.

#### Batch timing via SLURM

With `--mode slurm` the timing run is submitted as a SLURM job and the CLI
exits immediately, printing a follow-up command to retrieve results once the
job completes:

```bash
# Submit timing jobs for several configs at once
uv run autocast time-epochs --mode slurm --kind ae \
    datamodule=reaction_diffusion --run-group timing
uv run autocast time-epochs --mode slurm --kind epd \
    datamodule=shallow_water2d --run-group timing \
    experiment=epd_crps_vit_large_ps4_64

# Once the SLURM jobs finish, compute results from the checkpoints
uv run autocast time-epochs --from-checkpoint outputs/timing/ae_.../timing.ckpt
uv run autocast time-epochs --from-checkpoint outputs/timing/epd_.../timing.ckpt
```

`--from-checkpoint` reads an existing checkpoint, extracts the per-epoch
times, and prints the recommendation — no training is run. You can also
use it to recompute with a different budget or margin:

```bash
uv run autocast time-epochs --from-checkpoint outputs/timing/epd_.../timing.ckpt \
    -b 12 -m 0.05
```
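
If you need the raw numbers rather than the printed recommendation, the
per-epoch times live inside the checkpoint itself. A rough sketch of pulling
them out manually (the path is a placeholder, the callback entry follows
Lightning's convention of keying callback state by class name, and the inner
key name is hypothetical rather than verified against this repository):

```python
import torch

# Placeholder path; use the timing checkpoint printed by the CLI.
ckpt = torch.load("outputs/timing/epd_run/timing.ckpt",
                  map_location="cpu", weights_only=False)

# Lightning stores each callback's state_dict() under checkpoint["callbacks"].
timer_state = ckpt["callbacks"]["TrainingTimerCallback"]
epoch_times = timer_state["epoch_times_s"]  # hypothetical key name
print(f"{sum(epoch_times) / len(epoch_times):.1f} s/epoch "
      f"over {len(epoch_times)} epochs")
```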

The output includes recommended Hydra overrides ready to copy-paste:

```
============================================================
Seconds/epoch: 150.0s
Budget: 24.0h (margin: 2%)
max_epochs: 564
Expected time: 23.5h
Headroom: 0.5h
============================================================

Recommended overrides:
trainer.max_epochs=564 trainer.max_time=01:00:00:00 optimizer=adamw_half
```

Here `trainer.max_time=01:00:00:00` uses Lightning's `DD:HH:MM:SS` format,
i.e. a hard stop after one day.

The calculation is conservative:
- A 2% safety margin (configurable with `-m`) is subtracted from the budget.
- The result is rounded **down** to a whole epoch (`floor`), so the cosine
  schedule always completes its full half-period.
- `trainer.max_time` is set to the full (un-margined) budget as a hard stop.

Per-epoch times are extracted from the `TrainingTimerCallback` saved in the
checkpoint, which excludes model setup and data loading overhead.
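
The arithmetic is simple enough to check by hand. A minimal sketch (plain
Python, not the repository's implementation; the helper name is made up for
illustration) that reproduces the numbers in the example output above:

```python
import math

def recommend_max_epochs(seconds_per_epoch: float, budget_hours: float,
                         margin: float = 0.02) -> int:
    """Largest whole number of epochs that fits inside the margined budget."""
    usable_s = budget_hours * 3600 * (1 - margin)    # apply the safety margin
    return math.floor(usable_s / seconds_per_epoch)  # round down to whole epochs

# Values from the example output: 150 s/epoch, 24 h budget, 2% margin.
max_epochs = recommend_max_epochs(150.0, 24.0, 0.02)
print(max_epochs)                                      # 564
print(f"{max_epochs * 150.0 / 3600:.1f} h expected")   # 23.5 h expected
```
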
#### How `max_epochs` and `max_time` interact at runtime

The recommended overrides set **two** stopping conditions:

| Condition | Controlled by | What happens |
|---|---|---|
| Epoch limit | `trainer.max_epochs` | Training stops cleanly after completing this many epochs. |
| Wall-clock limit | `trainer.max_time` | Lightning hard-stops training when the clock runs out. |

Lightning stops at whichever fires first.

**Faster than expected** (each epoch takes less time than the timing run
measured): `max_epochs` fires first. All epochs complete, and the cosine LR
schedule reaches exactly zero. `max_time` is never triggered. This is the
ideal outcome.

**Slower than expected** (each epoch takes more time): `max_time` fires first,
cutting training short before all `max_epochs` have completed. The cosine
schedule has *not* reached zero — the final LR is positive.

The 2% default margin tolerates up to ~2% slower epochs before `max_time`
intervenes. The `floor()` rounding adds a small additional buffer (up to
one epoch's worth). For workloads where epoch duration is stable
(compute-bound, data in memory), 2% is sufficient. For I/O-bound workloads
that stream from a shared parallel filesystem, consider `--margin 0.05` or
higher.

**The cosine cannot overshoot and start increasing.**
`cosine_lambda(t) = 0.5 * (1 + cos(pi * t / max_epochs))` is monotonically
decreasing over `[0, max_epochs]`. Training terminates at `max_epochs`, so
the second half of the cosine period is never entered. If `max_time`
intervenes earlier, the LR is still on the decreasing branch — it simply
hasn't reached zero yet.
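
A quick numerical spot-check of that claim, using the formula quoted above
(a standalone sketch, not code from the repository):

```python
import math

def cosine_lambda(t: float, max_epochs: int) -> float:
    # Half-period cosine: 1.0 at t=0, exactly 0.0 at t=max_epochs.
    return 0.5 * (1 + math.cos(math.pi * t / max_epochs))

max_epochs = 564
samples = [cosine_lambda(t, max_epochs) for t in range(0, max_epochs + 1, 94)]
print([round(s, 3) for s in samples])
# [1.0, 0.933, 0.75, 0.5, 0.25, 0.067, 0.0]
assert all(a > b for a, b in zip(samples, samples[1:]))  # strictly decreasing
```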

#### Choosing a margin

| Scenario | Recommended `--margin` |
|---|---|
| Data in memory, single GPU (very stable epoch times) | 0.02 (default) |
| Local NVMe data loading | 0.02 – 0.03 |
| Streaming from Lustre / GPFS | 0.05 – 0.10 |

To empirically check variance, run `time-epochs` twice at different cluster
load levels. If the two per-epoch estimates agree within 3%, 2% margin is
safe. If they diverge more, match the margin to the observed variance.
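
One simple way to turn two timing runs into a margin choice (plain arithmetic,
not a repository helper; the example numbers are made up):

```python
# Seconds/epoch measured in two timing runs at different cluster load levels.
run_a, run_b = 150.0, 154.2

spread = abs(run_a - run_b) / min(run_a, run_b)  # relative disagreement
margin = max(0.02, spread)                       # never go below the 2% default
print(f"observed spread {spread:.1%} -> use --margin {margin:.2f}")
# observed spread 2.8% -> use --margin 0.03
```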

## Lower-level script entry points (advanced)

AutoCast uses a set of Python scripts located in `src/autocast/scripts/` as entry points for training and evaluation. These scripts are exposed as CLI commands via `pyproject.toml`.

src/autocast/scripts/training.py

Lines changed: 23 additions & 17 deletions
@@ -269,9 +269,14 @@ def _attach_reset_timer_callback(
 class TrainingTimerCallback(Callback):
     """Measures wall-clock training time and persists it to the checkpoint.

-    Records total training time and per-epoch durations. The values are
-    stored via ``state_dict()`` so the eval script can read them directly
-    from the checkpoint's ``callbacks`` block.
+    Records total training time and per-epoch durations. Each epoch
+    measurement spans the **full cycle** — training batches *and* the
+    subsequent validation loop — so that the ``time-epochs`` command can
+    accurately predict wall-clock budget consumption.
+
+    Epoch boundaries are measured from one ``on_train_epoch_start`` to the
+    next; the final epoch is closed out in ``on_train_end`` (which fires
+    after the last validation loop).

     Note
     ----
@@ -305,19 +310,20 @@ def on_train_epoch_start(
         self, trainer: L.Trainer, pl_module: L.LightningModule
     ) -> None:
         del trainer, pl_module
-        self._epoch_start = perf_counter()
-
-    def on_train_epoch_end(
-        self, trainer: L.Trainer, pl_module: L.LightningModule
-    ) -> None:
-        del trainer, pl_module
+        now = perf_counter()
+        # Close out the *previous* epoch (training + validation + overhead).
         if self._epoch_start is not None:
-            self._epoch_times_s.append(perf_counter() - self._epoch_start)
+            self._epoch_times_s.append(now - self._epoch_start)
+        self._epoch_start = now

     def on_train_end(self, trainer: L.Trainer, pl_module: L.LightningModule) -> None:
         del trainer, pl_module
+        now = perf_counter()
+        # Close out the final epoch (includes its validation loop).
+        if self._epoch_start is not None:
+            self._epoch_times_s.append(now - self._epoch_start)
         if self._train_start is not None:
-            self.training_runtime_total_s = perf_counter() - self._train_start
+            self.training_runtime_total_s = now - self._train_start

     def state_dict(self) -> dict:  # type: ignore[override]
         runtime_elapsed_s = self._current_elapsed_runtime_s()
@@ -570,6 +576,7 @@ def train_autoencoder(
     trainer = instantiate(
         trainer_cfg, logger=wandb_logger, default_root_dir=str(work_dir)
     )
+    trainer.callbacks.append(TrainingTimerCallback())
     output_cfg = config.get("output", {})
     if output_cfg.get("save_config", False) and trainer.is_global_zero:
         save_resolved_config(
@@ -622,12 +629,11 @@ def train_autoencoder(
         log.info("Starting training from scratch (no resume checkpoint).")
     trainer.fit(model=model, datamodule=datamodule)

-    checkpoint_name = output_cfg.get("checkpoint_name", "autoencoder.ckpt")
-    checkpoint_target = Path(checkpoint_name)
-    checkpoint_path = (
-        checkpoint_target
-        if checkpoint_target.is_absolute()
-        else (work_dir / checkpoint_target)
+    checkpoint_path = _resolve_checkpoint_path(
+        work_dir,
+        output_cfg,
+        output_cfg.get("checkpoint_path"),
+        default_name="autoencoder.ckpt",
     )
     if trainer.is_global_zero:
         trainer.save_checkpoint(checkpoint_path)
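
The start-to-start measurement pattern in the diff above, reduced to a
standalone sketch (illustrative only, not the repository's class):

```python
from time import perf_counter, sleep

epoch_times: list[float] = []
epoch_start: float | None = None

def on_train_epoch_start() -> None:
    """Close out the previous epoch (training + validation) and start a new one."""
    global epoch_start
    now = perf_counter()
    if epoch_start is not None:
        epoch_times.append(now - epoch_start)
    epoch_start = now

def on_train_end() -> None:
    """Close out the final epoch, which ends after its validation loop."""
    if epoch_start is not None:
        epoch_times.append(perf_counter() - epoch_start)

for _ in range(3):          # three "epochs"
    on_train_epoch_start()
    sleep(0.10)             # stand-in for the training batches
    sleep(0.02)             # stand-in for the validation loop -- still measured
on_train_end()
print([round(t, 2) for t in epoch_times])  # roughly [0.12, 0.12, 0.12]
```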

src/autocast/scripts/workflow/cli.py

Lines changed: 72 additions & 0 deletions
@@ -12,6 +12,7 @@
     eval_command,
     infer_dataset_from_workdir,
     infer_resume_checkpoint,
+    time_epochs_command,
     train_command,
     train_eval_single_job_command,
 )
@@ -154,6 +155,53 @@ def build_parser() -> argparse.ArgumentParser:
     )
     _add_common_args(cache_parser)

+    # -- time-epochs -------------------------------------------------------
+    time_parser = subparsers.add_parser(
+        "time-epochs",
+        description=(
+            "Run a short training (ae, epd, or processor) to time per-epoch "
+            "duration and compute the recommended trainer.max_epochs for a "
+            "cosine half-period schedule within a given wall-clock budget."
+        ),
+    )
+    _add_train_args(time_parser)
+    time_parser.add_argument(
+        "--kind",
+        choices=["ae", "epd", "processor"],
+        default="epd",
+        help="Training kind to time (default: epd).",
+    )
+    time_parser.add_argument(
+        "-n",
+        "--num-epochs",
+        type=int,
+        default=3,
+        help="Number of epochs to run for timing (default: 3).",
+    )
+    time_parser.add_argument(
+        "-b",
+        "--budget",
+        type=float,
+        default=24.0,
+        help="Wall-clock budget in hours (default: 24).",
+    )
+    time_parser.add_argument(
+        "-m",
+        "--margin",
+        type=float,
+        default=0.02,
+        help="Safety margin fraction subtracted from budget (default: 0.02 = 2%%).",
+    )
+    time_parser.add_argument(
+        "--from-checkpoint",
+        metavar="CKPT",
+        help=(
+            "Path to an existing timing checkpoint. Skips training and "
+            "computes the recommendation directly."
+        ),
+    )
+    _add_common_args(time_parser)
+
     return parser


@@ -320,6 +368,30 @@ def main() -> None:
         )
         return

+    if args.command == "time-epochs":
+        dataset = _resolve_dataset(
+            work_dir=args.workdir,
+            overrides=combined_overrides,
+        )
+
+        time_epochs_command(
+            kind=args.kind,
+            mode=args.mode,
+            dataset=dataset,
+            output_base=args.output_base,
+            overrides=combined_overrides,
+            num_epochs=args.num_epochs,
+            budget_hours=args.budget,
+            margin=args.margin,
+            run_group=args.run_group,
+            run_id=args.run_id,
+            work_dir=args.workdir,
+            from_checkpoint=args.from_checkpoint,
+            runtime_typechecking=args.runtime_typechecking,
+            dry_run=args.dry_run,
+        )
+        return
+
     raise ValueError(f"Unsupported command: {args.command}")