- A 2% safety margin (configurable with `-m`) is subtracted from the budget.
- The result is rounded **down** to a whole epoch (`floor`), so the cosine
  schedule always completes its full half-period.
- `trainer.max_time` is set to the full (un-margined) budget as a hard stop.

Per-epoch times are extracted from the `TrainingTimerCallback` saved in the
checkpoint, which excludes model setup and data loading overhead.
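
As a rough sketch, the arithmetic described above looks like the following
(the function and variable names are illustrative, not the actual
implementation in `time-epochs`):

```python
import math

def recommend_overrides(budget_seconds: float, seconds_per_epoch: float,
                        margin: float = 0.02):
    """Turn a wall-clock budget into (max_epochs, max_time_seconds)."""
    usable = budget_seconds * (1.0 - margin)              # subtract the safety margin
    max_epochs = math.floor(usable / seconds_per_epoch)   # round down to whole epochs
    max_time_seconds = budget_seconds                     # hard stop keeps the full budget
    return max_epochs, max_time_seconds

# e.g. a 4-hour budget with ~9-minute epochs -> (26, 14400)
print(recommend_overrides(4 * 3600, 540))
```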
#### How `max_epochs` and `max_time` interact at runtime
The recommended overrides set **two** stopping conditions:

| Condition | Controlled by | What happens |
|---|---|---|
| Epoch limit | `trainer.max_epochs` | Training stops cleanly after completing this many epochs. |
| Wall-clock limit | `trainer.max_time` | Lightning hard-stops training when the clock runs out. |

Lightning stops at whichever fires first.
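
Expressed directly on a Lightning `Trainer`, the two conditions amount to
something like this (the values are illustrative, and depending on the
installed Lightning version the import may be `pytorch_lightning` instead):

```python
from datetime import timedelta

import lightning.pytorch as pl

trainer = pl.Trainer(
    max_epochs=26,                # epoch limit: clean stop after 26 epochs
    max_time=timedelta(hours=4),  # wall-clock limit: hard stop at the full budget
)
```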
**Faster than expected** (each epoch takes less time than the timing run
measured): `max_epochs` fires first. All epochs complete, and the cosine LR
schedule reaches exactly zero. `max_time` is never triggered. This is the
ideal outcome.

**Slower than expected** (each epoch takes more time): `max_time` fires first,
cutting training short before all `max_epochs` have completed. The cosine
schedule has *not* reached zero — the final LR is positive.

The 2% default margin tolerates up to ~2% slower epochs before `max_time`
intervenes. The `floor()` rounding adds a small additional buffer (up to
one epoch's worth). For workloads where epoch duration is stable
(compute-bound, data in memory), 2% is sufficient. For I/O-bound workloads
that stream from a shared parallel filesystem, consider `--margin 0.05` or
higher.

**The cosine cannot overshoot and start increasing.**
`cosine_lambda(t) = 0.5 * (1 + cos(pi * t / max_epochs))` is monotonically
decreasing over `[0, max_epochs]`. Training terminates at `max_epochs`, so
the second half of the cosine period is never entered. If `max_time`
intervenes earlier, the LR is still on the decreasing branch — it simply
hasn't reached zero yet.
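
A quick numeric check of that claim, as a standalone sketch (`max_epochs = 26`
is an arbitrary example value):

```python
import math

max_epochs = 26  # illustrative

def cosine_lambda(t: float) -> float:
    # The multiplier quoted above: 1.0 at t = 0, 0.0 at t = max_epochs.
    return 0.5 * (1 + math.cos(math.pi * t / max_epochs))

factors = [cosine_lambda(t) for t in range(max_epochs + 1)]
assert all(a >= b for a, b in zip(factors, factors[1:]))  # never increases
print(factors[0], factors[-1])  # 1.0 0.0
```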
#### Choosing a margin
| Scenario | Recommended `--margin` |
|---|---|
| Data in memory, single GPU (very stable epoch times) | 0.02 (default) |
| Local NVMe data loading | 0.02 – 0.03 |
| Streaming from Lustre / GPFS | 0.05 – 0.10 |

To empirically check variance, run `time-epochs` twice at different cluster
load levels. If the two per-epoch estimates agree to within 3%, the default
2% margin is safe. If they diverge more, match the margin to the observed
variance.
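
One way to turn that comparison into a concrete `--margin` value (the
per-epoch estimates below are made-up numbers, not `time-epochs` output):

```python
# Hypothetical per-epoch estimates (seconds) from two timing runs
# taken at different cluster load levels.
estimate_a, estimate_b = 540.0, 551.0

spread = abs(estimate_a - estimate_b) / min(estimate_a, estimate_b)
margin = max(0.02, spread)  # never go below the 2% default
print(f"observed spread: {spread:.1%}, suggested --margin: {margin:.3f}")
```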
## Lower-level script entry points (advanced)
AutoCast uses a set of Python scripts located in `src/autocast/scripts/` as entry points for training and evaluation. These scripts are exposed as CLI commands via `pyproject.toml`.