Commit 832130c
fix(checkpoint): honor train_time_interval under manual_optimization
ModelCheckpoint silently dropped train_time_interval when the LightningModule used manual optimization. The manual-opt branch in on_train_batch_end only checked every_n_train_steps, so a callback configured with `train_time_interval=timedelta(minutes=15)` and no step trigger never fired mid-run. last.ckpt did still appear at fit completion via on_train_end, which made the bug invisible to most tests but broke any workflow that relies on mid-run saves -- chained SLURM segments resuming from epoch 0 every time, spot/preempt training losing all in-flight progress, etc.
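To make the failure mode concrete, here is a minimal standalone sketch of the two trigger checks. It is not the code from this commit: the function names and signatures are hypothetical, and only the `skip_batch` / `skip_time` naming is taken from the commit message. It assumes a step trigger of 0 means "disabled" and that elapsed time is measured in seconds since the last time-based save.

```python
from datetime import timedelta
from typing import Optional


def skip_batch(global_step: int, every_n_train_steps: int) -> bool:
    """True when the step trigger is disabled or not due at this step."""
    return every_n_train_steps < 1 or global_step % every_n_train_steps != 0


def skip_time(
    elapsed_seconds: Optional[float],
    train_time_interval: Optional[timedelta],
) -> bool:
    """True when the time trigger is disabled or the interval has not elapsed."""
    if train_time_interval is None or elapsed_seconds is None:
        return True
    return elapsed_seconds < train_time_interval.total_seconds()


def saves_step_only(global_step: int, every_n_train_steps: int) -> bool:
    # Hypothetical model of the behavior described above: only the step
    # trigger is consulted, so a time-only configuration never saves.
    return not skip_batch(global_step, every_n_train_steps)


def saves_either_trigger(
    global_step: int,
    every_n_train_steps: int,
    elapsed_seconds: Optional[float],
    train_time_interval: Optional[timedelta],
) -> bool:
    # Save when either trigger is satisfied, i.e. skip only if both say skip.
    return not (
        skip_batch(global_step, every_n_train_steps)
        and skip_time(elapsed_seconds, train_time_interval)
    )
```

With `every_n_train_steps=0` and a 15-minute interval, `saves_step_only` returns False at every step, while `saves_either_trigger` returns True once 15 minutes have elapsed.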
The fix mirrors the auto-opt branch's skip_batch + skip_time logic so a save fires when either trigger is satisfied. The new regression test uses a spy callback to observe _last_global_step_saved during fit, since checking the file at end-of-run misses the bug entirely.
1 parent 0e20e15 commit 832130c
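The reason a spy is needed can be shown with a toy model. The sketch below uses only the standard library; `ToyCheckpoint`, `Spy`, and `toy_fit` are hypothetical stand-ins, not Lightning classes and not the test added in this commit. It assumes the end-of-fit hook always saves, as the commit message describes for `on_train_end`.

```python
from typing import List


class ToyCheckpoint:
    """Hypothetical stand-in for a checkpoint callback (not Lightning's class)."""

    def __init__(self, save_mid_run: bool) -> None:
        self.save_mid_run = save_mid_run
        self.last_global_step_saved = 0

    def on_train_batch_end(self, global_step: int) -> None:
        # A working callback records a save mid-run; a broken one does nothing.
        if self.save_mid_run:
            self.last_global_step_saved = global_step

    def on_train_end(self, global_step: int) -> None:
        # The end-of-fit save happens whether or not mid-run saves worked.
        self.last_global_step_saved = global_step


class Spy:
    """Records the checkpoint's saved step after every batch."""

    def __init__(self, target: ToyCheckpoint) -> None:
        self.target = target
        self.observed: List[int] = []

    def on_train_batch_end(self, global_step: int) -> None:
        self.observed.append(self.target.last_global_step_saved)


def toy_fit(checkpoint: ToyCheckpoint, spy: Spy, num_steps: int) -> None:
    # The spy runs after the checkpoint callback on each batch.
    for step in range(1, num_steps + 1):
        checkpoint.on_train_batch_end(step)
        spy.on_train_batch_end(step)
    checkpoint.on_train_end(num_steps)
```

In this toy, a checkpoint built with `save_mid_run=False` and one built with `save_mid_run=True` end a run with the same `last_global_step_saved`. Only the values the spy recorded during the loop tell them apart.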
3 files changed
Lines changed: 75 additions & 3 deletions
File tree
- src/lightning/pytorch
- callbacks
- tests/tests_pytorch/callbacks
Diff content not preserved. Hunk `@@ -26,6 +26,8 @@`: 2 additions (new lines 29-30).
Diff content not preserved. Hunk `@@ -348,9 +348,27 @@`: 20 additions (new lines 351-368 and 370-371), 2 deletions (original lines 351-352).
Lines changed: 53 additions & 1 deletion
Diff content not preserved. Hunk `@@ -3,12 +3,13 @@`: 2 additions (new lines 6 and 12), 1 deletion (original line 11). Hunk `@@ -180,3 +181,54 @@`: 51 additions (new lines 184-234).