Commit 8d37bdd

Add model checkpointing
1 parent acbd68d commit 8d37bdd

4 files changed: +20 -12 lines changed

README.md

Lines changed: 10 additions & 0 deletions
@@ -98,3 +98,13 @@ You can override any parameter from command line like this
 ```bash
 uv run python glm_experiments/train.py trainer.max_epochs=20 data.batch_size=64
 ```
+
+### Loading a Checkpoint
+
+```python
+# replace with CLMLitModule if necessary
+from glm_experiments.models.lm_lit_module import MLMLitModule
+
+# Load plain pytorch model from checkpoint
+model = MLMLitModule.load_from_checkpoint("logs/train/runs/<timestamp>/checkpoints/{step}.ckpt").net
+```
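
The `{step}` in the README path is a placeholder: given the `filename: "{step}"` setting in this commit, checkpoint files are expected to be named after the global step, for example `1000.ckpt`. A minimal sketch, assuming that naming and the `logs/train/runs/<timestamp>/checkpoints/` layout, of how the highest-step checkpoint in a run directory could be located. The helper name `latest_step_checkpoint` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import Optional


def latest_step_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the checkpoint with the largest step number, or None if there is none.

    Assumes files are named "<step>.ckpt" inside "<run_dir>/checkpoints".
    """
    ckpt_dir = Path(run_dir) / "checkpoints"
    # Keep only files whose stem is a plain integer, e.g. "1000.ckpt".
    candidates = [p for p in ckpt_dir.glob("*.ckpt") if p.stem.isdecimal()]
    if not candidates:
        return None
    # Compare numerically so that 1000 sorts after 200.
    return max(candidates, key=lambda p: int(p.stem))
```

The returned path can then be passed as a string to `MLMLitModule.load_from_checkpoint(...)`, as in the README snippet above.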

configs/callbacks/default.yaml

Lines changed: 9 additions & 10 deletions
@@ -2,20 +2,19 @@ defaults:
   - model_summary
   - step_progress_bar
   - lr_monitor
+  - model_checkpoint
   - _self_
 
 model_checkpoint:
   dirpath: ${paths.output_dir}/checkpoints
-  filename: "epoch_{epoch:03d}"
-  monitor: "val/acc"
-  mode: "max"
-  save_last: True
-  auto_insert_metric_name: False
-
-early_stopping:
-  monitor: "val/acc"
-  patience: 100
-  mode: "max"
+  filename: "{step}"
+  monitor: null
+  save_last: false # prefer the explicit step-based checkpointing
+  save_top_k: -1
+  auto_insert_metric_name: false
+  every_n_epochs: 0
+  verbose: true
+  every_n_train_steps: ${trainer.max_steps}
 
 model_summary:
   max_depth: 1
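
With `every_n_epochs: 0` and `every_n_train_steps: ${trainer.max_steps}`, the new configuration appears intended to write a single checkpoint at the final training step, while `save_top_k: -1` keeps every checkpoint that is written. A simplified sketch of that cadence, assuming a checkpoint is saved whenever the global step is a positive multiple of `every_n_train_steps`. It models the configured behavior for illustration and is not Lightning's implementation; `checkpoint_steps` is a hypothetical helper.

```python
from typing import List


def checkpoint_steps(max_steps: int, every_n_train_steps: int) -> List[int]:
    """List the global steps at which a step-based checkpoint would be written.

    Assumption: a save happens when global_step % every_n_train_steps == 0.
    """
    if every_n_train_steps < 1:
        # A non-positive interval is treated as "step-based saving disabled".
        return []
    return [s for s in range(1, max_steps + 1) if s % every_n_train_steps == 0]


# Interval equal to max_steps: one checkpoint, at the last step.
print(checkpoint_steps(max_steps=1000, every_n_train_steps=1000))  # [1000]
# A smaller interval would produce intermediate checkpoints as well.
print(checkpoint_steps(max_steps=1000, every_n_train_steps=250))  # [250, 500, 750, 1000]
```

Overriding `callbacks.model_checkpoint.every_n_train_steps` with a smaller value would be the natural way to get intermediate checkpoints, assuming the Hydra config keys shown in the diff.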
File renamed without changes.

glm_experiments/models/lm_lit_module.py

Lines changed: 1 addition & 2 deletions
@@ -66,8 +66,7 @@ def __init__(
     ):
         super().__init__()
 
-        # Save hyperparameters (excluding net for cleaner logs)
-        self.save_hyperparameters(ignore=["net"], logger=False)
+        self.save_hyperparameters(logger=False)
 
         self.net = net
 
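
Dropping `ignore=["net"]` means `net` is now stored with the other hyperparameters, which is presumably what lets the README call `load_from_checkpoint` with only a path. The toy sketch below illustrates the idea in plain Python: a constructor argument that was not saved has to be supplied again at load time. `ToyModule`, `save_checkpoint`, and `load_checkpoint` are hypothetical stand-ins, not Lightning APIs, and `pickle` replaces the real checkpoint format.

```python
import pickle
from typing import Any, Dict, Iterable


class ToyModule:
    """Stand-in for a LightningModule that takes its network as an argument."""

    def __init__(self, net: Any, lr: float = 1e-3) -> None:
        self.net = net
        self.lr = lr


def save_checkpoint(init_args: Dict[str, Any], ignore: Iterable[str] = ()) -> bytes:
    """Serialize the constructor arguments, minus any ignored names."""
    ignored = set(ignore)
    hparams = {k: v for k, v in init_args.items() if k not in ignored}
    return pickle.dumps({"hyper_parameters": hparams})


def load_checkpoint(blob: bytes, **overrides: Any) -> ToyModule:
    """Rebuild the module from saved arguments plus caller-supplied overrides."""
    hparams = pickle.loads(blob)["hyper_parameters"]
    return ToyModule(**{**hparams, **overrides})


args = {"net": ["embedding", "transformer"], "lr": 3e-4}

# With net saved, the checkpoint alone is enough to rebuild the module.
full = load_checkpoint(save_checkpoint(args))

# With net ignored, the caller must pass it again or construction fails.
partial = save_checkpoint(args, ignore=["net"])
restored = load_checkpoint(partial, net=["embedding", "transformer"])
```

The trade-off is that a self-contained checkpoint likely stores the network object alongside its weights, so checkpoint files may grow.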