Commit 027e383
fix(canary): lower GPU canary batch size to 16 to fix OOM

PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim,
11 layers) without adjusting the GPU canary batch size. At batch_size=32
the model OOMs on 8xH100 trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving
~29GB headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).
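
For context, the smoke-test figures are self-consistent if activation memory scales roughly linearly with batch size. That scaling assumption and the derived per-sample figure are mine, not part of the measurement; the constants are the numbers quoted above:

```python
# Back-of-envelope check of the smoke-test numbers above, under the
# assumption (mine, not stated in the commit) that memory beyond the
# fixed weight/optimizer footprint scales roughly linearly with batch size.
usable_gb = 64.0        # usable HBM per device, as quoted above
peak_bs16_gb = 34.5     # measured peak at batch_size=16

# Headroom left at batch_size=16 for optimizer states and checkpointing.
headroom_gb = usable_gb - peak_bs16_gb  # 29.5 GB, the "~29GB" above

# Going 16 -> 32 adds 16 more samples' worth of activations; since
# batch_size=32 exceeds usable HBM, each sample must cost at least:
min_per_sample_gb = (usable_gb - peak_bs16_gb) / 16  # ~1.84 GB/sample

print(f"headroom at bs=16: {headroom_gb:.1f} GB")
print(f"implied per-sample activation cost: > {min_per_sample_gb:.2f} GB")
```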
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 7f309a5
1 file changed, +1 -1 lines changed (one line modified at file line 33; diff content not captured)