
Commit 027e383

yoblinclaude committed
fix(canary): lower GPU canary batch size to 16 to fix OOM
PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim, 11 layers) without adjusting the GPU canary batch size. At batch_size=32 the model OOMs on 8xH100 while trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving ~29GB of headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
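The headroom arithmetic in the commit message can be sketched as a quick check. This is illustrative only: the numbers come from the smoke test above, the 64GB usable-HBM figure is the commit's, and the helper name is ours, not part of the repo.

```python
# Rough headroom check for the GPU canary job. Figures are from the
# commit message's smoke test; nothing here is measured by this script.
USABLE_GB_PER_DEVICE = 64.0  # usable HBM per H100 in this cluster (per commit)

def headroom_gb(peak_gb: float, usable_gb: float = USABLE_GB_PER_DEVICE) -> float:
    """Memory left over for optimizer states and checkpointing."""
    return usable_gb - peak_gb

measured_peak = {16: 34.5}  # GB/device at batch_size=16, from the smoke test

# ~29.5GB free, matching the commit's "~29GB headroom"
print(headroom_gb(measured_peak[16]))
# ~0.54, matching the quoted "54% of 64GB usable"
print(measured_peak[16] / USABLE_GB_PER_DEVICE)
```

Note that batch_size=32 is not simply double the peak of batch_size=16 (that would be under 64GB); optimizer states and activation spikes push it over, which is why the smoke test, not extrapolation, is the evidence here.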
1 parent 7f309a5 commit 027e383

File tree: 1 file changed (+1, -1 lines)


.github/workflows/marin-canary-ferry-cw.yaml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
       env:
         RUN_ID: canary-gpu-${{ github.run_id }}-${{ github.run_attempt }}
         CANARY_ACCELERATOR: gpu
-        CANARY_BATCH_SIZE: "32"
+        CANARY_BATCH_SIZE: "16"
         CANARY_TARGET_TOKENS: "6553600"
         CANARY_MIN_STEPS: "40"
         CANARY_MAX_LOSS: "8.0"

0 commit comments
