Commit 027e383
fix(canary): lower GPU canary batch size to 16 to fix OOM

PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim,
11 layers) without adjusting the GPU canary batch size. At batch_size=32
the model OOMs on 8xH100 trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving
~29GB headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).
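
For context, the smoke-test figures are self-consistent if activation memory scales roughly linearly with batch size. That scaling assumption and the derived per-sample figure are mine, not part of the measurement; the constants are the numbers quoted above:

```python
# Back-of-envelope check of the smoke-test numbers above, under the
# assumption (mine, not stated in the commit) that memory beyond the
# fixed weight/optimizer footprint scales roughly linearly with batch size.
usable_gb = 64.0        # usable HBM per device, as quoted above
peak_bs16_gb = 34.5     # measured peak at batch_size=16

# Headroom left at batch_size=16 for optimizer states and checkpointing.
headroom_gb = usable_gb - peak_bs16_gb  # 29.5 GB, the "~29GB" above

# Going 16 -> 32 adds 16 more samples' worth of activations; since
# batch_size=32 exceeds usable HBM, each sample must cost at least:
min_per_sample_gb = (usable_gb - peak_bs16_gb) / 16  # ~1.84 GB/sample

print(f"headroom at bs=16: {headroom_gb:.1f} GB")
print(f"implied per-sample activation cost: > {min_per_sample_gb:.2f} GB")
```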
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 7f309a5
1 file changed, +1 -1 lines changed (one line modified at file line 33; diff content not captured)