fix(canary): lower GPU canary batch size to 16 to fix OOM #4195
Merged
PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim, 11 layers) without adjusting the GPU canary batch size. At batch_size=32 the model OOMs on 8xH100 trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving ~29GB headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
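As a rough sanity check on why the PR #4084 size bump blew the memory budget, the sketch below compares expert-weight volume before and after. It assumes expert FFN weights dominate and scale as `num_experts * hidden_dim^2` per layer; the real architecture details (FFN multiplier, attention parameters) are not given in this PR, and the helper name is hypothetical, so treat it as an order-of-magnitude estimate only.

```python
def moe_weight_scale(num_experts: int, hidden_dim: int, num_layers: int) -> int:
    """Relative expert-weight volume, assuming FFN weights ~ E * H^2 per layer.

    This is a back-of-envelope proxy, not the model's actual parameter count.
    """
    return num_experts * hidden_dim**2 * num_layers


old = moe_weight_scale(8, 512, 6)     # config before PR #4084
new = moe_weight_scale(64, 1024, 11)  # config after PR #4084

# 8x experts * 4x hidden_dim^2 * (11/6)x layers ~= 59x more expert weight volume
print(f"~{new / old:.0f}x more expert weight volume")  # ~59x
```

A ~59x jump in expert weight volume makes it unsurprising that a batch size tuned for the old config no longer fits.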
Helw150 pushed a commit that referenced this pull request on Apr 8, 2026
## Summary

- Lower `CANARY_BATCH_SIZE` from 32 to 16 in the CW GPU canary workflow
- PR #4084 increased the MoE model size (8→64 experts, 512→1024 hidden_dim, 6→11 layers) without adjusting the GPU canary batch size, causing OOM on 8xH100

## Smoke test results (CI H100 cluster)

| Batch Size | Peak Memory | Utilization | Status |
|------------|-------------|-------------|--------|
| 8 | 20.2 GB | 31.7% | PASS |
| **16** | **34.5 GB** | **54.1%** | **PASS** |
| 24 | 48.8 GB | 76.6% | PASS (tight) |
| 32 | >64 GB | OOM | FAIL |

batch_size=16 leaves ~29GB headroom for optimizer states and checkpointing.

## Test results

Ran `workflow_dispatch` on this branch with `target_tokens=524288` (8 steps):

- [Run 23654378044](https://github.com/marin-community/marin/actions/runs/23654378044)
- **Training completed all 8/8 steps with no OOM** (loss=8.62, 5.4s/step after compilation)
- Run cancelled after confirming success; the job was retrying due to a pre-existing profiler shutdown bug (`RuntimeError: No profile started`) unrelated to this change, triggered only by the very small step count

## Test plan

- [x] Trigger `workflow_dispatch` on this branch with small `target_tokens` to verify no OOM
- [ ] Verify nightly scheduled run passes after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: yoblin <268258002+yoblin@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
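The smoke-test measurements above are consistent with activation memory growing linearly in batch size. A minimal sketch of that fit, using only the numbers in the table (the helper name is hypothetical and not part of the workflow code):

```python
def fit_linear_memory(points):
    """Fit peak_gb = fixed + per_sample * batch from two (batch, GB) measurements."""
    (b1, m1), (b2, m2) = points
    per_sample = (m2 - m1) / (b2 - b1)
    fixed = m1 - per_sample * b1
    return fixed, per_sample


# Measured (batch_size, peak GB/device) pairs from the CI H100 smoke test.
fixed, per_sample = fit_linear_memory([(8, 20.2), (16, 34.5)])
# fixed ~= 5.9 GB, per_sample ~= 1.79 GB

predict = lambda b: fixed + per_sample * b
print(round(predict(24), 1))  # 48.8, matches the measured batch=24 row
print(round(predict(32), 1))  # 63.1, right against the 64 GB usable limit
```

The fit reproduces the batch=24 measurement exactly and puts batch=32 at ~63 GB per device, at the edge of the 64 GB budget before optimizer states and checkpointing are counted, which is consistent with the observed OOM.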