fix(canary): lower GPU canary batch size to 16 to fix OOM #4195

Merged: yonromai merged 1 commit into main from fix/canary-batch-size-16 on Mar 27, 2026

Conversation

@yonromai (Contributor) commented Mar 27, 2026

## Summary

- Lower `CANARY_BATCH_SIZE` from 32 to 16 in the CW GPU canary workflow
- PR #4084 ("Updated MoE Baseline") increased the MoE model size (8→64 experts, 512→1024 hidden_dim, 6→11 layers) without adjusting the GPU canary batch size, causing OOM on 8xH100

## Smoke test results (CI H100 cluster)

| Batch Size | Peak Memory | Utilization | Status |
|------------|-------------|-------------|--------|
| 8 | 20.2 GB | 31.7% | PASS |
| **16** | **34.5 GB** | **54.1%** | **PASS** |
| 24 | 48.8 GB | 76.6% | PASS (tight) |
| 32 | >64 GB | OOM | FAIL |

batch_size=16 leaves ~29GB headroom for optimizer states and checkpointing.

## Test results

Ran `workflow_dispatch` on this branch with `target_tokens=524288` (8 steps):

- [Run 23654378044](https://github.com/marin-community/marin/actions/runs/23654378044)
- **Training completed all 8/8 steps with no OOM** (loss=8.62, 5.4s/step after compilation)
- Run cancelled after confirming success; the job was retrying due to a pre-existing profiler shutdown bug (`RuntimeError: No profile started`) that is unrelated to this change and triggered only by the very small step count

## Test plan

- [x] Trigger `workflow_dispatch` on this branch with small `target_tokens` to verify no OOM
- [ ] Verify nightly scheduled run passes after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)
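As a sanity check on the table above, the utilization and headroom figures follow from a per-device memory budget of roughly 64 GB. A minimal sketch of that arithmetic (the exact 64.0 GB denominator is an assumption; the table's own rounding suggests the cluster reports slightly under 64 GB usable):

```python
# Minimal sketch of the smoke-test arithmetic, assuming ~64 GB of usable
# HBM per device (the budget implied by the table; actual usable memory
# depends on the runtime's preallocation settings).
USABLE_GB = 64.0

peaks_gb = {8: 20.2, 16: 34.5, 24: 48.8}  # batch size -> measured peak GB/device

for bs, peak in peaks_gb.items():
    util = peak / USABLE_GB
    headroom = USABLE_GB - peak
    print(f"batch_size={bs}: {util:.1%} used, {headroom:.1f} GB headroom")
```

At batch_size=16 this reproduces the ~29 GB headroom cited above (64.0 − 34.5 = 29.5 GB); the computed utilization lands within a few tenths of a point of the table's figures.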

PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim,
11 layers) without adjusting the GPU canary batch size. At batch_size=32
the model OOMs on 8xH100 trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving
  ~29GB headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
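The smoke test's step count is consistent with its token budget. A sketch of that arithmetic, where `target_tokens` and the batch size come from the PR but the 4096-token sequence length is an assumption (not stated here) that makes the observed 8 steps come out exactly:

```python
# Step-count arithmetic for the workflow_dispatch smoke test.
TARGET_TOKENS = 524_288   # target_tokens from the test run
GLOBAL_BATCH = 16         # the new CANARY_BATCH_SIZE
SEQ_LEN = 4_096           # assumed sequence length

tokens_per_step = GLOBAL_BATCH * SEQ_LEN   # 65,536 tokens per step
steps = TARGET_TOKENS // tokens_per_step
print(steps)  # -> 8, matching the "all 8/8 steps" result
```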
@yonromai yonromai requested a review from rjpower March 27, 2026 16:27
@yonromai yonromai merged commit 809e471 into main Mar 27, 2026
40 of 41 checks passed
@yonromai yonromai deleted the fix/canary-batch-size-16 branch March 27, 2026 16:27
Helw150 pushed a commit that referenced this pull request Apr 8, 2026