fix(canary): lower GPU canary batch size to 16 to fix OOM #4195

Merged: yonromai merged 1 commit into main from fix/canary-batch-size-16 on Mar 27, 2026

Conversation

@yonromai (Contributor) commented Mar 27, 2026

## Summary

- Lower `CANARY_BATCH_SIZE` from 32 to 16 in the CW GPU canary workflow
- PR #4084 ("Updated MoE Baseline") increased the MoE model size (8→64 experts, 512→1024 hidden_dim, 6→11 layers) without adjusting the GPU canary batch size, causing OOM on 8xH100

## Smoke test results (CI H100 cluster)

| Batch Size | Peak Memory | Utilization | Status |
|------------|-------------|-------------|--------|
| 8 | 20.2 GB | 31.7% | PASS |
| **16** | **34.5 GB** | **54.1%** | **PASS** |
| 24 | 48.8 GB | 76.6% | PASS (tight) |
| 32 | >64 GB | OOM | FAIL |

batch_size=16 leaves ~29GB headroom for optimizer states and checkpointing.

## Test results

Ran `workflow_dispatch` on this branch with `target_tokens=524288` (8 steps):

- [Run 23654378044](https://github.com/marin-community/marin/actions/runs/23654378044)
- **Training completed all 8/8 steps with no OOM** (loss=8.62, 5.4s/step after compilation)
- Run cancelled after confirming success; the job was retrying due to a pre-existing profiler shutdown bug (`RuntimeError: No profile started`) that is unrelated to this change and triggered only by the very small step count

## Test plan

- [x] Trigger `workflow_dispatch` on this branch with small `target_tokens` to verify no OOM
- [ ] Verify nightly scheduled run passes after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)
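As a sanity check on the table above, the utilization and headroom figures follow from a per-device memory budget of roughly 64 GB. A minimal sketch of that arithmetic (the exact 64.0 GB denominator is an assumption; the table's own rounding suggests the cluster reports slightly under 64 GB usable):

```python
# Minimal sketch of the smoke-test arithmetic, assuming ~64 GB of usable
# HBM per device (the budget implied by the table; actual usable memory
# depends on the runtime's preallocation settings).
USABLE_GB = 64.0

peaks_gb = {8: 20.2, 16: 34.5, 24: 48.8}  # batch size -> measured peak GB/device

for bs, peak in peaks_gb.items():
    util = peak / USABLE_GB
    headroom = USABLE_GB - peak
    print(f"batch_size={bs}: {util:.1%} used, {headroom:.1f} GB headroom")
```

At batch_size=16 this reproduces the ~29 GB headroom cited above (64.0 − 34.5 = 29.5 GB); the computed utilization lands within a few tenths of a point of the table's figures.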

PR #4084 increased the MoE model size (64 experts, 1024 hidden_dim,
11 layers) without adjusting the GPU canary batch size. At batch_size=32
the model OOMs on 8xH100 trying to allocate 29GB per device.

Smoke-tested on the CI H100 cluster:
- batch_size=16 peaks at 34.5GB/device (54% of 64GB usable), leaving
  ~29GB headroom for optimizer states and checkpointing.
- batch_size=32 OOMs (>64GB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
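The smoke test's step count is consistent with its token budget. A sketch of that arithmetic, where `target_tokens` and the batch size come from the PR but the 4096-token sequence length is an assumption (not stated here) that makes the observed 8 steps come out exactly:

```python
# Step-count arithmetic for the workflow_dispatch smoke test.
TARGET_TOKENS = 524_288   # target_tokens from the test run
GLOBAL_BATCH = 16         # the new CANARY_BATCH_SIZE
SEQ_LEN = 4_096           # assumed sequence length

tokens_per_step = GLOBAL_BATCH * SEQ_LEN   # 65,536 tokens per step
steps = TARGET_TOKENS // tokens_per_step
print(steps)  # -> 8, matching the "all 8/8 steps" result
```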
@yonromai yonromai requested a review from rjpower March 27, 2026 16:27
@yonromai yonromai merged commit 809e471 into main Mar 27, 2026
40 of 41 checks passed
@yonromai yonromai deleted the fix/canary-batch-size-16 branch March 27, 2026 16:27
Helw150 pushed a commit that referenced this pull request Apr 8, 2026