[Docs] Add HBM optimization guide and cross-links (#3595)

dlwh · web-flow · commit c4bdde461f0d · 2026-03-13T22:22:10.000-07:00
Add a technical reference guide for fitting JAX/Levanter/Haliax training in HBM and link it from high-signal docs so OOM guidance is easy to find.

- Add docs/references/hbm-optimization.md covering sharding, activation checkpointing/offloading, optimizer offload, nested sqrt(n) remat, batch/sequence controls, and practical memory tuning tips.
- Add a Technical Reference nav entry in mkdocs.yml.
- Link the guide from experiments/grug/README.md, docs/tutorials/train-an-lm.md, docs/tutorials/local-gpu.md, and .agents/skills/change-grug/SKILL.md.

Testing:

- ./infra/pre-commit.py --all-files --fix

Fixes #3594
diff --git a/.agents/skills/change-grug/SKILL.md b/.agents/skills/change-grug/SKILL.md
@@ -81,6 +81,7 @@ Keep it grug-style:
 - Equinox modules with `init` + `__call__`
 - minimal config knobs
 - keep legibility first; if a block gets hard to read, introduce a small local helper instead of adding framework indirection
+- when HBM is tight, use `docs/references/hbm-optimization.md` before introducing bespoke memory hacks
 
 ### 5) Delete stale paths
 
diff --git a/docs/references/hbm-optimization.md b/docs/references/hbm-optimization.md
@@ -0,0 +1,175 @@
+# Making Things Fit in HBM
+
+This guide is a practical checklist for JAX/Levanter/Haliax training runs that are close to OOM.
+
+The main knobs are:
+
+1. Shard more.
+2. Checkpoint and offload activations.
+3. Offload optimizer/parameter state.
+4. Use model parallelism where it actually helps.
+5. Use nested (`sqrt(n)`) checkpointing for scanned stacks.
+6. Reduce per-device batch (and sequence length if needed).
+
+## 1) Shard More (Usually the First Lever)
+
+If arrays are accidentally replicated instead of partitioned, HBM disappears fast.
+
+Use explicit placement at boundaries:
+
+- `hax.shard(...)` for Haliax `NamedArray` trees.
+- `jax.device_put(...)` for explicit initial placement.
+- `jax.sharding.reshard(...)` when you need to change sharding mid-pipeline.
+- For LMs, explicitly shard output projection / vocab-axis tensors so logits are partitioned rather than replicated.
+
+```python
+import jax
+from jax.sharding import NamedSharding, PartitionSpec as P
+
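+# Assumes `mesh` is an existing jax.sharding.Mesh with "data" and "model" axes
+# and that `params` is a 2-D array (or a pytree of 2-D arrays).
+# Example: shard parameters across data/model axes instead of replicating.
+param_sharding = NamedSharding(mesh, P("data", "model"))
+params = jax.device_put(params, param_sharding)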
+```
+
+For FSDP-style setups, confirm large parameter tensors are split across the data axis rather than replicated.
+In classic Levanter/Haliax codepaths, this is usually handled for you, but custom tensors and custom losses may still need explicit resharding.
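
One way to catch accidental replication is to walk a parameter tree and list the leaves whose sharding is fully replicated. The sketch below is illustrative only: `replicated_leaves` and `min_bytes` are hypothetical names, not Levanter or Haliax APIs, and on a single-device host every array trivially counts as replicated.

```python
import jax
import jax.numpy as jnp


def replicated_leaves(tree, min_bytes=0):
    """Return (path, shape, nbytes) for jax.Array leaves that are fully replicated."""
    found = []
    leaves_with_paths, _ = jax.tree_util.tree_flatten_with_path(tree)
    for path, leaf in leaves_with_paths:
        if not isinstance(leaf, jax.Array):
            continue
        if leaf.nbytes >= min_bytes and leaf.sharding.is_fully_replicated:
            found.append((jax.tree_util.keystr(path), leaf.shape, leaf.nbytes))
    return found


# Toy parameter tree (hypothetical names and shapes).
params = {"embed": jnp.zeros((1024, 64)), "bias": jnp.zeros((64,))}

# Only report leaves of at least 1 KiB so tiny biases do not clutter the output.
report = replicated_leaves(params, min_bytes=1024)
for name, shape, nbytes in report:
    print(name, shape, nbytes)
```

On a real mesh, large entries in this report are the first candidates for an explicit `jax.device_put` or `hax.shard` call.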
+
+## 2) Activation Checkpointing and Activation Offloading
+
+Checkpointing (rematerialization) trades compute for memory by saving fewer intermediates in forward and recomputing them in backward.
+
+Activation offloading is a variant: selected activations are moved from device memory to pinned host memory after forward, then moved back before backward.
+
+Conceptually, with JAX checkpoint policies you choose, per named intermediate, whether to:
+
+- Save on device.
+- Offload to host.
+- Recompute.
+
+In Haliax/Levanter scanned stacks, this is typically exposed via `gradient_checkpointing` policies (e.g. standard recompute, offload variants, nested variants).
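
As a minimal plain-JAX sketch of the "save vs. recompute" choice (not the Haliax/Levanter `gradient_checkpointing` API), the example below tags one intermediate with `checkpoint_name` and uses the `save_only_these_names` policy so only that intermediate is kept on device. The function `block`, the tag `"hidden"`, and all shapes are made up for illustration.

```python
import jax
import jax.numpy as jnp
from jax.ad_checkpoint import checkpoint_name


def block(w1, w2, x):
    # Tag the intermediate we are willing to keep in memory.
    h = checkpoint_name(jnp.tanh(x @ w1), "hidden")
    y = jnp.tanh(h @ w2)
    return jnp.sum(y ** 2)


# Save only the tagged intermediate; everything else is recomputed in backward.
policy = jax.checkpoint_policies.save_only_these_names("hidden")
block_remat = jax.checkpoint(block, policy=policy)

w1 = jnp.full((8, 16), 0.1)
w2 = jnp.full((16, 4), 0.1)
x = jnp.ones((2, 8))

g_plain = jax.grad(block)(w1, w2, x)
g_remat = jax.grad(block_remat)(w1, w2, x)
print(jnp.allclose(g_plain, g_remat))
```

The gradients are the same either way; only the set of saved residuals changes. The JAX host-offloading notebook linked below describes the corresponding offload policies.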
+
+References:
+
+- [JAX: Gradient checkpointing (`jax.checkpoint` / `jax.remat`)](https://docs.jax.dev/en/latest/gradient-checkpointing.html)
+- [JAX Memories and Host Offloading](https://docs.jax.dev/en/latest/notebooks/host-offloading.html)
+
+## 3) Explicit Offloading of Optimizer State (and Sometimes Params)
+
+Optimizer state is often one of the largest memory consumers (especially Adam-family optimizers).
+
+A common pattern is:
+
+1. Keep optimizer state in pinned host memory between steps.
+2. Bring it to device only for update math.
+3. Return the updated state to host.
+
+```python
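+from functools import partial
+
+import jax
+import optax
+
+# Assumes `params_sharding`, `loss_fn`, `optimizer`, and `opt_state` are defined elsewhere.
+s_dev = params_sharding
+s_host = s_dev.with_memory_kind("pinned_host")
+opt_state = jax.device_put(opt_state, s_host)
+
+# partial(...) keeps this portable to JAX versions where jax.jit must receive
+# the function as its first positional argument.
+@partial(jax.jit, donate_argnums=(0,), out_shardings=(s_dev, s_host))
+def train_step(params, opt_state, batch):
+    # Host -> device only for the update math.
+    opt_state = jax.device_put(opt_state, s_dev)
+    grads = jax.grad(loss_fn)(params, batch)
+    updates, opt_state = optimizer.update(grads, opt_state, params)
+    params = optax.apply_updates(params, updates)
+    # Device -> host so the state does not occupy HBM between steps.
+    return params, jax.device_put(opt_state, s_host)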
+```
+
+This usually buys substantial HBM headroom, at the cost of transfer bandwidth/latency.
+
+Reference:
+
+- [JAX Memories and Host Offloading (optimizer state + parameter offloading)](https://docs.jax.dev/en/latest/notebooks/host-offloading.html)
+
+## 4) Model Parallelism Can Beat "Max FSDP" in Some Regimes
+
+Sometimes parameter tensors or activations are too large even with aggressive data-axis sharding.
+In that case, giving devices to model/tensor parallel axes can reduce peak HBM even though it reduces FSDP degree.
+
+Rule of thumb: sweep a small grid of mesh shapes (for example, more `data` vs more `model`) and compare:
+
+- Peak HBM
+- Step time
+- Achievable global batch
+
+The best throughput-at-memory-budget point is often not the "maximum data parallel" point.
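
A small helper can enumerate the candidate grid for such a sweep. This is a sketch under the assumption of a two-axis (`data`, `model`) mesh; `mesh_shapes` is a hypothetical name, and each candidate still has to be run and measured for peak HBM and step time.

```python
def mesh_shapes(num_devices):
    """All (data, model) axis sizes whose product equals num_devices."""
    return [
        (num_devices // model, model)
        for model in range(1, num_devices + 1)
        if num_devices % model == 0
    ]


# Candidate meshes for 8 devices, from pure data parallel to pure model parallel.
candidates = mesh_shapes(8)
for data, model in candidates:
    print(f"data={data} model={model}")
```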
+
+## 5) `sqrt(n)` Checkpointing for Scanned Layer Stacks
+
+For a stack length `N`, nested checkpointing chunks the work into blocks of size `B` and stores only block boundaries.
+
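+Backward then holds about `N/B` saved block boundaries plus the `B` per-layer activations of the block currently being recomputed. With `B ~= sqrt(N)`, both terms are `O(sqrt(N))` instead of `O(N)`, at the cost of extra forward recomputation.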
+
+This is useful for deep scanned stacks where plain checkpointing/offloading still does not fit.
+In Haliax scanned modules, nested checkpointing is available as a policy option.
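
The idea can be sketched in plain JAX with two nested `jax.lax.scan` loops, assuming a stack of identical layers with stacked weights. This is not the Haliax policy implementation; `layer`, `stack_plain`, `stack_nested`, and the shapes are illustrative.

```python
import jax
import jax.numpy as jnp


def layer(x, w):
    return jnp.tanh(x @ w)


def stack_plain(ws, x):
    # Baseline: one scan over all N layers, saving residuals for every layer.
    def step(carry, w):
        return layer(carry, w), None

    y, _ = jax.lax.scan(step, x, ws)
    return jnp.sum(y)


def stack_nested(ws, x, block):
    n = ws.shape[0]
    assert n % block == 0, "stack length must be divisible by the block size"
    ws_blocks = ws.reshape(n // block, block, *ws.shape[1:])

    # Checkpointing the block means the outer scan keeps only the carry that
    # enters each block; the block's inner layers are recomputed in backward.
    @jax.checkpoint
    def run_block(carry, w_block):
        def step(c, w):
            return layer(c, w), None

        out, _ = jax.lax.scan(step, carry, w_block)
        return out, None

    y, _ = jax.lax.scan(run_block, x, ws_blocks)
    return jnp.sum(y)


ws = 0.1 * jax.random.normal(jax.random.PRNGKey(0), (16, 8, 8))  # N = 16 layers
x = jnp.ones((2, 8))

g_plain = jax.grad(stack_plain)(ws, x)
g_nested = jax.grad(stack_nested)(ws, x, 4)  # B = 4 = sqrt(16)
print(jnp.allclose(g_plain, g_nested, atol=1e-5))
```

Both versions compute the same gradients; the nested one trades roughly one extra forward pass for a smaller set of saved activations.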
+
+## 6) Reduce Per-Device Batch (and Sequence Length)
+
+If you are right at the limit:
+
+- Reduce microbatch/per-device batch.
+- If needed, reduce sequence length.
+- Recover global batch with gradient accumulation.
+
+These are the most direct and reliable HBM controls.
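
Gradient accumulation can be sketched as a scan over equal-sized microbatches that sums gradients and averages at the end. The example assumes a mean-reduced loss and a batch size divisible by the number of microbatches; `loss_fn` and `accumulated_grad` are hypothetical names, not trainer APIs.

```python
import jax
import jax.numpy as jnp


def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)


def accumulated_grad(w, x, y, num_micro):
    # Split the batch into `num_micro` equal microbatches along axis 0.
    xs = x.reshape(num_micro, -1, x.shape[-1])
    ys = y.reshape(num_micro, -1)

    def step(acc, xy):
        x_mb, y_mb = xy
        return acc + jax.grad(loss_fn)(w, x_mb, y_mb), None

    total, _ = jax.lax.scan(step, jnp.zeros_like(w), (xs, ys))
    # Averaging per-microbatch means matches the full-batch mean for equal sizes.
    return total / num_micro


k1, k2 = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(k1, (8, 4))
y = jax.random.normal(k2, (8,))
w = jnp.ones((4,))

g_full = jax.grad(loss_fn)(w, x, y)
g_accum = accumulated_grad(w, x, y, num_micro=4)
print(jnp.allclose(g_full, g_accum, atol=1e-5))
```

Only one microbatch of activations is live at a time, so peak activation memory scales with the microbatch size rather than the global batch.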
+
+## 7) Buffer Donation (`donate_argnums`)
+
+Donation lets JAX reuse input buffers for outputs at JIT boundaries, reducing peak live memory.
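
A minimal sketch of donation on a toy update step follows. `sgd_step` and the fixed learning rate are illustrative; backends that cannot reuse a donated buffer typically just emit a warning and fall back to a copy.

```python
from functools import partial

import jax
import jax.numpy as jnp


@partial(jax.jit, donate_argnums=(0,))
def sgd_step(params, grads):
    # Because `params` is donated, XLA may write the result into its buffers.
    return jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)


params = {"w": jnp.ones((4, 4))}
grads = {"w": jnp.full((4, 4), 2.0)}

# Rebind the name: the old `params` buffers must not be used after donation.
params = sgd_step(params, grads)
print(params["w"][0, 0])
```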
+
+Reference:
+
+- [JAX Buffer Donation](https://docs.jax.dev/en/latest/buffer_donation.html)
+
+## 8) Optimizer Choice Matters for Memory
+
+For equal parameter count, optimizer state memory can differ drastically.
+
+- Adam-like methods keep two full-size state tensors per parameter (first and second moments), often in FP32.
+- Memory-lean alternatives (where acceptable for your training regime) can materially reduce HBM pressure.
+
+If you keep Adam-family optimizers, offloading their state is often the practical compromise.
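
A back-of-the-envelope estimate makes the difference concrete. The helper below is a rough sketch that counts only parameters and optimizer state (no gradients, activations, or temporaries); `train_state_gib` and its defaults are assumptions for illustration.

```python
def train_state_gib(num_params, param_bytes=4, state_slots=2, state_bytes=4):
    """GiB for parameters plus `state_slots` full-size optimizer tensors."""
    total_bytes = num_params * (param_bytes + state_slots * state_bytes)
    return total_bytes / 2**30


n = 1_000_000_000  # a hypothetical 1B-parameter model
adam_gib = train_state_gib(n)                 # FP32 params + two FP32 moments
sgd_gib = train_state_gib(n, state_slots=0)   # FP32 params, stateless optimizer
print(f"Adam: {adam_gib:.1f} GiB, stateless SGD: {sgd_gib:.1f} GiB")
```

Divide the result by the sharding degree of the relevant axis to get a per-device figure.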
+
+## 9) Profile Memory Before and After Each Change
+
+Use JAX memory profiling tools to confirm what changed:
+
+- [JAX: Profiling device memory](https://docs.jax.dev/en/latest/device_memory_profiling.html)
+- [JAX: GPU memory allocation notes](https://docs.jax.dev/en/latest/gpu_memory_allocation.html)
+
+Memory tuning is much faster when each knob change is measured, not guessed.
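
For quick before/after checks, per-device allocator statistics can be read from `Device.memory_stats()`. This is a sketch, assuming the backend reports `bytes_in_use` and `peak_bytes_in_use`; some backends (for example CPU) return no stats, in which case the values are `None`. `hbm_snapshot` is a hypothetical helper name.

```python
import jax


def hbm_snapshot():
    """Collect current and peak bytes in use for each local device, if reported."""
    rows = []
    for dev in jax.local_devices():
        stats = dev.memory_stats() or {}  # may be None on some backends
        rows.append(
            {
                "device": str(dev),
                "bytes_in_use": stats.get("bytes_in_use"),
                "peak_bytes_in_use": stats.get("peak_bytes_in_use"),
            }
        )
    return rows


snapshot = hbm_snapshot()
for row in snapshot:
    print(row)
```

Taking one snapshot before and one after a training step gives a coarse signal; the profiling guides above give the per-allocation breakdown.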
+
+## 10) Avoid Giant Temporary Tensors
+
+Large temporaries can dominate peak memory even when parameter state fits.
+
+- Avoid materializing full-size intermediates when a fused/chunked computation exists.
+- For language models, the full logits tensor (`batch x seq x vocab`) is often the worst offender.
+- Use memory-efficient attention kernels/backends where available in your model stack.
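
As one illustration of the logits point, cross-entropy can be computed over token chunks so that only a `chunk x vocab` slice of logits is live at a time. This is a simplified sketch, not the fused loss used by any particular model stack; all function names and shapes are made up, and the token count is assumed to be divisible by `chunk`.

```python
import jax
import jax.numpy as jnp


def token_losses(h, w_out, labels):
    logits = h @ w_out  # (tokens, vocab)
    log_z = jax.nn.logsumexp(logits, axis=-1)
    picked = jnp.take_along_axis(logits, labels[:, None], axis=-1)[:, 0]
    return log_z - picked


def ce_full(h, w_out, labels):
    # Baseline: materializes the full (tokens, vocab) logits tensor.
    return jnp.mean(token_losses(h, w_out, labels))


def ce_chunked(h, w_out, labels, chunk):
    hs = h.reshape(-1, chunk, h.shape[-1])
    ls = labels.reshape(-1, chunk)

    # Checkpoint each chunk so its logits are recomputed in backward
    # instead of being saved for every chunk.
    @jax.checkpoint
    def chunk_sum(hc, w, lc):
        return jnp.sum(token_losses(hc, w, lc))

    def step(total, hl):
        hc, lc = hl
        return total + chunk_sum(hc, w_out, lc), None

    total, _ = jax.lax.scan(step, jnp.zeros((), h.dtype), (hs, ls))
    return total / labels.shape[0]


k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
h = jax.random.normal(k1, (16, 8))             # 16 tokens, hidden size 8
w_out = 0.1 * jax.random.normal(k2, (8, 32))   # vocab size 32
labels = jax.random.randint(k3, (16,), 0, 32)

loss_full = ce_full(h, w_out, labels)
loss_chunked = ce_chunked(h, w_out, labels, 4)
grads_full = jax.grad(ce_full, argnums=(0, 1))(h, w_out, labels)
grads_chunked = jax.grad(ce_chunked, argnums=(0, 1))(h, w_out, labels, 4)
print(jnp.allclose(loss_full, loss_chunked, atol=1e-5))
```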
+
+## 11) Keep EMA and Other Replicas Off HBM
+
+Extra full-parameter copies (for example EMA weights) can be expensive in HBM.
+
+- Keep long-lived replicas in host memory when possible.
+- Materialize them on-device only when needed (for eval/export windows).
+
+## 12) Use Lower Precision Where Safe
+
+Tensor memory scales linearly with dtype size: a BF16 tensor takes half the bytes of the same tensor in FP32.
+
+- Prefer BF16 activations/weights on hardware where it is standard.
+- Be explicit about which states must remain FP32 (often optimizer moments), then offload those if needed.
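
The size difference is easy to confirm directly from array metadata. A tiny sketch with an arbitrary shape:

```python
import jax.numpy as jnp

w_fp32 = jnp.zeros((1024, 1024), dtype=jnp.float32)
w_bf16 = w_fp32.astype(jnp.bfloat16)

# nbytes reports the logical size of the array's data.
print(w_fp32.nbytes, w_bf16.nbytes)
```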
+
+## 13) Tune Eval Memory Separately from Train
+
+Evaluation often has different memory pressure than training.
+
+- Set eval batch size independently.
+- Reduce concurrent eval tasks/checkpoints when needed.
+- Keep eval from overlapping peak-memory parts of training if your pipeline allows it.
diff --git a/docs/tutorials/local-gpu.md b/docs/tutorials/local-gpu.md
@@ -62,6 +62,8 @@ If you are using a DGX Spark or similar machine with unified memory, you may nee
     echo 'export XLA_PYTHON_CLIENT_MEM_FRACTION=0.5' >> ~/.bashrc
     ```
 
+    For broader JAX/Levanter memory tuning (sharding, checkpointing, offloading), see [Making Things Fit in HBM](../references/hbm-optimization.md).
+
 ## Running an Experiment
 
 Now you can run an experiment.
diff --git a/docs/tutorials/train-an-lm.md b/docs/tutorials/train-an-lm.md
@@ -97,6 +97,8 @@ Set up your training configuration by calculating the number of training steps a
     )
     ```
 
+If you hit HBM OOM while scaling model size, batch size, or sequence length, see [Making Things Fit in HBM](../references/hbm-optimization.md) for a practical tuning checklist.
+
 ## Creating the Training Pipeline
 
 Connect your model configuration, training parameters, and dataset to create a training pipeline:
diff --git a/experiments/grug/README.md b/experiments/grug/README.md
@@ -159,6 +159,7 @@ enforces these minimum interfaces:
 
 - Grug principles: [`/.agents/projects/grugformer.md`](../../.agents/projects/grugformer.md)
 - Change workflow: [`.agents/skills/change-grug/`](../../.agents/skills/change-grug/SKILL.md)
+- HBM/OOM tuning guide: [`/docs/references/hbm-optimization.md`](../../docs/references/hbm-optimization.md)
 - Executor mechanics: [`/docs/explanations/executor.md`](../../docs/explanations/executor.md)
 - Executor tutorial: [`/docs/tutorials/executor-101.md`](../../docs/tutorials/executor-101.md)
 - TPU debug workflow: [`/docs/dev-guide/dev_tpu.md`](../../docs/dev-guide/dev_tpu.md)
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -97,6 +97,7 @@ nav:
       - Executor API: references/executor-api.md
       - Default Steps: references/default-steps.md
       - Training Configuration: references/train-config.md
+      - HBM Optimization: references/hbm-optimization.md
 
 markdown_extensions:
   - markdown.extensions.footnotes
