Commit 5afc883
committed: [grug] Add MoE ragged debug launch artifacts
1 parent 0204cbe

4 files changed: 342 additions & 7 deletions

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Debugging log for ragged-all-to-all

Investigating why the 1e23 Grug MoE run diverges under `ragged_all_to_all` while the ring configuration is healthy.
## Initial status

The `moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_011404` run started with healthy step-0 metrics, then diverged later in training. By step 1000 it was far behind the old `ep4_ring` baseline, and by step 1250 it showed multi-million gradient norms, followed by a `NaN` eval at step 1259.
8+
9+
## Hypothesis 1
10+
11+
The ragged dispatch path is not semantically equivalent to the ring path under expert parallelism and later-training router distributions. Capacity clipping, dropped-assignment accounting, or recombination may diverge in a way that only appears once routing sharpens.
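To make the hypothesis concrete, here is a host-side sketch of capacity clipping and dropped-assignment accounting. This is illustrative only: `clip_to_capacity` is a hypothetical helper, not the grug implementation. It shows why a divergence between implementations could surface only once routing sharpens — uniform routing fits under capacity, while a sharpened router overflows and drops assignments.

```python
import numpy as np

def clip_to_capacity(selected_experts: np.ndarray, num_experts: int, capacity: int):
    """Per-expert capacity clipping: keep the first `capacity` assignments per
    expert (in token order) and count the rest as dropped.

    selected_experts: (tokens, topk) int array of expert ids.
    Returns a keep mask of the same shape plus the dropped-assignment count.
    """
    flat = selected_experts.reshape(-1)
    # Position of each assignment within its expert's queue, in token order.
    position_in_expert = np.zeros_like(flat)
    counts = np.zeros(num_experts, dtype=np.int64)
    for i, e in enumerate(flat):
        position_in_expert[i] = counts[e]
        counts[e] += 1
    keep = (position_in_expert < capacity).reshape(selected_experts.shape)
    dropped = int((~keep).sum())
    return keep, dropped

# Uniform routing fits; sharpened routing (every token picks expert 0) overflows.
uniform = np.array([[0, 1], [2, 3], [4, 5], [6, 7]])
sharp = np.zeros((4, 2), dtype=int)
_, d_uniform = clip_to_capacity(uniform, num_experts=8, capacity=2)
_, d_sharp = clip_to_capacity(sharp, num_experts=8, capacity=2)
print(d_uniform, d_sharp)  # 0 6
```

If the two dispatch implementations disagree on which assignments are dropped (or where the survivors land), the difference is invisible at initialization and large after sharpening.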
## Changes to make

- Relaunch the 1e23 config as ring `ep4` on the current Ray cluster for a fresh control run.
- Audit `experiments/grug/moe/model.py` and `lib/levanter/src/levanter/grug/grug_moe.py`.
- Run targeted TPU tests on `v5p-8` and `v5p-32` to compare ring vs ragged gradients for the functional MoE MLP block.
## Future Work

- [ ] Check whether ring and ragged produce materially different MLP block gradients on TPU pods.
- [ ] Confirm whether dropped-assignment behavior differs between implementations at equal capacity.
- [ ] Compare router-sharpening behavior after a few hundred optimization steps, not just at initialization.
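For the router-sharpening item, one illustrative metric is mean per-token entropy of the router distribution, which falls as routing sharpens. This helper is hypothetical (not from the codebase), just a sketch of what the comparison could measure:

```python
import numpy as np

def mean_router_entropy(router_logits: np.ndarray) -> float:
    """Mean per-token entropy of softmax(router_logits); decreases as routing sharpens."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))
early = mean_router_entropy(logits)        # near-uniform routing at init
late = mean_router_entropy(logits * 10.0)  # sharpened routing later in training
assert late < early
```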
## Results

- The host-side routing simulation matches ring exactly, so the abstract clip/permute/unpermute math was not the bug.
- The actual bug was in [grug_moe.py](/Users/dlwh/.codex/worktrees/8989/marin/lib/levanter/src/levanter/grug/grug_moe.py): `_shard_a2a_params` was feeding `jax.lax.ragged_all_to_all` receiver-local output offsets instead of the sender-side remote offsets that the primitive expects.
  - JAX internally transposes `output_offsets` with an `all_to_all`.
  - Our old code pre-transposed them, so real distributed runs wrote returned slices into the wrong positions.
  - That explains why a pure Python/JAX simulation of the routing logic looked correct while TPU runs showed large ring-vs-ragged output and gradient deltas.
- Existing EP coverage in [test_grugformer_moe.py](/Users/dlwh/.codex/worktrees/8989/marin/lib/levanter/tests/grug/test_grugformer_moe.py) only checks output shape and finiteness, not ring-vs-ragged parity.
- Fresh ring control relaunch:
  - Ray submission: `ray-run-dlwh-moe-uvtpu-ep4-ring-manual-20260417_152005`
  - W&B run id (expected once training initializes): `moe_1e23_d5120_bs2048_ep4_ring_rayuvtpu_20260417_152005`
  - The first relaunch attempt via `ray_run.py` failed during Ray runtime-env pip setup because `kitoken==0.10.2` was not available through the cluster-visible pip indexes.
  - A manual `ray job submit` without the Ray pip runtime-env is now running and has reached executor dispatch for `grug/moe_1e23_d5120_bs2048_ep4_ring`.
- TPU parity probes:
  - Initial `v5p-8` and `v5p-32` attempts failed because the probe lived under untracked `scratch/`, which the Iris workspace bundle did not include.
  - The probe now lives at [scripts/debug/grug_moe_grad_compare.py](/Users/dlwh/.codex/worktrees/8989/marin/scripts/debug/grug_moe_grad_compare.py) and compiles locally.
  - First tracked-path jobs submitted:
    - `/dlwh/grug-moe-grad-compare-v5p8-20260417-0828`
    - `/dlwh/grug-moe-grad-compare-v5p32-20260417-0828`
  - Those jobs later failed with entrypoint-container OOM (`exit 137`) under the default `1GB` host-memory request, so they did not yet exercise the MoE kernel.
  - Region-widened jobs were submitted next:
    - `/dlwh/grug-moe-grad-compare-v5p8-20260417-0833`
    - `/dlwh/grug-moe-grad-compare-v5p32-20260417-0833`
  - Final corrected jobs use both `us-central1` and `us-east5` plus `--memory 8GB`:
    - `/dlwh/grug-moe-grad-compare-v5p8-20260417-0835`
    - `/dlwh/grug-moe-grad-compare-v5p32-20260417-0835`
- Successful `v5p-8` probe after the `_shard_a2a_params` fix:
  - Job: `/dlwh/grug-moe-grad-compare-v5p8-20260417-094052`
  - Normal routed case now matches:
    - `ring_loss == ragged_loss == 995541.4375`
    - `ring_dropped == ragged_dropped == 9`
    - `output_diff.rel_l2 = 4.17e-08`
    - `grad_x_diff.rel_l2 = 4.20e-08`
    - `grad_w_up_gate_diff = 0`
    - `grad_w_down_diff = 0`
  - Forced-overflow case still matches exactly with zero diffs.
- Local regression coverage added:
  - `_shard_a2a_params` now has a unit test asserting sender-side output offsets.
  - A parity test now checks `ring` vs `ragged_all_to_all` MoE outputs when EP is available on a non-CPU backend.
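A toy numpy illustration of the offset bookkeeping behind the fix (a sketch of the concept only, not the actual `_shard_a2a_params` code): the same offsets table can be read receiver-by-receiver or sender-by-sender, and the two views are transposes of each other.

```python
import numpy as np

# send_sizes[s, r]: rows device s sends to device r.
send_sizes = np.array([
    [2, 1, 0],
    [0, 3, 1],
    [1, 0, 2],
])

# offsets[s, r]: where sender s's block starts inside receiver r's output,
# i.e. an exclusive cumulative sum of send_sizes down each receiver column.
offsets = np.cumsum(send_sizes, axis=0) - send_sizes

ndev = send_sizes.shape[0]
# Sender-side view (what the primitive expects): device d passes its row,
# "for each receiver, where my block lands over there".
correct_args = [offsets[d, :] for d in range(ndev)]
# Receiver-local view (what the buggy code passed): device d's column,
# "where each incoming block lands in my own output".
buggy_args = [offsets[:, d] for d in range(ndev)]

# Because the primitive transposes output_offsets internally, pre-transposing
# amounts to a double transpose: devices scatter slices using another device's
# offsets whenever the table is asymmetric -- which it always is once routing
# is imbalanced.
for d in range(ndev):
    print(d, correct_args[d].tolist(), buggy_args[d].tolist())
```

Note the two views only coincide for a symmetric table, which is why small symmetric toy cases (and host-side simulations that never shard the table) can look correct while real imbalanced runs corrupt outputs.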

experiments/grug/moe/launch.py

Lines changed: 8 additions & 7 deletions
@@ -119,7 +119,7 @@ def run_grug_moe_trial(config: GrugMoeLaunchConfig) -> None:
     run_grug(run_config)
 
 
-RESOLVED_RUN_ID = _resolve_run_id("moe_1e23_d5120_bs2048_ep8_ragged")
+RESOLVED_RUN_ID = _resolve_run_id("moe_1e23_d5120_bs2048_ep4_ring")
 
 
 # 1e23 compute budget, d5120. Model +
@@ -129,18 +129,19 @@ def run_grug_moe_trial(config: GrugMoeLaunchConfig) -> None:
 _BASELINE_BUDGET: float = 1e23
 _BASELINE_HIDDEN_DIM: int = 5120
 _BASELINE_TARGET_STEPS: int = 120_000
+_BASELINE_NUM_LAYERS_OVERRIDE: int | None = 48
 _baseline_model, _baseline_optimizer, _baseline_batch, _baseline_steps = build_from_heuristic(
     budget=_BASELINE_BUDGET,
     hidden_dim=_BASELINE_HIDDEN_DIM,
     target_steps=_BASELINE_TARGET_STEPS,
 )
-# Stack MoE blocks via jax.lax.scan to keep XLA compile + peak HBM tractable at
-# the heuristic-derived depth, and force ragged dispatch so the smoke exercises
-# the high-EP path from #4697.
+# Match the known-good 1e23 ring EP=4 configuration while keeping the current
+# v4-2048/us-central2 launch wiring.
 _baseline_model = dataclasses.replace(
     _baseline_model,
-    moe_implementation="ragged_all_to_all",
+    moe_implementation="ring",
     use_array_stacked_blocks=True,
+    num_layers=_BASELINE_NUM_LAYERS_OVERRIDE or _baseline_model.num_layers,
 )
 
 # Override the heuristic-derived batch_size (round_up_pow2 only produces powers
@@ -157,7 +158,7 @@ def run_grug_moe_trial(config: GrugMoeLaunchConfig) -> None:
 
 
 baseline_moe = ExecutorStep(
-    name="grug/moe_1e23_d5120_bs2048_ep8_ragged",
+    name="grug/moe_1e23_d5120_bs2048_ep4_ring",
     fn=run_grug_moe_trial,
     config=GrugMoeLaunchConfig(
         model=versioned(_baseline_model),
@@ -169,7 +170,7 @@ def run_grug_moe_trial(config: GrugMoeLaunchConfig) -> None:
         resources=versioned(ResourceConfig.with_tpu("v4-2048", regions=["us-central2"])),
         steps=versioned(_baseline_steps),
         batch_size=versioned(_baseline_batch),
-        expert_parallel=versioned(8),
+        expert_parallel=versioned(4),
         seed=versioned(0),
         mp=versioned("params=float32,compute=bfloat16,output=bfloat16"),
         tracker=WandbConfig(
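For intuition on the truncated comment above about `round_up_pow2`: a plausible implementation of such a helper is sketched below (hypothetical — the real function lives in the launch heuristics and may differ), which explains why the heuristic-derived batch size needs an explicit override to hit an exact target like 2048.

```python
def round_up_pow2(n: int) -> int:
    # One plausible reading of the helper named in the comment: smallest power
    # of two greater than or equal to n.
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()

# A heuristic batch size just past a power of two nearly doubles.
print([round_up_pow2(n) for n in (1500, 2048, 2049)])  # [2048, 2048, 4096]
```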
scripts/debug/grug_moe_grad_compare.py

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
#!/usr/bin/env python3
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import json

import numpy as np

import jax
import jax.numpy as jnp
from jax.experimental import multihost_utils
from jax.sharding import AxisType, Mesh, NamedSharding, PartitionSpec as P

from iris.runtime.jax_init import initialize_jax
from levanter.grug.grug_moe import moe_mlp
from levanter.utils.activation import ActivationFunctionEnum


def _make_ep_mesh() -> Mesh:
    devices = jax.devices()
    if len(devices) < 2 or len(devices) % 2 != 0:
        raise RuntimeError(f"Need an even number of devices >= 2, got {len(devices)}")
    mesh_devices = np.array(devices).reshape(len(devices) // 2, 2, 1)
    return Mesh(
        mesh_devices,
        axis_names=("data", "expert", "model"),
        axis_types=(AxisType.Explicit, AxisType.Explicit, AxisType.Explicit),
    )


def _make_inputs(
    *,
    key: jax.Array,
    tokens: int,
    hidden_dim: int,
    intermediate_dim: int,
    num_experts: int,
    topk: int,
    overflow: bool,
) -> tuple[jax.Array, jax.Array, jax.Array, jax.Array, jax.Array]:
    k_x, k_sel, k_logits, k_w13, k_w2 = jax.random.split(key, 5)
    x = jax.random.normal(k_x, (tokens, hidden_dim), dtype=jnp.float32)
    if overflow:
        selected_experts = jnp.zeros((tokens, topk), dtype=jnp.int32)
        combine_weights = jnp.full((tokens, topk), 1.0 / topk, dtype=jnp.float32)
    else:
        selected_experts = jax.random.randint(k_sel, (tokens, topk), 0, num_experts, dtype=jnp.int32)
        combine_logits = jax.random.normal(k_logits, (tokens, topk), dtype=jnp.float32)
        combine_weights = jax.nn.softmax(combine_logits, axis=-1)
    w_up_gate = jax.random.normal(k_w13, (num_experts, hidden_dim, 2 * intermediate_dim), dtype=jnp.float32)
    w_down = jax.random.normal(k_w2, (num_experts, intermediate_dim, hidden_dim), dtype=jnp.float32)
    return x, selected_experts, combine_weights, w_up_gate, w_down


def _tree_diff_stats(a, b) -> dict[str, float]:
    leaves_a = jax.tree.leaves(a)
    leaves_b = jax.tree.leaves(b)
    max_abs = 0.0
    max_rel = 0.0
    l2_sq = 0.0
    ref_l2_sq = 0.0
    for xa, xb in zip(leaves_a, leaves_b, strict=True):
        da = np.asarray(xa)
        db = np.asarray(xb)
        diff = np.abs(da - db)
        max_abs = max(max_abs, float(diff.max(initial=0.0)))
        denom = np.maximum(np.abs(db), 1e-12)
        max_rel = max(max_rel, float((diff / denom).max(initial=0.0)))
        l2_sq += float(np.sum((da - db) ** 2))
        ref_l2_sq += float(np.sum(db**2))
    return {
        "max_abs": max_abs,
        "max_rel": max_rel,
        "l2": l2_sq**0.5,
        "ref_l2": ref_l2_sq**0.5,
        "rel_l2": (l2_sq**0.5) / max(ref_l2_sq**0.5, 1e-12),
    }


def _host_array(x: jax.Array) -> np.ndarray:
    if jax.process_count() > 1 and getattr(x, "ndim", 0) > 0:
        x = multihost_utils.process_allgather(x, tiled=True)
    return np.asarray(x)


def _host_scalar(x: jax.Array) -> float:
    return float(np.asarray(x))


def _run_case(mesh: Mesh, *, overflow: bool) -> dict[str, object]:
    hidden_dim = 128
    intermediate_dim = 256
    num_experts = 8
    topk = 4
    tokens = max(len(jax.devices()) * 16, 64)

    with jax.set_mesh(mesh):
        x, selected_experts, combine_weights, w_up_gate, w_down = _make_inputs(
            key=jax.random.key(17 if overflow else 7),
            tokens=tokens,
            hidden_dim=hidden_dim,
            intermediate_dim=intermediate_dim,
            num_experts=num_experts,
            topk=topk,
            overflow=overflow,
        )

        batch_sharding = NamedSharding(mesh, P(("data", "expert"), None))
        expert_sharding = NamedSharding(mesh, P("expert", None, None))
        x = jax.sharding.reshard(x, batch_sharding)
        selected_experts = jax.sharding.reshard(selected_experts, batch_sharding)
        combine_weights = jax.sharding.reshard(combine_weights, batch_sharding)
        w_up_gate = jax.sharding.reshard(w_up_gate, expert_sharding)
        w_down = jax.sharding.reshard(w_down, expert_sharding)

        def run_impl(implementation: str):
            def loss_and_drop(
                x_arg,
                selected_experts_arg,
                combine_weights_arg,
                w_up_gate_arg,
                w_down_arg,
            ):
                out, dropped = moe_mlp(
                    x_arg,
                    selected_experts_arg,
                    combine_weights_arg,
                    w_up_gate_arg,
                    w_down_arg,
                    activation=ActivationFunctionEnum.silu,
                    implementation=implementation,
                    mesh=None,
                    report_capacity_overflow=True,
                    capacity_factor=1.0,
                )
                loss = jnp.mean(out.astype(jnp.float32) ** 2)
                return loss, (out, dropped)

            fn = jax.jit(jax.value_and_grad(loss_and_drop, has_aux=True, argnums=(0, 3, 4)))
            (loss, (out, dropped)), grads = fn(x, selected_experts, combine_weights, w_up_gate, w_down)
            return loss, out, dropped, grads

        ring_loss, ring_out, ring_dropped, ring_grads = run_impl("ring")
        ragged_loss, ragged_out, ragged_dropped, ragged_grads = run_impl("ragged_all_to_all")

    ring_loss = _host_scalar(ring_loss)
    ragged_loss = _host_scalar(ragged_loss)
    ring_out_np = _host_array(ring_out)
    ragged_out_np = _host_array(ragged_out)
    ring_grad_x = _host_array(ring_grads[0])
    ragged_grad_x = _host_array(ragged_grads[0])
    ring_grad_w_up_gate = _host_array(ring_grads[1])
    ragged_grad_w_up_gate = _host_array(ragged_grads[1])
    ring_grad_w_down = _host_array(ring_grads[2])
    ragged_grad_w_down = _host_array(ragged_grads[2])

    return {
        "overflow": overflow,
        "tokens": tokens,
        "num_devices": len(jax.devices()),
        "num_processes": jax.process_count(),
        "ring_loss": ring_loss,
        "ragged_loss": ragged_loss,
        "loss_delta": ring_loss - ragged_loss,
        "ring_dropped": int(np.asarray(ring_dropped)),
        "ragged_dropped": int(np.asarray(ragged_dropped)),
        "output_diff": _tree_diff_stats(ring_out_np, ragged_out_np),
        "grad_x_diff": _tree_diff_stats(ring_grad_x, ragged_grad_x),
        "grad_w_up_gate_diff": _tree_diff_stats(ring_grad_w_up_gate, ragged_grad_w_up_gate),
        "grad_w_down_diff": _tree_diff_stats(ring_grad_w_down, ragged_grad_w_down),
    }


def main() -> None:
    initialize_jax()
    mesh = _make_ep_mesh()
    normal = _run_case(mesh, overflow=False)
    overflow = _run_case(mesh, overflow=True)
    if jax.process_index() == 0:
        print(json.dumps({"normal": normal, "overflow": overflow}, indent=2, sort_keys=True), flush=True)


if __name__ == "__main__":
    main()
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
#!/usr/bin/env python3
# Copyright The Marin Authors
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import dataclasses

from experiments.grug.moe.launch import (
    ExecutorStep,
    GrugEvalConfig,
    GrugMoeLaunchConfig,
    GrugTrainerConfig,
    NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
    WandbConfig,
    _baseline_batch,
    _baseline_model,
    _baseline_optimizer,
    _baseline_steps,
    _resolve_run_id,
    executor_main,
    run_grug_moe_trial,
    this_output_path,
    versioned,
)
from fray.cluster import ResourceConfig

RUN_ID = _resolve_run_id("moe_1e23_d5120_bs2048_ep8_ragged_48l_rayuvtpu_20260417_0945")
STEP_NAME = "grug/moe_1e23_d5120_bs2048_ep8_ragged_48l_fix_a2a_20260417_0945"


ragged_ep8_fix = ExecutorStep(
    name=STEP_NAME,
    fn=run_grug_moe_trial,
    config=GrugMoeLaunchConfig(
        model=versioned(
            dataclasses.replace(
                _baseline_model,
                moe_implementation="ragged_all_to_all",
                use_array_stacked_blocks=True,
                num_layers=48,
            )
        ),
        data=NEMOTRON_MIX_WITH_DEFAULT_VALIDATION,
        output_path=this_output_path(),
        run_id=RUN_ID,
        resources=versioned(ResourceConfig.with_tpu("v4-2048", regions=["us-central2"])),
        steps=versioned(_baseline_steps),
        batch_size=versioned(_baseline_batch),
        expert_parallel=versioned(8),
        seed=versioned(0),
        mp=versioned("params=float32,compute=bfloat16,output=bfloat16"),
        tracker=WandbConfig(
            project="dial_moe",
            tags=["adamh", "qb", "sharded-qb", "gatednorm", "xsa", "zloss", "eq3e3", "ragged-fix"],
            group="moe-iter04",
            name=None,
        ),
        optimizer=versioned(_baseline_optimizer),
        priority_band="production",
        grug_trainer=versioned(
            GrugTrainerConfig(
                z_loss_weight=1e-4,
                ema_beta=None,
                log_every=1,
            )
        ),
        eval=versioned(
            GrugEvalConfig(
                eval_batch_size=1024,
                steps_per_eval=1000,
                max_eval_batches=8,
                eval_current=True,
                eval_ema=False,
            )
        ),
    ),
)


if __name__ == "__main__":
    executor_main(
        steps=[ragged_ep8_fix],
        description="Grug MoE 1e23 ragged EP8 relaunch after ragged_all_to_all offset fix.",
    )
