grug/moe: restore aux-loss metrics and remove smoke launcher #3229
Conversation

Claude finished @dlwh's task in 1m 8s — Review
Restores router aux-loss metrics (load balancing, z-loss, entropy, per-layer histograms) through the MoE forward pass and train loop, adds a configurable TPU type for launches, and provides a smoke script. Fulfills the stated objectives.

Findings: no other bugs, correctness issues, or guideline violations found. The aux-loss computation (a coefficient-weighted sum added to the CE loss, with raw and weighted terms logged separately) follows the MaxText/Megatron pattern described in the comments. Router stats (…)
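For readers following along, a minimal sketch of the loss pattern the review describes — raw CE plus coefficient-weighted router aux terms, each logged separately. All names here are illustrative, not the PR's actual identifiers; only `load_balancing_loss_per_layer` appears in the diff below, and `router_z_loss_per_layer` is assumed by analogy.

```python
import jax.numpy as jnp

def total_loss_with_aux(ce_loss, router_metrics, lb_coef=0.01, z_coef=0.001):
    """Sketch: CE loss plus coefficient-weighted router aux losses.

    Raw and weighted terms are kept separate so both can be logged,
    mirroring the MaxText/Megatron-style pattern the review mentions.
    """
    # Raw (unweighted) per-layer aux terms, summed over layers.
    lb_raw = jnp.sum(router_metrics["load_balancing_loss_per_layer"])
    z_raw = jnp.sum(router_metrics["router_z_loss_per_layer"])  # key assumed

    aux_weighted = lb_coef * lb_raw + z_coef * z_raw
    logs = {
        "train/ce_loss": ce_loss,
        "train/load_balancing_loss_raw": lb_raw,
        "train/router_z_loss_raw": z_raw,
        "train/aux_loss_weighted": aux_weighted,
    }
    return ce_loss + aux_weighted, logs
```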
@dlwh-golem will you try out this branch on your local machine, modifying launch.py to use gpus=1 and probably a smaller batch? just get it to run like 10 steps. change it to TinyStories like in the tutorial gpu run
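For context, the kind of local override being asked for might look like this (a sketch; every field name here is hypothetical — check the actual launch.py config):

```python
# Hypothetical local-smoke overrides for experiments/grug/moe/launch.py.
# Field names are illustrative; the real config dataclass may differ.
config = dict(
    gpus=1,                            # single local GPU instead of a TPU slice
    train_batch_size=8,                # small batch so it fits on one device
    num_train_steps=10,                # just enough to prove the loop runs
    dataset="roneneldan/TinyStories",  # tutorial-style tiny dataset
)
```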
I updated …

What changed for the requested local smoke setup: …

Run attempt: …

So the branch now matches the requested 1-GPU/TinyStories/10-step smoke config, but I couldn't validate the 10 training steps in this runtime due to missing CUDA-enabled JAX.
@dlwh-golem you need to install the gpu packages. follow the tutorial for gpus
Followed the GPU tutorial setup path and installed the GPU Python stack: …

Then I retried the PR smoke run: …

Result: …

Quick check on this worker also shows … If you want, next I can run the same command outside the sandbox on the host runtime (or on a different GPU host) to confirm it reaches 10 steps.
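A device-visibility check of the kind referenced above could look like the following (standard JAX API; the actual command and output were elided in the original):

```python
import jax

# Prints the active backend ("gpu" or "cpu") and the visible devices.
# If CUDA-enabled jaxlib is missing, this falls back to CPU devices only.
print(jax.default_backend())
print(jax.devices())
```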
@dlwh-golem try one more time. fixed some config on your end
Sorry, I hit an internal failure while handling this mention and couldn't complete the request. Please mention …

4 similar comments
Retried just now after syncing to the latest branch commit (…):

```bash
UV_CACHE_DIR=/tmp/uv-cache \
HF_HOME=/tmp/hf-home \
HF_DATASETS_CACHE=/tmp/hf-home/datasets \
HUGGINGFACE_HUB_CACHE=/tmp/hf-home/hub \
WANDB_MODE=disabled \
uv run python experiments/grug/moe/launch.py --prefix /tmp/grug-moe-local-smoke
```

Result is still the same failure in this harness runtime: …

I also attempted to rerun it outside the sandbox (so it can use host GPU drivers directly), but that escalation was not approved in this run, so I'm blocked from validating the 10 training steps here.

Exact next action needed: …
@dlwh-golem that's not minimal. minimal would be a direct jit call with the smallest function you can manage that reproduces it
Agreed. Here's a much smaller repro I'll use for the issue (once this runner can actually post to GitHub):

```python
# minimal_triton_pallas_repro.py
import jax
import jax.numpy as jnp

from levanter.kernels.pallas.fused_cross_entropy_loss import (
    BlockSizes,
    fused_cross_entropy_loss_and_logsumexp_penalty,
)

# Smallest direct-jit shape I could reduce to while still exercising the pallas_gpu CE kernel.
B, H, V = 16, 16, 65536

@jax.jit
def repro(x, y, w):
    # Force the pallas GPU path directly.
    return fused_cross_entropy_loss_and_logsumexp_penalty(
        x,
        y,
        w,
        reduction=None,
        dtype=jnp.float32,
        implementation="pallas_gpu",
        block_sizes=BlockSizes(b_block_size=16, h_block_size=16, v_block_size=1024),
    )

x = jnp.zeros((B, H), dtype=jnp.bfloat16)
w = jnp.zeros((H, V), dtype=jnp.bfloat16)
y = jnp.zeros((B,), dtype=jnp.int32)

# Compile + run
_ = repro(x, y, w).block_until_ready()
print("ok")
```

If this exact shape doesn't trigger it on your machine, increase only …

I'm still blocked from opening the issue directly from this runner because …
@dlwh-golem you have a machine. trigger it
Tried to trigger it directly on this machine right now, but this harness sandbox is still blocking execution: …

So I cannot actually execute the JIT compile path from this harness runtime despite trying multiple routes. The direct minimal reproducer to run on the host shell (outside this sandbox) is:

```python
import jax
import jax.numpy as jnp

from levanter.kernels.pallas.fused_cross_entropy_loss import (
    BlockSizes,
    fused_cross_entropy_loss_and_logsumexp_penalty,
)

@jax.jit
def repro(x, y, w):
    return fused_cross_entropy_loss_and_logsumexp_penalty(
        x, y, w,
        reduction=None,
        dtype=jnp.float32,
        implementation="pallas_gpu",
        block_sizes=BlockSizes(b_block_size=16, h_block_size=16, v_block_size=1024),
    )

x = jnp.zeros((16, 16), dtype=jnp.bfloat16)
w = jnp.zeros((16, 65536), dtype=jnp.bfloat16)
y = jnp.zeros((16,), dtype=jnp.int32)
repro(x, y, w).block_until_ready()
```

If …
@dlwh-golem try again
Retried on this machine with writable local temp/cache paths and got a full successful 10-step run after fixing the Triton failure path.

So this branch now runs the requested TinyStories/1-GPU smoke to 10 steps on this host.
@dlwh-golem ok open an issue for the gmm hack fix. what we should really do is make an op with backends similar to what we do for attention and cross entropy
Blocked by runner policy again: I attempted to create the issue, but I can't post it directly from this turn, so here is the exact issue text, ready to file.

Title: …

Body:

## Summary
PR #3229 currently includes a temporary GPU workaround in `haliax.nn.linear.gmm_sharded`:
- on GPU, use `jax.lax.ragged_dot_general`
- otherwise keep the megablox/pallas GMM path

This was added to avoid the runtime failure:
`NotImplementedError: dynamic grid bounds not supported in the Triton backend`

Commit with workaround:
- d9857a220 (`haliax: fallback GPU gmm_sharded to ragged_dot_general`)

## Problem
`gmm_sharded` currently mixes backend policy with implementation details in one path. This is brittle and not aligned with how we already handle backend-specific kernels for attention and fused cross-entropy.

## Proposal
Introduce a proper backend-dispatched GMM op with explicit implementations, similar to the attention / cross-entropy kernel dispatch:
- `implementation="auto|xla|pallas_gpu|..."`-style API
- backend capability checks + structured fallback behavior
- centralized warning/telemetry when a fallback occurs
- keep callsites backend-agnostic (MoE code should not hardcode hacky backend conditionals)

## Immediate follow-up tasks
1. Add a `haliax` GMM dispatch API (new module similar in shape to the existing backend-dispatch kernels).
2. Move the current GPU ragged fallback behind that dispatch layer as an explicit implementation.
3. Keep the megablox/pallas path as preferred when supported.
4. Add regression tests:
   - backend selection logic
   - fallback behavior on unsupported GPU Triton lowering
   - numerical parity checks across implementations within tolerance
5. Update MoE paths to call the dispatched GMM API only.

## Acceptance criteria
- No inline backend hacks in `gmm_sharded`.
- MoE smoke launch runs on GPU without the Triton dynamic-grid crash.
- Backend behavior is explicit, test-covered, and matches the existing kernel-dispatch architecture style.

Exact next action needed: allow one escalated …
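To make the proposal concrete, a minimal sketch of what such a dispatch layer could look like. All names here (`gmm`, `_IMPLS`, the implementation strings) are hypothetical, not existing haliax API; `jax.lax.ragged_dot` is used as a stand-in XLA reference path.

```python
from typing import Callable, Literal

import jax

Impl = Literal["auto", "xla", "pallas_gpu"]

def _gmm_xla(lhs, rhs, group_sizes):
    # Reference path: XLA ragged grouped matmul.
    # lhs: (m, k), rhs: (num_groups, k, n), group_sizes: (num_groups,)
    return jax.lax.ragged_dot(lhs, rhs, group_sizes)

# Hypothetical registry; a real version would also register the
# megablox/pallas kernel under "pallas_gpu".
_IMPLS: dict[str, Callable] = {"xla": _gmm_xla}

def gmm(lhs, rhs, group_sizes, *, implementation: Impl = "auto"):
    """Sketch of a backend-dispatched grouped matmul (GMM).

    Mirrors the attention / fused-CE dispatch pattern the issue asks for:
    explicit implementation strings, a capability check, and a structured
    fallback instead of inline backend hacks at the callsite.
    """
    if implementation == "auto":
        # Capability check: prefer the pallas kernel where it lowers cleanly;
        # fall back to the XLA ragged path on backends where Triton lowering
        # is unsupported (the crash this PR works around).
        implementation = "pallas_gpu" if "pallas_gpu" in _IMPLS else "xla"
    if implementation not in _IMPLS:
        raise ValueError(f"unknown/unsupported GMM implementation: {implementation!r}")
    return _IMPLS[implementation](lhs, rhs, group_sizes)
```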
Opened the issue as requested: …
Review comment on:

```python
routing_entropy = -jnp.sum(expert_loads * jnp.log(expert_loads + 1e-6))
f = expert_loads * (num_experts / num_experts_per_token)
p = jnp.mean(router_probs_f, axis=0)
load_balancing_loss = jnp.sum(f * p)
```
Do we have a reference source for this implementation of load_balancing_loss?

If I'm reading this right, our implementation scales the loss down as K grows, whereas Switch and OlmoE scale it up as K grows. Maybe my notation is off. Under balanced conditions (T tokens, K experts per token, E experts):

```
expert_counts     = T*K/E
total_assignments = T*K
expert_loads      = (T*K/E) / (T*K) = 1/E
f = expert_loads * (E/K) = 1/K
p = 1/E
layer_loss = lbl_coef * sum_over_E(f*p) = lbl_coef * sum_over_E(1/(K*E)) = lbl_coef / K
```
An alternative implementation from OlmoE (https://arxiv.org/pdf/2409.02060, page 12) and Switch is:

```
layer_loss = coef * E * sum_over_E(f*p)
f = fraction of tokens routed to an expert = K/E
p = 1/E
layer_loss = coef * E * sum_over_E((K/E) * (1/E)) = coef * E * (K/E) = coef * K
```
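A quick numeric check of the scaling difference under perfectly balanced routing (a sketch; `E`, `K`, and the coefficient are arbitrary illustrative values):

```python
import jax.numpy as jnp

E, K, coef = 8, 2, 0.01  # experts, experts-per-token, loss coefficient

# Perfectly balanced routing: every expert gets an equal share.
expert_loads = jnp.full((E,), 1.0 / E)
p = jnp.full((E,), 1.0 / E)

# This PR's formulation: f = loads * E/K  ->  layer loss = coef / K
f_pr = expert_loads * (E / K)
loss_pr = coef * jnp.sum(f_pr * p)

# Switch/OlmoE formulation: f = K/E, with an extra factor of E  ->  coef * K
f_switch = jnp.full((E,), K / E)
loss_switch = coef * E * jnp.sum(f_switch * p)

print(float(loss_pr), coef / K)      # 0.005 0.005 — shrinks as K grows
print(float(loss_switch), coef * K)  # 0.02  0.02  — grows as K grows
```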
This is mostly relevant for paper replication; we can self-solve for a formulation that enables a constant coefficient across scales.
Review comment on:

```python
load_balancing_loss_coef = (
    0.0 if self.config.load_balancing_loss_coef is None else self.config.load_balancing_loss_coef
)
router_z_loss_coef = 0.0 if self.config.router_z_loss_coef is None else self.config.router_z_loss_coef
aux_loss = load_balancing_loss_coef * jnp.sum(router_metrics["load_balancing_loss_per_layer"]) + (
    …
```
Summing here seems to match the literature, but I'm viewing this as an open design choice for now.
## Summary
- restore MoE router auxiliary metrics/loss logging in `experiments/grug/moe/model.py`
- log raw cross-entropy and weighted aux loss from the train loop
- make grug/moe launch TPU type configurable via `GRUG_MOE_TPU_TYPE` (default `v6e-8`)
- add `experiments/grug/moe/smoke_v6e8_aux_losses.py` for small aux-loss smoke launches
- merge latest `origin/main` into this branch

## Validation
- `./infra/pre-commit.py --all-files`

Fixes #3196

---------

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
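A minimal sketch of the `GRUG_MOE_TPU_TYPE` override described above (the surrounding launch code is hypothetical; only the env var name and default come from the PR description):

```python
import os

# TPU slice type for the launch, overridable from the environment.
# Default matches the PR description: v6e-8.
tpu_type = os.environ.get("GRUG_MOE_TPU_TYPE", "v6e-8")
print(f"launching on {tpu_type}")
```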