
Commit 71335f7

fix(examples): unblock Megatron TP notebook on GPU E2E (#3434)
* fix(e2e): bump Megatron notebook Complete-wait timeout to 10m

  The Megatron TP notebook waits for the TrainJob to reach Complete with timeout=120. On the oracle-vm-gpu-a10-1 CI runner the happy-path time from TrainJob creation to Complete is ~96s (measured on the last passing GPU E2E run, 2026-04-15). Any runner slowdown, image-pull delay, or GPU-advertisement latency on top of that pushes the test past the 120s budget and papermill raises TimeoutError, even though the TrainJob is still on track to finish.

  Every GPU E2E run has been failing here since 2026-04-16 ~16:00 UTC on every branch, with no functional repo change between the last pass and the first failure. Bumping to 600s gives enough headroom for cold image pulls and scheduling variance without masking real failures (papermill's outer --execution-timeout is 1800s).

  Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* fix(examples): update Megatron tokenizer library to null-text for 0.17.0

  megatron-core 0.17.0 (published 2026-04-16 20:22 UTC) tightened the set of accepted tokenizer library names in MegatronTokenizer.from_pretrained. The bare "null" value is no longer accepted; the null-tokenizer keys are now "null-text" and "null-multimodal":

  0.16.1: if library not in ['byte-level', 'null']: assert tokenizer_path
  0.17.0: if library not in ['byte-level', 'null-text', 'null-multimodal']: assert tokenizer_path

  The notebook's call with metadata_path={"library": "null"} therefore triggers `AssertionError: Tokenizer path must be specified.` on the GPU E2E runner, which now installs 0.17.0 by default. Renaming the key to "null-text" routes through the same NullTokenizer(vocab_size) library class the old "null" key used, so behavior is unchanged on 0.17.0.

  This explains why GPU E2E ran green on 2026-04-15 (it installed 0.16.1) and started failing with the tokenizer assertion once 0.17.0 landed on PyPI. The previous commit's 120s -> 600s wait-timeout bump is what made this assertion visible; with the old 120s budget the notebook timed out before the TrainJob reached Failed, so the real error was hidden.

  Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

---------

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
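The version gate described in the commit message can be approximated with a small standalone sketch. This mirrors the two quoted conditionals only; it is not megatron-core's actual source, and `requires_tokenizer_path` is a hypothetical helper name:

```python
# Hypothetical stand-in for the tokenizer-library gate quoted above; the
# real check lives inside MegatronTokenizer.from_pretrained.
ALLOWED_0_16 = {"byte-level", "null"}
ALLOWED_0_17 = {"byte-level", "null-text", "null-multimodal"}

def requires_tokenizer_path(library: str, core_version: str) -> bool:
    """True when this library key would trip the
    'Tokenizer path must be specified.' assertion on that version."""
    # Lexical comparison is fine for these two versions; real code
    # should parse the version properly (e.g. packaging.version).
    allowed = ALLOWED_0_17 if core_version >= "0.17.0" else ALLOWED_0_16
    return library not in allowed

# The notebook's old key passes on 0.16.1 but fails on 0.17.0:
assert not requires_tokenizer_path("null", "0.16.1")
assert requires_tokenizer_path("null", "0.17.0")
# The renamed key is accepted by 0.17.0:
assert not requires_tokenizer_path("null-text", "0.17.0")
```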
1 parent 9c598fb commit 71335f7

1 file changed

Lines changed: 2 additions & 2 deletions


examples/megatron/tensor-parallelism/megatron-core-gpt-tp.ipynb

```diff
@@ -180,7 +180,7 @@
     "      reset_attention_mask=False,\n",
     "      eod_mask_loss=False,\n",
     "      tokenizer=MegatronTokenizer.from_pretrained(\n",
-    "          metadata_path={\"library\": \"null\"},\n",
+    "          metadata_path={\"library\": \"null-text\"},\n",
     "          vocab_size=_SEQUENCE_LENGTH,\n",
     "      ),\n",
     "      mid_level_dataset_surplus=0.005,\n",
@@ -397,7 +397,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "client.wait_for_job_status(name=job_name, timeout=120)"
+    "client.wait_for_job_status(name=job_name, timeout=600)"
    ]
   },
   {
```
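If a notebook had to run against both megatron-core lines, it could select the key for whichever version is installed. This is a hypothetical helper, not part of the commit (the notebook itself simply hardcodes "null-text"); it encodes the rename described in the commit message:

```python
def null_tokenizer_library(core_version: str) -> str:
    """Return the null-tokenizer library key for a megatron-core version.

    Hypothetical sketch: per the commit message, 0.17.0 renamed the
    'null' key to 'null-text'; earlier releases only accept 'null'.
    """
    major, minor = (int(p) for p in core_version.split(".")[:2])
    return "null-text" if (major, minor) >= (0, 17) else "null"

# e.g. null_tokenizer_library(importlib.metadata.version("megatron-core"))
assert null_tokenizer_library("0.16.1") == "null"
assert null_tokenizer_library("0.17.0") == "null-text"
```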
