
Commit 71335f7

fix(examples): unblock Megatron TP notebook on GPU E2E (#3434)
* fix(e2e): bump Megatron notebook Complete-wait timeout to 10m

  The Megatron TP notebook waits for the TrainJob to reach Complete with timeout=120. On the oracle-vm-gpu-a10-1 CI runner the happy-path time from TrainJob creation to Complete is ~96s (measured on the last passing GPU E2E run, 2026-04-15). Any runner slowdown, image-pull delay, or GPU-advertisement latency on top of that pushes the test past the 120s budget and papermill raises TimeoutError, even though the TrainJob is still on track to finish.

  Every GPU E2E run has been failing here since 2026-04-16 ~16:00 UTC on every branch, with no functional repo change between the last pass and the first failure. Bumping to 600s gives enough headroom for cold image pulls and scheduling variance without masking real failures (papermill's outer --execution-timeout is 1800s).

  Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* fix(examples): update Megatron tokenizer library to null-text for 0.17.0

  megatron-core 0.17.0 (published 2026-04-16 20:22 UTC) tightened the set of accepted tokenizer library names in MegatronTokenizer.from_pretrained. The bare "null" value is no longer accepted; the null-tokenizer keys are now "null-text" and "null-multimodal":

  0.16.1: if library not in ['byte-level', 'null']: assert tokenizer_path
  0.17.0: if library not in ['byte-level', 'null-text', 'null-multimodal']: assert tokenizer_path

  The notebook's call with metadata_path={"library": "null"} therefore triggers `AssertionError: Tokenizer path must be specified.` on the GPU E2E runner, which now installs 0.17.0 by default. Renaming the key to "null-text" routes through the same NullTokenizer(vocab_size) library class the old "null" key used, so behavior is unchanged on 0.17.0.

  This explains why GPU E2E ran green on 2026-04-15 (it installed 0.16.1) and started failing with the tokenizer assertion once 0.17.0 landed on PyPI. The previous commit's 120s -> 600s wait-timeout bump is what made this assertion visible; with the old 120s budget the notebook timed out before the TrainJob reached Failed, so the real error was hidden.

  Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

---------

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
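The version gate described in the commit message can be approximated with a small standalone sketch. This mirrors the two quoted conditionals only; it is not megatron-core's actual source, and `requires_tokenizer_path` is a hypothetical helper name:

```python
# Hypothetical stand-in for the tokenizer-library gate quoted above; the
# real check lives inside MegatronTokenizer.from_pretrained.
ALLOWED_0_16 = {"byte-level", "null"}
ALLOWED_0_17 = {"byte-level", "null-text", "null-multimodal"}

def requires_tokenizer_path(library: str, core_version: str) -> bool:
    """True when this library key would trip the
    'Tokenizer path must be specified.' assertion on that version."""
    # Lexical comparison is fine for these two versions; real code
    # should parse the version properly (e.g. packaging.version).
    allowed = ALLOWED_0_17 if core_version >= "0.17.0" else ALLOWED_0_16
    return library not in allowed

# The notebook's old key passes on 0.16.1 but fails on 0.17.0:
assert not requires_tokenizer_path("null", "0.16.1")
assert requires_tokenizer_path("null", "0.17.0")
# The renamed key is accepted by 0.17.0:
assert not requires_tokenizer_path("null-text", "0.17.0")
```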
1 parent 9c598fb commit 71335f7

1 file changed

Lines changed: 2 additions & 2 deletions


examples/megatron/tensor-parallelism/megatron-core-gpt-tp.ipynb

```diff
@@ -180,7 +180,7 @@
     "      reset_attention_mask=False,\n",
     "      eod_mask_loss=False,\n",
     "      tokenizer=MegatronTokenizer.from_pretrained(\n",
-    "          metadata_path={\"library\": \"null\"},\n",
+    "          metadata_path={\"library\": \"null-text\"},\n",
     "          vocab_size=_SEQUENCE_LENGTH,\n",
     "      ),\n",
     "      mid_level_dataset_surplus=0.005,\n",
@@ -397,7 +397,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "client.wait_for_job_status(name=job_name, timeout=120)"
+    "client.wait_for_job_status(name=job_name, timeout=600)"
    ]
   },
   {
```
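If a notebook had to run against both megatron-core lines, it could select the key for whichever version is installed. This is a hypothetical helper, not part of the commit (the notebook itself simply hardcodes "null-text"); it encodes the rename described in the commit message:

```python
def null_tokenizer_library(core_version: str) -> str:
    """Return the null-tokenizer library key for a megatron-core version.

    Hypothetical sketch: per the commit message, 0.17.0 renamed the
    'null' key to 'null-text'; earlier releases only accept 'null'.
    """
    major, minor = (int(p) for p in core_version.split(".")[:2])
    return "null-text" if (major, minor) >= (0, 17) else "null"

# e.g. null_tokenizer_library(importlib.metadata.version("megatron-core"))
assert null_tokenizer_library("0.16.1") == "null"
assert null_tokenizer_library("0.17.0") == "null-text"
```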
