Description
Hey, I'm trying to run GPT-2 pretraining with Megatron in the NeMo 25.11 container, and I keep getting this error no matter what flags I pass. I've tried different combinations of --tokenizer-type, --dataloader-type, and different TP/PP configs; every single run crashes at the same place.
I'm running on 4x L40 GPUs under Apptainer, training on WikiText-103.
Error
[rank0]: File "/opt/megatron-lm/megatron/training/utils.py", line 562, in get_batch_on_this_tp_rank
[rank0]: _broadcast_cu_seqlens(batch['cu_seqlens'])
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: UnboundLocalError: cannot access local variable '_broadcast_cu_seqlens' where it is not associated with a value
All 4 ranks crash with the same error.
How to reproduce
torchrun --nproc_per_node=4 /opt/megatron-lm/pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 2 \
--global-batch-size 16 \
--train-iters 50 \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--bf16 \
--tokenizer-type GPT2BPETokenizer \
--dataloader-type single \
--data-path /path/to/gpt2_wikitext103_text_document \
--vocab-file /path/to/gpt2-vocab.json \
--merge-file /path/to/gpt2-merges.txt \
--split 949,50,1
Environment
- Container: nvcr.io/nvidia/nemo:25.11 (build 259213668)
- GPUs: 4x NVIDIA L40
- Runtime: Apptainer
What I think might be going on
So I was working through this with Claude, and it pointed out what looks like a Python scoping bug in get_batch_on_this_tp_rank() in megatron/training/utils.py. I'm not 100% sure whether this is actually a bug or I'm just passing the wrong arguments, but here's what Claude found:
The function has two branches — an if data is not None branch (for the TP rank that has the data) and an else branch (for ranks that receive data via broadcast).
In the else branch at line 620, there's a local function definition:

def _broadcast_cu_seqlens():  # no arguments; creates empty tensors to receive the broadcast
    ...

But in the if branch at lines 562 and 569, the code calls:

_broadcast_cu_seqlens(batch['cu_seqlens'])  # with an argument

Claude explained that because Python sees the def at line 620 when it compiles the function, it treats _broadcast_cu_seqlens as a local variable for the entire enclosing function. So when the if branch runs first and tries to call it at line 562, the name is marked local but hasn't been assigned yet → UnboundLocalError.
If this is right, then no CLI flags would fix it: every TP source rank takes the if branch, and with TP=1 that's every rank. The batch always has a cu_seqlens key, and the call is unconditional.
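The scoping behavior is easy to reproduce outside Megatron. Here's a minimal standalone sketch (function and argument names are made up for illustration, not the real code) that fails the same way:

```python
# Minimal reproduction of the scoping issue described above
# (hypothetical names and simplified logic, not the actual Megatron code).
def get_batch(has_data):
    if has_data:
        # Python already saw the `def` in the else branch below when it
        # compiled this function, so `_broadcast_cu_seqlens` is a local
        # name throughout -- but nothing has assigned it on this path.
        _broadcast_cu_seqlens("cu_seqlens")
    else:
        def _broadcast_cu_seqlens():  # no-arg version, bound only here
            pass
        _broadcast_cu_seqlens()

try:
    get_batch(True)
except UnboundLocalError as e:
    print(type(e).__name__)  # UnboundLocalError
```

Calling `get_batch(False)` works fine, which matches why only ranks taking the if branch crash.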
Possible fix
Claude suggested replacing lines 562 and 569 with _broadcast(batch['cu_seqlens']) to match the pattern used for every other tensor in that branch:
if args.pipeline_model_parallel_size == 1 or mtp_on_this_rank:
    _broadcast(batch['tokens'])
    _broadcast(batch['labels'])
    _broadcast(batch['loss_mask'])
    _broadcast(batch['attention_mask'])
    _broadcast(batch['position_ids'])
-   _broadcast_cu_seqlens(batch['cu_seqlens'])
+   _broadcast(batch['cu_seqlens'])
    _broadcast(batch['max_seqlen'])
elif mpu.is_pipeline_first_stage():
    _broadcast(batch['tokens'])
    _broadcast(batch['attention_mask'])
    _broadcast(batch['position_ids'])
-   _broadcast_cu_seqlens(batch['cu_seqlens'])
+   _broadcast(batch['cu_seqlens'])
    _broadcast(batch['max_seqlen'])

Again, I'm not sure if this is the right fix or if I'm just doing something wrong on my end. Any guidance appreciated!
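For what it's worth, here's a toy sketch (hypothetical names again, not the actual Megatron code) of why this fix pattern avoids the error: the shared helper is bound before either branch runs, so the name is assigned on every execution path.

```python
# Toy sketch of the suggested fix pattern (hypothetical names, not the
# actual Megatron code): one shared helper, bound before either branch
# runs, so the name is assigned on every execution path.
def get_batch(has_data):
    def _broadcast(x):  # stands in for the real broadcast helper
        return x
    if has_data:
        return _broadcast("tensor-with-data")
    return _broadcast("empty-receive-buffer")

print(get_batch(True))   # tensor-with-data
print(get_batch(False))  # empty-receive-buffer
```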
Workaround (for now)
Patching the file and bind-mounting it over the container version gets training running:
# Copy the original file out of the container
# (exec rather than run, so cat is invoked directly and the redirect
# lands on the host)
apptainer exec nemo.sif \
    cat /opt/megatron-lm/megatron/training/utils.py > utils_patched.py

# Replace the broken calls with the shared _broadcast helper
sed -i 's/_broadcast_cu_seqlens(batch\[.cu_seqlens.\])/_broadcast(batch["cu_seqlens"])/g' \
    utils_patched.py

# Bind-mount the patched file over the container's copy
apptainer run --nv \
    --bind utils_patched.py:/opt/megatron-lm/megatron/training/utils.py \
    nemo.sif \
    python train.py