UnboundLocalError: _broadcast_cu_seqlens when running pretrain_gpt.py in nemo:25.11 container #3848

@Wafik20

Description

Hi, I'm trying to run GPT-2 pretraining with Megatron using the nemo 25.11 container, and I keep hitting this error no matter what flags I pass. I've tried different combinations of --tokenizer-type and --dataloader-type and different TP/PP configs; every single run crashes at the same place.

I'm running on 4x L40 GPUs via Apptainer, training on wikitext-103.

Error

[rank0]: File "/opt/megatron-lm/megatron/training/utils.py", line 562, in get_batch_on_this_tp_rank
[rank0]:     _broadcast_cu_seqlens(batch['cu_seqlens'])
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^
[rank0]: UnboundLocalError: cannot access local variable '_broadcast_cu_seqlens' where it is not associated with a value

All 4 ranks crash with the same error.

How to reproduce

torchrun --nproc_per_node=4 /opt/megatron-lm/pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 2 \
    --global-batch-size 16 \
    --train-iters 50 \
    --lr 0.0001 \
    --min-lr 0.00001 \
    --lr-decay-style cosine \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --bf16 \
    --tokenizer-type GPT2BPETokenizer \
    --dataloader-type single \
    --data-path /path/to/gpt2_wikitext103_text_document \
    --vocab-file /path/to/gpt2-vocab.json \
    --merge-file /path/to/gpt2-merges.txt \
    --split 949,50,1

Environment

  • Container: nvcr.io/nvidia/nemo:25.11 (build 259213668)
  • GPUs: 4x NVIDIA L40
  • Runtime: Apptainer

What I think might be going on

I worked through this with Claude, and it pointed out what looks like a Python scoping bug in get_batch_on_this_tp_rank() in megatron/training/utils.py. I'm not 100% sure whether this is actually a bug or I'm just passing the wrong arguments, but here's what Claude found:

The function has two branches — an if data is not None branch (for the TP rank that has the data) and an else branch (for ranks that receive data via broadcast).

In the else branch at line 620, there's a local function definition:

def _broadcast_cu_seqlens():   # no arguments, creates empty tensors to receive broadcast
    ...

But in the if branch at lines 562 and 569, the code calls:

_broadcast_cu_seqlens(batch['cu_seqlens'])   # with an argument

Claude explained that because Python sees the def at line 620, it treats _broadcast_cu_seqlens as a local variable for the entire enclosing function. So when the if branch runs first and tries to call it at line 562, it's marked as local but hasn't been assigned yet → UnboundLocalError.
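The behavior described above is easy to reproduce in isolation. Here's a minimal standalone sketch (the names are illustrative, not Megatron's actual code):

```python
def outer(has_data):
    if has_data:
        # Python decides at compile time that _helper is local to outer(),
        # because of the def in the else branch below. On this path it has
        # never been assigned, so the call raises UnboundLocalError.
        _helper("payload")
    else:
        def _helper():
            pass
        _helper()

err = None
try:
    outer(True)  # same shape as a TP source rank taking the `if` branch
except UnboundLocalError as e:
    err = e
print(type(err).__name__)  # UnboundLocalError
```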

If this is right, then no CLI flags would fix it because every TP source rank takes the if branch, and with TP=1 that's every rank. The batch always has a cu_seqlens key and the call is unconditional.

Possible fix

Claude suggested replacing lines 562 and 569 with _broadcast(batch['cu_seqlens']) to match the pattern used for every other tensor in that branch:

         if args.pipeline_model_parallel_size == 1 or mtp_on_this_rank:
             _broadcast(batch['tokens'])
             _broadcast(batch['labels'])
             _broadcast(batch['loss_mask'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
-            _broadcast_cu_seqlens(batch['cu_seqlens'])
+            _broadcast(batch['cu_seqlens'])
             _broadcast(batch['max_seqlen'])

         elif mpu.is_pipeline_first_stage():
             _broadcast(batch['tokens'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
-            _broadcast_cu_seqlens(batch['cu_seqlens'])
+            _broadcast(batch['cu_seqlens'])
             _broadcast(batch['max_seqlen'])

Again, I'm not sure whether this is the right fix or I'm just doing something wrong on my end. Any guidance appreciated!

Workaround (for now)

Patching the file and bind-mounting it over the container version gets training running:

apptainer run --nv nemo.sif \
    cat /opt/megatron-lm/megatron/training/utils.py > utils_patched.py

sed -i 's/_broadcast_cu_seqlens(batch\[.cu_seqlens.\])/_broadcast(batch["cu_seqlens"])/g' \
    utils_patched.py

apptainer run --nv \
    --bind utils_patched.py:/opt/megatron-lm/megatron/training/utils.py \
    nemo.sif \
    python train.py
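
For anyone who prefers not to use sed, the same one-line rewrite can be done with a small Python helper (the function name is mine; the file path is the same as above):

```python
import re

def patch_broadcast_call(source: str) -> str:
    """Rewrite _broadcast_cu_seqlens(batch['cu_seqlens']) calls to _broadcast(...)."""
    return re.sub(
        r"_broadcast_cu_seqlens\(batch\[.cu_seqlens.\]\)",
        "_broadcast(batch['cu_seqlens'])",
        source,
    )

# usage, with the file extracted from the container as in the workaround above:
# src = open("utils_patched.py").read()
# open("utils_patched.py", "w").write(patch_broadcast_call(src))
```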
