UnboundLocalError: _broadcast_cu_seqlens when running pretrain_gpt.py in nemo:25.11 container #3848

@Wafik20

Description

Hi, I'm trying to run GPT-2 pretraining with Megatron using the nemo 25.11 container, and I keep hitting this error no matter what flags I pass. I've tried different combinations of --tokenizer-type and --dataloader-type and different TP/PP configs; every single run crashes at the same place.

I'm running on 4x L40 GPUs via Apptainer, training on wikitext-103.

Error

[rank0]: File "/opt/megatron-lm/megatron/training/utils.py", line 562, in get_batch_on_this_tp_rank
[rank0]:     _broadcast_cu_seqlens(batch['cu_seqlens'])
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^
[rank0]: UnboundLocalError: cannot access local variable '_broadcast_cu_seqlens' where it is not associated with a value

All 4 ranks crash with the same error.

How to reproduce

torchrun --nproc_per_node=4 /opt/megatron-lm/pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 12 \
    --hidden-size 768 \
    --num-attention-heads 12 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --micro-batch-size 2 \
    --global-batch-size 16 \
    --train-iters 50 \
    --lr 0.0001 \
    --min-lr 0.00001 \
    --lr-decay-style cosine \
    --lr-warmup-fraction 0.01 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --bf16 \
    --tokenizer-type GPT2BPETokenizer \
    --dataloader-type single \
    --data-path /path/to/gpt2_wikitext103_text_document \
    --vocab-file /path/to/gpt2-vocab.json \
    --merge-file /path/to/gpt2-merges.txt \
    --split 949,50,1

Environment

  • Container: nvcr.io/nvidia/nemo:25.11 (build 259213668)
  • GPUs: 4x NVIDIA L40
  • Runtime: Apptainer

What I think might be going on

I worked through this with Claude, and it pointed out what looks like a Python scoping bug in get_batch_on_this_tp_rank() in megatron/training/utils.py. I'm not 100% sure whether this is actually a bug or I'm just passing the wrong arguments, but here's what Claude found:

The function has two branches — an if data is not None branch (for the TP rank that has the data) and an else branch (for ranks that receive data via broadcast).

In the else branch at line 620, there's a local function definition:

def _broadcast_cu_seqlens():   # no arguments, creates empty tensors to receive broadcast
    ...

But in the if branch at lines 562 and 569, the code calls:

_broadcast_cu_seqlens(batch['cu_seqlens'])   # with an argument

Claude explained that because Python sees the def at line 620, it treats _broadcast_cu_seqlens as a local variable for the entire enclosing function. So when the if branch runs first and tries to call it at line 562, it's marked as local but hasn't been assigned yet → UnboundLocalError.
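The behavior described above is easy to reproduce in isolation. Here's a minimal standalone sketch (the names are illustrative, not Megatron's actual code):

```python
def outer(has_data):
    if has_data:
        # Python decides at compile time that _helper is local to outer(),
        # because of the def in the else branch below. On this path it has
        # never been assigned, so the call raises UnboundLocalError.
        _helper("payload")
    else:
        def _helper():
            pass
        _helper()

err = None
try:
    outer(True)  # same shape as a TP source rank taking the `if` branch
except UnboundLocalError as e:
    err = e
print(type(err).__name__)  # UnboundLocalError
```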

If this is right, then no CLI flags would fix it because every TP source rank takes the if branch, and with TP=1 that's every rank. The batch always has a cu_seqlens key and the call is unconditional.

Possible fix

Claude suggested replacing lines 562 and 569 with _broadcast(batch['cu_seqlens']) to match the pattern used for every other tensor in that branch:

         if args.pipeline_model_parallel_size == 1 or mtp_on_this_rank:
             _broadcast(batch['tokens'])
             _broadcast(batch['labels'])
             _broadcast(batch['loss_mask'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
-            _broadcast_cu_seqlens(batch['cu_seqlens'])
+            _broadcast(batch['cu_seqlens'])
             _broadcast(batch['max_seqlen'])

         elif mpu.is_pipeline_first_stage():
             _broadcast(batch['tokens'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
-            _broadcast_cu_seqlens(batch['cu_seqlens'])
+            _broadcast(batch['cu_seqlens'])
             _broadcast(batch['max_seqlen'])

Again, I'm not sure whether this is the right fix or I'm just doing something wrong on my end. Any guidance appreciated!

Workaround (for now)

Patching the file and bind-mounting it over the container version gets training running:

apptainer run --nv nemo.sif \
    cat /opt/megatron-lm/megatron/training/utils.py > utils_patched.py

sed -i 's/_broadcast_cu_seqlens(batch\[.cu_seqlens.\])/_broadcast(batch["cu_seqlens"])/g' \
    utils_patched.py

apptainer run --nv \
    --bind utils_patched.py:/opt/megatron-lm/megatron/training/utils.py \
    nemo.sif \
    python train.py
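
For anyone who prefers not to use sed, the same one-line rewrite can be done with a small Python helper (the function name is mine; the file path is the same as above):

```python
import re

def patch_broadcast_call(source: str) -> str:
    """Rewrite _broadcast_cu_seqlens(batch['cu_seqlens']) calls to _broadcast(...)."""
    return re.sub(
        r"_broadcast_cu_seqlens\(batch\[.cu_seqlens.\]\)",
        "_broadcast(batch['cu_seqlens'])",
        source,
    )

# usage, with the file extracted from the container as in the workaround above:
# src = open("utils_patched.py").read()
# open("utils_patched.py", "w").write(patch_broadcast_call(src))
```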
