fix: position-ids in qwen3-next#1767

Draft
akoumpa wants to merge 3 commits into main from akoumparouli/fix_pos_ids_in_qwen3_next
Conversation

@akoumpa
Contributor

@akoumpa akoumpa commented Apr 10, 2026

What does this PR do?

torchrun --nproc-per-node 8 nemo_automodel/recipes/llm/benchmark.py --config examples/benchmark/configs/qwen3_next_te_deepep.yaml --model.config.num_hidden_layers 8 --distributed.ep_size 8

The command above benchmarks Qwen3-Next-80B-A3B-Instruct with TE + DeepEP on 8 GPUs. It crashes on the very first forward pass.

Traceback:

benchmark.py:246      → _forward_backward_step(...)
train_ft.py:1343      → out = model(**batch)
qwen3_next/model.py:296 → self.model(...)
qwen3_next/model.py:202 → layer(...)
qwen3_next/model.py:77  → self.linear_attn(
    ..., position_ids=position_ids)   ← CRASH

TypeError: Qwen3NextGatedDeltaNet.forward() got an unexpected keyword argument 'position_ids'
Root cause: In nemo_automodel/components/models/qwen3_next/model.py:77, the decoder layer passes position_ids to self.linear_attn(). But Qwen3NextGatedDeltaNet.forward() doesn't accept that parameter — gated delta-nets encode position implicitly and have a different signature than standard attention.

For reference, Qwen3NextGatedDeltaNet.forward() accepts only:

(self, hidden_states, cache_params=None, cache_position=None, attention_mask=None)
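A minimal sketch of the kind of fix this implies: only forward position_ids to attention modules whose signature accepts it. The helper name and the two stand-in classes below are illustrative (they mirror the signatures quoted above), not the actual NeMo Automodel code:

```python
import inspect

def call_attention(attn_module, hidden_states, position_ids=None, **kwargs):
    """Call attn_module, passing position_ids only if its forward()
    signature declares that keyword. Gated delta-net layers encode
    position implicitly and have no such parameter."""
    params = inspect.signature(attn_module.forward).parameters
    if position_ids is not None and "position_ids" in params:
        kwargs["position_ids"] = position_ids
    return attn_module(hidden_states, **kwargs)

class GatedDeltaNet:
    # Mirrors the signature quoted above: no position_ids parameter.
    def forward(self, hidden_states, cache_params=None,
                cache_position=None, attention_mask=None):
        return ("delta", hidden_states)
    __call__ = forward

class FullAttention:
    # Standard attention: position_ids is accepted explicitly.
    def forward(self, hidden_states, position_ids=None, attention_mask=None):
        return ("attn", hidden_states, position_ids)
    __call__ = forward
```

With this dispatch, the decoder layer can pass position_ids unconditionally and the delta-net path silently drops it instead of raising TypeError.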

Changelog

  • Add specific line by line info of high level changes in this PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa akoumpa added the r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. label Apr 10, 2026
@akoumpa
Contributor Author

akoumpa commented Apr 10, 2026

/ok to test 66a2804

The megatron_fsdp_strategy_parallelize function was missing the call
to _update_attention_head_counts_for_tp after applying tensor
parallelism via parallelize_module. The FSDP2 path
(DefaultParallelizationStrategy.parallelize) already performs this
update, but the MegatronFSDP path did not. Without this update,
attention modules retain global head counts after TP sharding, which
can cause incorrect GQA behavior and DTensor shape mismatches that
manifest as NCCL collective hangs.
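The missing call amounts to rescaling per-module head counts after TP sharding: each rank then owns 1/tp_size of the query and key/value heads. A minimal sketch of what such an update does (the function name and dict keys here are illustrative; the real _update_attention_head_counts_for_tp operates on the model's attention modules):

```python
def update_attention_head_counts_for_tp(attn_cfg: dict, tp_size: int) -> dict:
    """Divide global head counts by the TP degree so each rank's
    attention config reflects its local shard. Keeping the global
    counts after sharding is what breaks GQA grouping and produces
    DTensor shape mismatches."""
    for key in ("num_attention_heads", "num_key_value_heads"):
        if attn_cfg[key] % tp_size != 0:
            raise ValueError(f"{key}={attn_cfg[key]} is not divisible by tp_size={tp_size}")
        attn_cfg[key] //= tp_size
    return attn_cfg
```

For example, 32 query heads and 8 KV heads under tp_size=8 become 4 and 1 per rank, preserving the 4:1 GQA ratio locally.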

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa
Contributor Author

akoumpa commented Apr 12, 2026

/ok to test db1f660

@akoumpa
Contributor Author

akoumpa commented Apr 13, 2026

/ok to test ce0d6ab
