Qwen parallelizer with sequence parallelism #652

@akoumpa

Description

Describe the bug

With sequence_parallel = True, the embed_tokens output is sharded along the sequence dimension, see here. If position_ids is not passed as an explicit argument to forward, it is derived from cache_position (see https://github.com/huggingface/transformers/blob/e20df45bf676d80bdddb9757eeeafe6c0c81ecfa/src/transformers/models/qwen3/modeling_qwen3.py#L380-L381); however, cache_position's shape is based on the local (sharded) embed_tokens shape, so the resulting position_ids only cover the local shard rather than the positions in the global sequence.
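For illustration, here is a minimal sketch of the shape mismatch. It assumes a hypothetical two-rank sequence-parallel split and paraphrases the fallback path from the linked modeling_qwen3.py lines (it is not the actual modeling code): each rank derives positions from its local shard length instead of from its offset in the global sequence.

```python
# Sketch only: shows why position_ids derived from the *local* (sequence-sharded)
# embeddings shape diverge from the global positions. The sequence length,
# world size, and hidden size below are hypothetical example values.
import torch

global_seq_len = 8          # full sequence length before sharding (assumed)
sp_world_size = 2           # sequence-parallel group size (assumed)
local_seq_len = global_seq_len // sp_world_size
hidden_size = 4

for rank in range(sp_world_size):
    # embed_tokens output for this rank is sharded on the sequence dimension
    local_inputs_embeds = torch.zeros(1, local_seq_len, hidden_size)

    # Paraphrase of the fallback when cache_position / position_ids are not
    # passed explicitly: both are built from the local embeddings shape.
    past_seen_tokens = 0
    cache_position = torch.arange(
        past_seen_tokens, past_seen_tokens + local_inputs_embeds.shape[1]
    )
    position_ids = cache_position.unsqueeze(0)

    # Every rank ends up with positions [0 .. local_seq_len-1] instead of its
    # slice of the global positions, e.g. rank 1 should see [4 .. 7] here.
    expected = torch.arange(rank * local_seq_len, (rank + 1) * local_seq_len).unsqueeze(0)
    print(f"rank {rank}: derived {position_ids.tolist()} expected {expected.tolist()}")
```

As a workaround (assuming the caller knows each rank's offset into the global sequence), position_ids can be computed over the full sequence and the appropriate slice passed explicitly to forward, so the fallback path above is never taken.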

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

Labels

bug (Something isn't working)
