Qwen parallelizer with sequence parallelism #652

@akoumpa

Description

Describe the bug

With sequence_parallel = True, the embed_tokens output is sharded along the sequence dimension, see here. If position_ids is not passed as an explicit argument to forward, it is derived from cache_position (see https://github.com/huggingface/transformers/blob/e20df45bf676d80bdddb9757eeeafe6c0c81ecfa/src/transformers/models/qwen3/modeling_qwen3.py#L380-L381); however, cache_position's shape is based on the local (sharded) embed_tokens shape, so the resulting position_ids only cover the local shard rather than the positions in the global sequence.
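For illustration, here is a minimal sketch of the shape mismatch. It assumes a hypothetical two-rank sequence-parallel split and paraphrases the fallback path from the linked modeling_qwen3.py lines (it is not the actual modeling code): each rank derives positions from its local shard length instead of from its offset in the global sequence.

```python
# Sketch only: shows why position_ids derived from the *local* (sequence-sharded)
# embeddings shape diverge from the global positions. The sequence length,
# world size, and hidden size below are hypothetical example values.
import torch

global_seq_len = 8          # full sequence length before sharding (assumed)
sp_world_size = 2           # sequence-parallel group size (assumed)
local_seq_len = global_seq_len // sp_world_size
hidden_size = 4

for rank in range(sp_world_size):
    # embed_tokens output for this rank is sharded on the sequence dimension
    local_inputs_embeds = torch.zeros(1, local_seq_len, hidden_size)

    # Paraphrase of the fallback when cache_position / position_ids are not
    # passed explicitly: both are built from the local embeddings shape.
    past_seen_tokens = 0
    cache_position = torch.arange(
        past_seen_tokens, past_seen_tokens + local_inputs_embeds.shape[1]
    )
    position_ids = cache_position.unsqueeze(0)

    # Every rank ends up with positions [0 .. local_seq_len-1] instead of its
    # slice of the global positions, e.g. rank 1 should see [4 .. 7] here.
    expected = torch.arange(rank * local_seq_len, (rank + 1) * local_seq_len).unsqueeze(0)
    print(f"rank {rank}: derived {position_ids.tolist()} expected {expected.tolist()}")
```

As a workaround (assuming the caller knows each rank's offset into the global sequence), position_ids can be computed over the full sequence and the appropriate slice passed explicitly to forward, so the fallback path above is never taken.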

Steps/Code to reproduce bug

Please list minimal steps or code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

Labels

bug (Something isn't working)
