Describe the bug
With `sequence_parallel = True`, the output of `embed_tokens` is sharded along the sequence dimension (see here). If `position_ids` is not passed as an explicit argument to `forward`, it is created from the shape of `cache_position` (https://github.com/huggingface/transformers/blob/e20df45bf676d80bdddb9757eeeafe6c0c81ecfa/src/transformers/models/qwen3/modeling_qwen3.py#L380-L381). However, `cache_position` itself is built from the shape of the local (sharded) `embed_tokens` output, so the derived `position_ids` only cover the local shard's sequence length rather than the global positions of the full sequence.
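A minimal sketch of the failure mode, assuming a sequence-parallel world size of 2; the `derive_position_ids` helper below paraphrases the linked `modeling_qwen3.py` logic and is illustrative only, not part of either codebase:

```python
import torch

# Paraphrase of the position_ids derivation in Qwen3Model.forward
# (linked lines above); KV cache handling omitted for brevity.
def derive_position_ids(inputs_embeds, cache_position=None, position_ids=None):
    if cache_position is None:
        # inputs_embeds.shape[1] is the *local* sequence length once the
        # embeddings have been sharded along the sequence dimension
        cache_position = torch.arange(inputs_embeds.shape[1], device=inputs_embeds.device)
    if position_ids is None:
        position_ids = cache_position.unsqueeze(0)
    return position_ids

# Hypothetical setup: a global sequence of 8 tokens split across 2
# sequence-parallel ranks, so each rank holds a (batch=1, seq=4, hidden=16)
# shard of the embedding output.
local_embeds = torch.randn(1, 8 // 2, 16)

print(derive_position_ids(local_embeds))
# tensor([[0, 1, 2, 3]]) on every rank, whereas rank 1 should see positions 4..7
```

Passing `position_ids` (or `cache_position`) computed from the global sequence avoids the mismatch, which is why the problem only shows up when they are left to be inferred.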
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.