I'm trying to fine-tune Qwen3.5 models (e.g. Qwen/Qwen3.5-9B) using qwen-vl-finetune.
Model loading works via AutoModelForImageTextToText.from_pretrained(), which resolves to Qwen3_5ForConditionalGeneration. The data processor, however, computes position IDs using get_rope_index_3 when model_type="qwen3vl".
Qwen3.5's architecture differs from Qwen3-VL's: it uses Gated DeltaNet and partial RoPE (partial_rotary_factor=0.25), whereas Qwen3-VL uses 3D mRoPE with temporal/spatial sections. Is get_rope_index_3 correct for Qwen3.5, or does it need a dedicated position ID function?
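To make the concern concrete, here is a minimal sketch of the consistency check I have in mind. It is not taken from qwen-vl-finetune or transformers: the helper names are my own, and the head_dim and mrope_section values are illustrative assumptions rather than values read from either model's config. The idea is that each rotary frequency covers a pair of channels, so the mRoPE sections should sum to half of the channels that actually receive rotary embedding.

```python
from typing import Sequence


def rotary_dims(head_dim: int, partial_rotary_factor: float = 1.0) -> int:
    """Channels per attention head that receive rotary embedding."""
    return int(head_dim * partial_rotary_factor)


def mrope_sections_fit(
    head_dim: int,
    partial_rotary_factor: float,
    mrope_section: Sequence[int],
) -> bool:
    """Check that the (temporal, height, width) sections cover the rotary channels.

    Each rotary frequency rotates one pair of channels, so the sections
    must sum to half the number of rotary channels.
    """
    return sum(mrope_section) == rotary_dims(head_dim, partial_rotary_factor) // 2


# Hypothetical values for illustration only (not read from a real config).
full_rope = mrope_sections_fit(128, 1.0, [24, 20, 20])       # 64 == 128 // 2
partial_rope = mrope_sections_fit(256, 0.25, [24, 20, 20])   # 64 != 64 // 2
```

With these assumed numbers, a section layout sized for full rotary embedding no longer fits once only a quarter of the head dimension is rotated, which is why I'm unsure whether the position IDs produced by get_rope_index_3 are interpreted the same way by Qwen3.5.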