The Qwen3-VL technical report states the vision encoder is "initialized from official pretrained [SigLIP2] checkpoints." However, SigLIP2's patch embedding is a Conv2d with weight shape (1152, 3, 14, 14), while Qwen3-VL's patch_embed.proj is a Conv3d with weight shape (1152, 3, 2, 14, 14), so the two weights are shape-incompatible and cannot be copied directly.
How was this layer specifically handled during initialization? For example:
(a) Randomly initialized (Kaiming/Glorot), with the rest of SigLIP2 transferred normally
(b) Inflated from SigLIP2 via repeat-and-rescale along the temporal axis (e.g. siglip_weight.unsqueeze(2).repeat(1,1,2,1,1) / 2)
(c) Something else
Were multiple strategies tested, and did the choice matter for training stability or final performance?
This is undocumented in the report and hasn't been discussed publicly. An answer would be useful for practitioners adapting other image encoders for video.
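
For concreteness, here is a minimal sketch of what option (b) would look like in PyTorch. It is only an illustration of the repeat-and-rescale idea, not a claim about what Qwen3-VL actually does. The helper name inflate_conv2d_to_conv3d, the reduced channel count (8 instead of 1152), and the choice to copy the bias unchanged are all assumptions made for this example.

```python
import torch
import torch.nn as nn


def inflate_conv2d_to_conv3d(weight_2d: torch.Tensor, temporal_size: int = 2) -> torch.Tensor:
    """Hypothetical helper: repeat a Conv2d kernel along a new temporal axis
    and divide by the number of repeats (option (b) above)."""
    # (out, in, kH, kW) -> (out, in, T, kH, kW), scaled by 1/T
    return weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1) / temporal_size


torch.manual_seed(0)

# Small stand-in layers; the real shapes in the question use 1152 output channels.
conv2d = nn.Conv2d(3, 8, kernel_size=14, stride=14)
conv3d = nn.Conv3d(3, 8, kernel_size=(2, 14, 14), stride=(2, 14, 14))

with torch.no_grad():
    conv3d.weight.copy_(inflate_conv2d_to_conv3d(conv2d.weight))
    # Assumption: the bias is copied as-is, since it does not depend on the kernel depth.
    conv3d.bias.copy_(conv2d.bias)

# A still image, and the same image duplicated into a 2-frame "clip".
image = torch.randn(1, 3, 28, 28)
clip = image.unsqueeze(2).repeat(1, 1, 2, 1, 1)  # (1, 3, 2, 28, 28)

out_2d = conv2d(image)             # (1, 8, 2, 2)
out_3d = conv3d(clip).squeeze(2)   # (1, 8, 1, 2, 2) -> (1, 8, 2, 2)

# Largest elementwise difference between the 2D and inflated 3D patch embeddings.
max_abs_diff = (out_2d - out_3d).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The point of the 1/2 rescale is that, for an input whose two frames are identical, the inflated Conv3d reproduces the original Conv2d output up to floating-point error, so the pretrained features are preserved at initialization. Option (a) would not have this property.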