How was patch_embed.proj (Conv3d) initialized from SigLIP2 weights (Conv2d)? #2087

The Qwen3-VL technical report states the vision encoder is "initialized from official pretrained [SigLIP2] checkpoints." However, SigLIP2's patch embedding is a Conv2d with weight shape (1152, 3, 14, 14), while Qwen3-VL's patch_embed.proj is a Conv3d with weight shape (1152, 3, 2, 14, 14). The two shapes are incompatible, so the weights cannot be copied directly.
How was this layer specifically handled during initialization? For example:

(a) Randomly initialized (kaiming/glorot), with the rest of SigLIP2 transferred normally
(b) Inflated from SigLIP2 via repeat-and-rescale along the temporal axis (e.g. siglip_weight.unsqueeze(2).repeat(1,1,2,1,1) / 2)
(c) Something else
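
To make option (b) concrete, here is a minimal PyTorch sketch of the repeat-and-rescale inflation. It is only an illustration of what I mean, not a claim about what Qwen3-VL actually did. The helper name `inflate_conv2d_to_conv3d` is hypothetical, and the small tensor shapes are stand-ins for the real (1152, 3, 14, 14) weight. The check at the end shows the property that makes this scheme attractive: for a clip made of identical frames, the inflated Conv3d reproduces the original Conv2d output.

```python
import torch
import torch.nn.functional as F


def inflate_conv2d_to_conv3d(weight_2d: torch.Tensor, temporal_size: int = 2) -> torch.Tensor:
    """Hypothetical helper: (out, in, kH, kW) -> (out, in, T, kH, kW).

    The 2D kernel is replicated along a new temporal axis and divided by T,
    so the temporal slices sum back to the original 2D kernel.
    """
    if weight_2d.dim() != 4:
        raise ValueError("expected a Conv2d weight of shape (out, in, kH, kW)")
    return weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1) / temporal_size


torch.manual_seed(0)

# Stand-in for the SigLIP2 patch-embedding weight (8 output channels instead of 1152).
w2d = torch.randn(8, 3, 14, 14, dtype=torch.float64)
w3d = inflate_conv2d_to_conv3d(w2d, temporal_size=2)

# A single frame, and a 2-frame clip built by repeating that frame.
frame = torch.randn(1, 3, 28, 28, dtype=torch.float64)
clip = frame.unsqueeze(2).repeat(1, 1, 2, 1, 1)  # (1, 3, 2, 28, 28)

# Non-overlapping patch embedding: stride equals kernel size.
out_2d = F.conv2d(frame, w2d, stride=14)            # (1, 8, 2, 2)
out_3d = F.conv3d(clip, w3d, stride=(2, 14, 14))    # (1, 8, 1, 2, 2)

# Largest elementwise gap between the 2D and inflated-3D embeddings.
max_diff = (out_3d.squeeze(2) - out_2d).abs().max().item()
print(tuple(w3d.shape), max_diff)
```

If the patch embedding has a bias, I would expect it to be copied unchanged under this scheme, since the rescaling only concerns the kernel. With option (a), by contrast, the first layer would start out uncorrelated with the pretrained features, which is why I am asking whether the choice mattered in practice.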

Were multiple strategies tested, and did the choice matter for training stability or final performance?
This is undocumented in the report and hasn't been discussed publicly. It would be useful for practitioners adapting other image encoders for video.
