How was patch_embed.proj (Conv3d) initialized from SigLIP2 weights (Conv2d)? #2087

@ymp5078

The Qwen3-VL technical report states that the vision encoder is "initialized from official pretrained [SigLIP2] checkpoints." However, SigLIP2's patch embedding is a Conv2d with weight shape (1152, 3, 14, 14), while Qwen3-VL's patch_embed.proj is a Conv3d with weight shape (1152, 3, 2, 14, 14). The shapes are incompatible, so the weights cannot be copied directly.
How was this layer handled during initialization? For example:

(a) Randomly initialized (e.g. Kaiming/Glorot), with the rest of the SigLIP2 weights transferred normally
(b) Inflated from SigLIP2 via repeat-and-rescale along the temporal axis (e.g. siglip_weight.unsqueeze(2).repeat(1,1,2,1,1) / 2)
(c) Something else

Were multiple strategies tested, and did the choice matter for training stability or final performance?
This is undocumented in the report and hasn't been discussed publicly. It would be useful for practitioners adapting other image encoders for video.
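
For concreteness, here is a minimal sketch of what I mean by option (b). It only illustrates the inflation idea and is not a claim about what Qwen3-VL actually did. The helper name `inflate_conv2d_to_conv3d` and the small channel count are made up for the example, and copying the bias unchanged is my assumption. With this scheme, a still image duplicated along the temporal axis should produce the same patch embeddings as the original Conv2d, up to floating-point error.

```python
import torch
import torch.nn as nn


def inflate_conv2d_to_conv3d(weight_2d: torch.Tensor, temporal_size: int) -> torch.Tensor:
    """Hypothetical helper: repeat a Conv2d kernel along a new temporal axis and rescale.

    weight_2d: (out_ch, in_ch, kH, kW)  ->  returns (out_ch, in_ch, T, kH, kW)
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1)
    # Dividing by T keeps the response unchanged when all T frames are identical.
    return weight_3d / temporal_size


torch.manual_seed(0)

# Small stand-in sizes; the shapes in the question would use out_ch=1152.
out_ch, in_ch, patch, t_patch = 8, 3, 14, 2

conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=patch, stride=patch)
conv3d = nn.Conv3d(
    in_ch,
    out_ch,
    kernel_size=(t_patch, patch, patch),
    stride=(t_patch, patch, patch),
)

with torch.no_grad():
    conv3d.weight.copy_(inflate_conv2d_to_conv3d(conv2d.weight, t_patch))
    conv3d.bias.copy_(conv2d.bias)  # assumption: bias is carried over as-is

# A still image, and the same image repeated as a 2-frame "video".
image = torch.randn(1, in_ch, 2 * patch, 2 * patch)
video = image.unsqueeze(2).repeat(1, 1, t_patch, 1, 1)

with torch.no_grad():
    out_2d = conv2d(image)             # (1, out_ch, 2, 2)
    out_3d = conv3d(video).squeeze(2)  # (1, out_ch, 1, 2, 2) -> (1, out_ch, 2, 2)

max_abs_diff = (out_2d - out_3d).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The equivalence only holds for identical frames. For real video the inflated kernel averages the two frames before projecting, which is part of why I am curious whether the choice mattered in practice.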
