Does Llama 3.2 vision use CLIP? #2462

Unanswered

vedantroy asked this question in Q&A

vedantroy
Mar 5, 2025

2 questions really:

In the [source code] (https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama3_2_vision/_model_builders.py) for Llama 3.2 vision, there's a component called clip_vision_encoder. I assume this is not referencing OpenAI CLIP?
Is this a correct representation of the Llama 3.2 vision architecture? I read the paper + browsed the source code, and this is what I came up with:

flowchart LR
    vision["{image}"] --> clipT[CLIP Transformer]
    clipT --> t[t Transformer]
    t --> cat[cat along feature dim]
    
    clipT -- "hidden state 1" --> cat
    clipT -- "hidden state 2" --> cat
    clipT -- "hidden state 3" --> cat
    
    cat --> MLP
    
    MLP -- "cross attention" --> decoder[Decoder]
    text["{text}"] --> decoder

Replies: 0 comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment