About the CLS token for the llama3_2_vision_encoder #2268
Description
From what I have seen, the `llama3_2_vision_encoder` uses a CLIP encoder for the images, which is an instance of `VisionTransformer`; that module in turn has a `cls_token_embedding` module, which takes `append_cls_token` as an argument.
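To make sure I am reading the code right, here is a minimal sketch of what I understand the CLS handling to be (this is my own illustration, not torchtune's actual implementation; the class and attribute names are made up):

```python
import torch
import torch.nn as nn

class CLSEmbeddingSketch(nn.Module):
    """My understanding of the CLS token handling (illustrative only)."""

    def __init__(self, embed_dim: int, append_cls_token: bool = False):
        super().__init__()
        self.cls_embedding = nn.Parameter(torch.randn(embed_dim))
        self.append_cls_token = append_cls_token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, embed_dim)
        cls = self.cls_embedding.expand(x.shape[0], 1, -1)
        if self.append_cls_token:
            return torch.cat([x, cls], dim=1)  # CLS at the end of the sequence
        return torch.cat([cls, x], dim=1)      # CLS at the beginning (the default)
```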
Since `llama3_2_vision_encoder` leaves this argument as `False`, the CLS token is added "to the beginning of the sequence", i.e. prepended to the transformer input before any self-attention happens. Does this CLS token (the first element of the `[1601, 4096]` transformer output) actually encode the image's information?
I do not know which attention mask the model uses on the vision side. I would guess it is all 1s, in which case the CLS token does attend to the whole image and should contain its contents. But if so, why does it make a difference whether the CLS token is placed at the front or the back of the input?
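To illustrate why I think the position should not matter when there is no masking, here is a quick self-contained check with plain PyTorch (no positional embeddings; names are mine): with full self-attention, the output at the CLS position is the same whether the token sits first or last.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_patches = 16, 8
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

patches = torch.randn(1, num_patches, embed_dim)
cls = torch.randn(1, 1, embed_dim)

front = torch.cat([cls, patches], dim=1)  # CLS prepended
back = torch.cat([patches, cls], dim=1)   # CLS appended

# No attention mask -> every position attends to every other position.
out_front, _ = attn(front, front, front)
out_back, _ = attn(back, back, back)

# The CLS output only depends on the *set* of keys/values it attends to,
# so index 0 in the "front" layout matches index -1 in the "back" layout.
print(torch.allclose(out_front[:, 0], out_back[:, -1], atol=1e-6))  # True
```

(I realize the real encoder also adds positional embeddings, which do depend on the index, so this only covers the masking/visibility part of my question.)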
For a research project, I am looking to get a single `(1, 4096)` vector that encodes the image's content, and I am curious whether this CLS vector will do the trick.
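Concretely, I had something like this in mind, assuming `encoder_output` is the `[1, 1601, 4096]` tensor described above with the CLS token at index 0 (variable names are mine; mean pooling over the patch tokens is my fallback if the CLS token turns out not to be useful):

```python
import torch

# Placeholder for the real encoder output described above: (batch, 1601, 4096).
encoder_output = torch.randn(1, 1601, 4096)

cls_vector = encoder_output[:, 0, :]                 # (1, 4096): the CLS token
mean_vector = encoder_output[:, 1:, :].mean(dim=1)   # (1, 4096): mean over patch tokens

print(cls_vector.shape, mean_vector.shape)
```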