
About the CLS token for the llama3_2_vision_encoder #2268

Open
@dfloreaa

Description

From what I have seen, the llama3_2_vision_encoder uses a CLIP encoder for the images, which is built on a VisionTransformer containing a cls_token_embedding module that takes append_cls_token as an argument.

Since llama3_2_vision_encoder sets this argument to False, the CLS token is added "to the beginning of the sequence", i.e. prepended to the transformer's input before self-attention. Does this CLS token (the first element of the transformer's [1601, 4096] output) actually encode the image's information?
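For clarity, here is a minimal illustration (not the actual torchtune code; the shapes and names are made up for this example) of what prepend vs. append means for the CLS token, i.e. why it ends up at sequence index 0 when append_cls_token=False:

```python
import torch

# Illustrative shapes only; the real embed dim and token count may differ.
batch, n_patches, dim = 1, 1600, 1280
patch_tokens = torch.randn(batch, n_patches, dim)
cls_token = torch.zeros(batch, 1, dim)  # stand-in for the learned CLS embedding

# append_cls_token=False -> CLS goes to the front, giving 1601 tokens
prepended = torch.cat([cls_token, patch_tokens], dim=1)
# append_cls_token=True -> CLS would go to the back instead
appended = torch.cat([patch_tokens, cls_token], dim=1)

print(prepended.shape)  # torch.Size([1, 1601, 1280]), CLS at index 0
print(appended.shape)   # torch.Size([1, 1601, 1280]), CLS at index -1
```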

I do not know which attention mask the model uses, but I would guess it is all 1s on the vision side, in which case the CLS token should indeed aggregate the image's contents. If so, why does it matter whether it is placed at the front or at the back of the input?
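To illustrate why placement does not change what the CLS token can see under full (all-ones) attention, but would under a causal mask, here is a small self-contained sketch in plain PyTorch (not torchtune internals; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq, head_dim)
q = k = v = torch.randn(1, 1, 1601, 64)

full = F.scaled_dot_product_attention(q, k, v)                    # no mask (all 1s)
causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal mask

# Under a causal mask, position 0 attends only to itself, so its output is just v[0]:
assert torch.allclose(causal[..., 0, :], v[..., 0, :], atol=1e-5)
# Under full attention, position 0 mixes information from all 1601 tokens:
assert not torch.allclose(full[..., 0, :], v[..., 0, :], atol=1e-5)
```

So with bidirectional attention the CLS token can aggregate the whole image wherever it sits; its position mainly affects which positional embedding it receives and which index you read out afterwards.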

I am looking to get a single (1, 4096) vector that encodes the image's content (for a research project), and I am curious whether this vector will do the trick.
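Concretely, something like the following is what I have in mind (a random tensor stands in for the real encoder output, so this is only a sketch of the indexing, using the [1601, 4096] shape mentioned above with a batch dimension added):

```python
import torch

# Stand-in for the real encoder output.
encoder_out = torch.randn(1, 1601, 4096)

cls_vec = encoder_out[:, 0, :]      # (1, 4096): the prepended CLS token
mean_vec = encoder_out.mean(dim=1)  # (1, 4096): mean pooling over all tokens,
                                    # as a possible fallback

print(cls_vec.shape, mean_vec.shape)
```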

Metadata

Labels

discussion, triaged
