
About the CLS token for the llama3_2_vision_encoder #2268

Open
@dfloreaa

Description

From what I have seen, the llama3_2_vision_encoder uses a CLIP encoder for the images. That encoder is an instance of VisionTransformer, which contains a cls_token_embedding module that takes append_cls_token as an argument.
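For reference, this is a minimal sketch of what such a module typically does (this is not torchtune's implementation; the class name and shapes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Minimal sketch of a CLS-token embedding, in the spirit of the module described
# above. NOT torchtune's code; the class name and shapes are assumptions.
class CLSEmbeddingSketch(nn.Module):
    def __init__(self, embed_dim: int, append_cls_token: bool = False):
        super().__init__()
        # One learned token that will summarize the whole patch sequence.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.append_cls_token = append_cls_token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, embed_dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        # append_cls_token=False -> token goes at the front of the sequence,
        # append_cls_token=True  -> token goes at the back.
        if self.append_cls_token:
            return torch.cat([x, cls], dim=1)
        return torch.cat([cls, x], dim=1)
```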

Since llama3_2_vision_encoder sets this argument to False, the CLS token is added "to the beginning of the sequence", i.e. to the transformer's input before self-attention. Does this CLS token (the first element of the transformer's [1601, 4096] output) actually encode the image's information?

I do not know which attention mask the model uses. I would guess it is all 1s on the vision side, in which case the CLS token does attend to the full image content; but if so, why would it matter whether the token is placed at the front or at the back of the input? (See the toy check below.)
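To illustrate the "all-1s mask" intuition, here is a toy check (dimensions are shrunk for the demo and are assumptions, not the real model's):

```python
import torch
import torch.nn as nn

# With no attention mask, every query position attends to every key position,
# so a token at index 0 still receives contributions from all 1600 patch tokens.
torch.manual_seed(0)
seq_len, dim = 1601, 64                         # 1 CLS token + 1600 patches
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
x = torch.randn(1, seq_len, dim)

_, weights = attn(x, x, x, need_weights=True)   # no attn_mask -> full attention
print(weights.shape)                            # (1, 1601, 1601)
print(bool((weights[0, 0] > 0).all()))          # CLS row covers every position -> True
```

If the vision mask really is all 1s, then placing the token at the front vs. the back only changes which positional embedding it picks up, not which patches it can attend to.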

I am looking to get a single (1, 4096) vector that encodes the image's content (for a research project), and I am curious whether this vector will do the trick.
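As a concrete sketch of the extraction I have in mind (the shapes mirror the [1601, 4096] output mentioned above; whether index 0 really holds the CLS output in torchtune is exactly what I am asking):

```python
import torch

# Hypothetical extraction of a single image vector from the encoder output.
# `hidden` stands in for the per-image transformer output discussed above;
# taking index 0 assumes the CLS output sits at the front of the sequence.
def cls_vector(hidden: torch.Tensor) -> torch.Tensor:
    return hidden[:, 0, :]                     # (batch, embed_dim)

hidden = torch.randn(1, 1601, 4096)            # stand-in for the real encoder output
vec = cls_vector(hidden)
print(vec.shape)                               # torch.Size([1, 4096])

# Fallback if the CLS output turns out not to be a good summary: mean-pool the
# patch tokens instead.
pooled = hidden[:, 1:, :].mean(dim=1)          # also (1, 4096)
```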


Labels

discussion, triaged
