About the CLS token for the llama3_2_vision_encoder #2268
Description
From what I have seen, the `llama3_2_vision_encoder` uses a CLIP encoder for the images, which is an instance of `VisionTransformer`; that module in turn has a `cls_token_embedding` module, which takes `append_cls_token` as an argument.
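To make sure I am reading the code right, here is a minimal sketch of what I understand the CLS handling to be (this is my own illustration, not torchtune's actual implementation; the class and attribute names are made up):

```python
import torch
import torch.nn as nn

class CLSEmbeddingSketch(nn.Module):
    """My understanding of the CLS token handling (illustrative only)."""

    def __init__(self, embed_dim: int, append_cls_token: bool = False):
        super().__init__()
        self.cls_embedding = nn.Parameter(torch.randn(embed_dim))
        self.append_cls_token = append_cls_token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patch_tokens, embed_dim)
        cls = self.cls_embedding.expand(x.shape[0], 1, -1)
        if self.append_cls_token:
            return torch.cat([x, cls], dim=1)  # CLS at the end of the sequence
        return torch.cat([cls, x], dim=1)      # CLS at the beginning (the default)
```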
Since `llama3_2_vision_encoder` leaves this argument as `False`, the CLS token is added "to the beginning of the sequence", i.e. prepended to the transformer input before any self-attention happens. Does this CLS token (the first element of the `[1601, 4096]` transformer output) actually encode the image's information?
I do not know which attention mask the model uses on the vision side. I would guess it is all 1s, in which case the CLS token does attend to the whole image and should contain its contents. But if so, why does it make a difference whether the CLS token is placed at the front or the back of the input?
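To illustrate why I think the position should not matter when there is no masking, here is a quick self-contained check with plain PyTorch (no positional embeddings; names are mine): with full self-attention, the output at the CLS position is the same whether the token sits first or last.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_patches = 16, 8
attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

patches = torch.randn(1, num_patches, embed_dim)
cls = torch.randn(1, 1, embed_dim)

front = torch.cat([cls, patches], dim=1)  # CLS prepended
back = torch.cat([patches, cls], dim=1)   # CLS appended

# No attention mask -> every position attends to every other position.
out_front, _ = attn(front, front, front)
out_back, _ = attn(back, back, back)

# The CLS output only depends on the *set* of keys/values it attends to,
# so index 0 in the "front" layout matches index -1 in the "back" layout.
print(torch.allclose(out_front[:, 0], out_back[:, -1], atol=1e-6))  # True
```

(I realize the real encoder also adds positional embeddings, which do depend on the index, so this only covers the masking/visibility part of my question.)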
For a research project, I am looking to get a single `(1, 4096)` vector that encodes the image's content, and I am curious whether this CLS vector will do the trick.
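Concretely, I had something like this in mind, assuming `encoder_output` is the `[1, 1601, 4096]` tensor described above with the CLS token at index 0 (variable names are mine; mean pooling over the patch tokens is my fallback if the CLS token turns out not to be useful):

```python
import torch

# Placeholder for the real encoder output described above: (batch, 1601, 4096).
encoder_output = torch.randn(1, 1601, 4096)

cls_vector = encoder_output[:, 0, :]                 # (1, 4096): the CLS token
mean_vector = encoder_output[:, 1:, :].mean(dim=1)   # (1, 4096): mean over patch tokens

print(cls_vector.shape, mean_vector.shape)
```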