- Split the image into fixed-size patches.
- Embed each patch into a lower-dimensional vector using a linear projection layer.
- Add positional encodings so the model retains each patch's spatial location.
- Prepend a learnable classification token to the sequence; its output will be used to make predictions.
- Initialize the classification token with random values and then train it along with the rest of the model.
- Pass the embeddings (including the classification token) through a series of encoder blocks.
- The output of the final encoder block is a sequence of contextual embeddings, one per token; the classification token's embedding serves as a global representation of the image.
- Pass the classification token's embedding through a Multi-Layer Perceptron (MLP) head to make predictions.
- ViT is a simple vision transformer architecture: it replaces the convolutional backbone of popular image classifiers with a plain transformer encoder applied directly to patch embeddings. A minimal code sketch of the steps above follows.
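
To make these steps concrete, here is a minimal PyTorch sketch. The class name `MiniViT` and all hyperparameter values are illustrative assumptions rather than the configuration from the ViT paper; the patch splitting and linear projection are fused into a single strided convolution, which applies the same linear map to every non-overlapping patch.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT-style classifier (hypothetical sizes, for illustration only)."""

    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, mlp_dim=768, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch splitting + linear projection in one step: a stride-`patch_size`
        # convolution over non-overlapping windows is equivalent to flattening
        # each patch and passing it through a shared Linear layer.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        # Learnable classification token, randomly initialized and trained
        # end to end with the rest of the model.
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        # Learnable positional embeddings: one per patch, plus one for the
        # classification token.
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim) * 0.02)
        # A stack of standard transformer encoder blocks.
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=depth)
        # MLP head that maps the classification token's embedding to logits.
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_classes))

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.patch_embed(images)              # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend the classification token
        x = x + self.pos_embed                    # add positional encodings
        x = self.encoder(x)                       # one contextual embedding per token
        return self.head(x[:, 0])                 # predict from the [CLS] embedding

logits = MiniViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 10)
```

Fusing the first two steps into a strided convolution is a common implementation choice: because the kernel and stride both equal the patch size, the windows never overlap, so the convolution performs exactly the per-patch linear projection the list describes.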