The Embedder module in gemma/gm/nn/_modules.py implements an encode_vision method that acts as the critical "bridge" for multimodal inference: it projects visual features (e.g., from SigLIP) into the Transformer's unified embedding space using an RMSNorm followed by an Einsum projection.
There are currently no dedicated unit tests for this path, as noted by the TODO at line 74 of gemma/gm/nn/_modules_test.py.
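For orientation, the projection amounts roughly to an RMS normalization followed by a single einsum against a [vision_proj_dim, embed_dim] matrix. The snippet below is only an illustrative sketch of that shape flow, not the module's actual code; the function names and the (1 + scale) convention are assumptions.

```python
import jax.numpy as jnp


def rms_norm(x, scale, eps=1e-6):
  # Root-mean-square normalization with a learned per-channel scale
  # (assumed here to be applied as 1 + scale, as in other Gemma norms).
  var = jnp.mean(jnp.square(x), axis=-1, keepdims=True)
  return x * (1 + scale) / jnp.sqrt(var + eps)


def project_vision(features, norm_scale, projection):
  # features:   [batch, num_tokens, vision_proj_dim]  (e.g. SigLIP outputs)
  # projection: [vision_proj_dim, embed_dim]
  normed = rms_norm(features, norm_scale)
  # Project each visual token into the Transformer's embedding space.
  return jnp.einsum('...d,de->...e', normed, projection)  # [batch, num_tokens, embed_dim]
```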
Goal:

- Implement a robust test suite for encode_vision.
- Verify that initializing the Embedder with vision_proj_dim correctly creates the mm_input_projection and mm_soft_embedding_norm parameters.
- Ensure that visual tokens are correctly projected to the model's embed_dim (see the sketch after this list).
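A minimal shape-level sketch of such a test follows. It assumes the Embedder constructor accepts vocab_size, embed_dim, and vision_proj_dim, that encode_vision takes features shaped [batch, num_tokens, vision_proj_dim], and that the vision parameters appear under the names above in the Flax param tree; the exact signatures and param layout should be confirmed against _modules.py before landing the test.

```python
import jax
import jax.numpy as jnp

from gemma.gm.nn import _modules


def test_encode_vision_projects_to_embed_dim():
  # Hypothetical small dimensions; vocab_size is only needed to build the module.
  embedder = _modules.Embedder(
      vocab_size=128,
      embed_dim=16,
      vision_proj_dim=8,
  )
  visual_features = jnp.ones((2, 4, 8))  # [batch, num_vision_tokens, vision_proj_dim]

  variables = embedder.init(
      jax.random.PRNGKey(0),
      visual_features,
      method=embedder.encode_vision,
  )

  # Setting vision_proj_dim should create the vision-specific parameters.
  assert 'mm_input_projection' in variables['params']
  assert 'mm_soft_embedding_norm' in variables['params']

  out = embedder.apply(variables, visual_features, method=embedder.encode_vision)
  # Visual tokens should land in the model's embedding space.
  assert out.shape == (2, 4, 16)
```

A negative case could complement this: constructing the Embedder without vision_proj_dim and checking that the vision parameters are absent (and, if the module raises on encode_vision in that configuration, asserting that error).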