This repository contains an implementation of PaliGemma, a multimodal (Vision) language model written from scratch in PyTorch. The code is based on the tutorial video ‘Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation’.
The main goal of the project is to build a deeper understanding of how multimodal models work internally and to improve PyTorch programming skills.
Before you start, make sure you have the following dependencies installed:
- python = ^3.11
- torch = ^2.5.1
- numpy = ^2.2.1
- pillow = ^11.1.0
- fire = ^0.7.0
- transformers = ^4.48.0
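To confirm that an environment matches the list above, a small check like the following can help. This is a minimal sketch and not part of the repository: the names `REQUIRED`, `parse_release`, `satisfies_caret`, and `check_dependencies` are hypothetical, and the caret rule is a simplified approximation of Poetry's (it only looks at the numeric release and ignores pre-release or local tags such as `+cu121`).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency list above (caret constraints).
# This mapping is an illustration, not a file shipped with the project.
REQUIRED = {
    "torch": "2.5.1",
    "numpy": "2.2.1",
    "pillow": "11.1.0",
    "fire": "0.7.0",
    "transformers": "4.48.0",
}


def parse_release(text):
    """Return the leading numeric release of a version string as a 3-tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    parts = [int(p) for p in match.group().split(".")] if match else []
    return tuple((parts + [0, 0, 0])[:3])


def satisfies_caret(installed, minimum):
    """Simplified caret rule: at least `minimum`, and every part up to and
    including the leftmost non-zero part of `minimum` is unchanged."""
    have, need = parse_release(installed), parse_release(minimum)
    if have < need:
        return False
    for h, n in zip(have, need):
        if h != n:
            return False
        if n != 0:
            return True
    return True


def check_dependencies(required=REQUIRED):
    """Map each package name to (installed version or None, constraint met)."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = (None, False)
        else:
            report[name] = (installed, satisfies_caret(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_dependencies().items():
        status = "ok" if ok else "check"
        print(f"{name}: installed={installed or 'missing'} "
              f"required=^{REQUIRED[name]} [{status}]")
```

Run inside the project environment, the script prints one line per package. A package reported as `missing` or `check` suggests the environment was not created from the project's lock file.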
- Clone the repository:
git clone https://github.com/vlvink/PaliGemma-from-scratch.git
cd PaliGemma-from-scratch
- Install the dependencies:
poetry install
- Activate the Poetry environment:
poetry shell
- Run inference:
./launch_inference.sh