PaliGemma: Multimodal Vision Language Model

About the project

This repository contains a from-scratch PyTorch implementation of PaliGemma, a multimodal vision-language model. The code follows the tutorial video ‘Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation’.

The main goal of the project is to build a deeper understanding of how multimodal models work internally and to improve PyTorch programming skills.

How to start

Prerequisites

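The project is managed with Poetry and depends on the packages below (versions are given as Poetry caret constraints). Make sure Python and Poetry are available before you start; the install step further down pulls in the rest: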

- python = ^3.11
- torch = ^2.5.1
- numpy = ^2.2.1
- pillow = ^11.1.0
- fire = ^0.7.0
- transformers = ^4.48.0
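To confirm that an environment meets these minimums, a small check along the following lines may help. It is only a sketch and not part of the repository: the helper names `parse_version` and `check_environment` are hypothetical, the minimum versions are copied from the list above, and the comparison looks only at the lower bound (a caret constraint also caps the major version, which is not enforced here).

```python
import sys
from importlib import metadata

# Lower bounds copied from the dependency list above.
# Assumption: only the minimum is checked, not the caret upper bound.
MINIMUMS = {
    "torch": (2, 5, 1),
    "numpy": (2, 2, 1),
    "pillow": (11, 1, 0),
    "fire": (0, 7, 0),
    "transformers": (4, 48, 0),
}


def parse_version(text):
    """Turn a string such as '2.5.1+cu121' into (2, 5, 1).

    Local build tags after '+' and non-numeric suffixes such as 'rc1'
    are ignored; missing components are padded with zeros.
    """
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def check_environment(minimums=MINIMUMS):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if sys.version_info < (3, 11):
        found = ".".join(str(n) for n in sys.version_info[:3])
        problems.append(f"python: found {found}, need at least 3.11")
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if parse_version(installed) < minimum:
            needed = ".".join(str(n) for n in minimum)
            problems.append(f"{name}: found {installed}, need at least {needed}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["All listed dependencies look satisfied."]:
        print(line)
```

Running the script inside the Poetry environment prints one line per missing or outdated package, or a single confirmation line when nothing is flagged.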

Clone the repository:

git clone https://github.com/vlvink/PaliGemma-from-scratch.git
cd PaliGemma-from-scratch

Install the dependencies:

poetry install

Activate the Poetry environment:

poetry shell

Running the Code

./launch_inference.sh

