Vision-language models for automated laboratory workflow documentation
Semester Project | EPFL AI Team | Fall 2025
A research collaboration between Annaelle Myriam Benlamri (MSc Data Science) and Marcus Hamelink (BSc Computer Science), supervised by Prof. Andrea Cavallaro.
Presented at: AMLD 2026 | Project Page: epflaiteam.ch/projects/iris
Manual documentation in laboratories is a tedious and error-prone task. IRIS addresses this by equipping researchers with a wearable camera that streams first-person video to a remote server, where a vision-language model generates structured logs of the procedure in near real-time.
The project was developed in two parallel research tracks, both using Qwen2.5-VL as a foundation but exploring different strategies for making it understand laboratory actions. The system was demonstrated on colony counting workflows at CHUV (Lausanne University Hospital), a procedure where researchers typically process over 30 petri dishes at a time, counting and manually transcribing results across hours of repetitive work.
The code for each track lives in sft-vlm-finetune/ (Marcus) and vlm_fusion/ (Annaelle), with the demo pipeline in src/iris/.
Annaelle's track investigated specialized video action recognition models and two strategies for integrating them with a VLM, evaluated on the FineBio dataset.
- Backbone: VideoMAE V2 (ViT-Base distilled from ViT-Giant, https://huggingface.co/OpenGVLab/VideoMAE2/tree/main/distill), pretrained on 1.35M unlabeled clips and fine-tuned on the processed FineBio dataset.
- Fusion strategy 1, prompt injection: Top-k predicted actions from VideoMAE V2 are formatted as structured context and injected into Qwen2.5-VL's prompt.
- Fusion strategy 2, deep fusion: VideoMAE V2 spatiotemporal tokens are compressed via a Perceiver Resampler and projected directly into Qwen2.5-VL's embedding space via a trainable MLP, with both backbones frozen.
- Code: vlm_fusion/
- Models on HuggingFace: videomaev2-finetuned-finebio, videomae-qwen-connectors
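The prompt-injection strategy can be illustrated with a minimal sketch. It assumes the action recognizer returns `(label, confidence)` pairs; the function names, the hint wording, and the example labels are hypothetical and are not taken from vlm_fusion/.

```python
from typing import Sequence


def build_action_context(
    predictions: Sequence[tuple[str, float]],
    k: int = 3,
) -> str:
    """Format the k most confident action predictions as a text block."""
    # Sort by confidence, highest first, and keep the top k.
    top_k = sorted(predictions, key=lambda p: p[1], reverse=True)[:k]
    lines = [
        f"{rank}. {label} (confidence {score:.2f})"
        for rank, (label, score) in enumerate(top_k, start=1)
    ]
    return "Action recognition hints (may be wrong):\n" + "\n".join(lines)


def build_prompt(
    predictions: Sequence[tuple[str, float]],
    question: str,
    k: int = 3,
) -> str:
    """Prepend the structured hints to the instruction sent to the VLM."""
    return f"{build_action_context(predictions, k)}\n\n{question}"


if __name__ == "__main__":
    # Hypothetical labels and scores, for illustration only.
    preds = [("take pipette", 0.61), ("open tube", 0.07), ("insert tip", 0.22)]
    print(build_prompt(preds, "Describe the action shown in the clip.", k=2))
```

Presenting the predictions as hints rather than facts leaves the VLM free to disagree with the recognizer when the visual evidence points elsewhere.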
Marcus's track built the full streaming pipeline, from hardware to inference server, and fine-tuned Qwen2.5-VL (3B) on the FineBio dataset with supervised fine-tuning to produce structured descriptions of laboratory actions.
- Pipeline: Raspberry Pi 5 client (camera capture and WebSocket streaming) to a FastAPI inference server (async producer-consumer queue) with a React frontend showing live results and session management
- Fine-tuning: LoRA (r=16, alpha=32) on 9K stratified FineBio samples, trained to output structured JSON descriptions of laboratory actions
- Two operational modes: live streaming with a few seconds of inference latency, and batch analysis of pre-recorded video with automated report generation
- Report: Marcus-Hamelink-IRIS-Report.pdf
- Model on HuggingFace: animarcus/iris-qwen2.5-vl-3b-finebio
- Inference server: src/iris/server/
- Client backend and frontend: src/iris/client/
- VLM fine-tuning: sft-vlm-finetune/ (dataset prep, training scripts, evaluation)
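The async producer-consumer queue mentioned in the pipeline description can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in src/iris/server/: `fake_inference`, `run_session`, and the `max_pending` parameter are hypothetical, and the producer stands in for a WebSocket handler receiving frames.

```python
import asyncio


async def fake_inference(frame: bytes) -> str:
    """Placeholder for the VLM call; a real server would run the model here."""
    await asyncio.sleep(0.01)
    return f"processed {len(frame)} bytes"


async def producer(queue: asyncio.Queue, frames: list[bytes]) -> None:
    """Stand-in for the WebSocket handler that receives frames from the client."""
    for frame in frames:
        await queue.put(frame)  # waits while the queue is full (backpressure)
    await queue.put(None)  # sentinel: the stream has ended


async def consumer(queue: asyncio.Queue, results: list[str]) -> None:
    """Pull frames off the queue one at a time and run inference on each."""
    while True:
        frame = await queue.get()
        if frame is None:
            break
        results.append(await fake_inference(frame))


async def run_session(frames: list[bytes], max_pending: int = 4) -> list[str]:
    """Run one producer and one consumer over a bounded queue."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
    results: list[str] = []
    await asyncio.gather(producer(queue, frames), consumer(queue, results))
    return results


if __name__ == "__main__":
    print(asyncio.run(run_session([b"a" * 10, b"b" * 20, b"c" * 30])))
```

The bounded queue decouples frame arrival from inference speed. This sketch blocks the producer when the queue is full; a live system might instead drop stale frames to keep latency low.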
Live mode streams from the camera in real time, with results appearing as each inference completes. Analysis mode runs on a pre-recorded video and produces a timeline visualization and a generated report of the procedure.
Left: live documentation mode. Right: analysis mode with timeline and report generation.
Detailed demo video can be found here
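One plausible way to turn per-window predictions into the timeline shown in analysis mode is to merge consecutive windows that share a label. The sketch below assumes fixed-length, non-overlapping windows; the `Segment` type, `build_timeline`, and the example labels are hypothetical and may differ from what src/iris/ does.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    action: str
    start_s: float
    end_s: float


def build_timeline(window_labels: list[str], window_s: float) -> list[Segment]:
    """Merge consecutive windows with the same label into one segment."""
    timeline: list[Segment] = []
    for i, label in enumerate(window_labels):
        start, end = i * window_s, (i + 1) * window_s
        if timeline and timeline[-1].action == label:
            timeline[-1].end_s = end  # extend the running segment
        else:
            timeline.append(Segment(label, start, end))
    return timeline


if __name__ == "__main__":
    # Hypothetical labels for four 2-second windows.
    for seg in build_timeline(["count", "count", "transcribe", "count"], 2.0):
        print(f"{seg.start_s:>5.1f}s to {seg.end_s:>5.1f}s  {seg.action}")
```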
The system is designed to run with the inference server on a GPU machine (the project used EPFL's Izar and RCP clusters) and the client on a local machine or Raspberry Pi. It can be run fully locally if the machine has sufficient GPU memory to load the model.
git clone https://github.com/EPFL-AI-Team/IRIS
cd IRIS
uv sync
# Terminal 1 - inference server (GPU required)
uv run iris-server
# Terminal 2 - client and web interface
uv run iris-client

Web interface available at http://localhost:8006. For full setup, configuration options, and Raspberry Pi instructions, see docs/setup.md.
| Document | Description |
|---|---|
| docs/setup.md | Local setup and Raspberry Pi configuration |
| docs/cluster-setup.md | Running on EPFL Izar and RCP clusters |
| docs/rcp-guide.md | VLM training, evaluation, and inference CLI reference |
| docs/API.md | REST and WebSocket API reference |
| sft-vlm-finetune/ | VLM fine-tuning - dataset prep, training, evaluation |
| vlm_fusion/ | Action recognition and deep fusion architecture |
- Supervisor: Prof. Andrea Cavallaro (EPFL AI Team)
- Track Lead: Louis Vasseur (EPFL AI Team)
- Domain expertise and videos: CHUV (Lausanne University Hospital)
License: MIT License



