[Preprint] [Models on Hugging Face] [Project Page]
This repository contains inference code for pretrained CASA models; we plan to release training code in the future. Alongside the code, we release several CASA variants with 2B to 3B parameters, which can be found in the following Hugging Face collection. Below you can find example code to run the models, evaluate them, and use them for live captioning of videos.
For more technical details on CASA, see our project page and preprint.
Figure: (i) standard self-attention layer vs. (ii) CASA layer with local attention windows.
CASA is a vision-language fusion paradigm that aims to improve on cross-attention while preserving its practical benefits. Specifically, CASA layers inject visual tokens into the text stream using image-to-text cross-attention while additionally enabling text-to-text self-attention in the same layer, constrained to smaller local attention windows. This simple modification enables natural gating in the cross-attention mechanism, improving its performance and substantially closing the gap to standard token insertion methods.
CASA models process and fuse vision and text inputs through two mechanisms:
- (i) Standard self-attention layers process only the text tokens, over the full context of the current sequence (right).
- (ii) CASA layers process both text and image tokens, but within local attention windows (left). The windows are defined by the points at which images occur in the stream, e.g. between two video frames. To improve efficiency, CASA layers also leverage asymmetric block-wise attention implemented with FlashAttention (see the illustrative sketch below).
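To make the local windowing concrete, here is a minimal, illustrative sketch of how such a mask could be derived from the positions of image tokens in the stream. It is a dense approximation for intuition only, with assumed conventions (0 = text token, 1 = image token); it is not the released asymmetric block-wise FlashAttention implementation.

```python
import torch

def casa_window_mask(token_types: torch.Tensor) -> torch.Tensor:
    """Dense sketch of a CASA-style local attention mask (illustration only).

    token_types: 1D tensor with 0 for text tokens and 1 for image tokens.
    A new window starts at each image block, so every token attends causally
    to the image and text tokens of its own window only.
    """
    is_image = token_types == 1
    # True at the first token of each contiguous image block.
    starts_block = is_image & ~torch.cat([torch.tensor([False]), is_image[:-1]])
    # Tokens from the k-th image block onwards (until the next one) share window k.
    window_id = torch.cumsum(starts_block.long(), dim=0)
    same_window = window_id[:, None] == window_id[None, :]
    causal = torch.tril(torch.ones(len(token_types), len(token_types), dtype=torch.bool))
    return same_window & causal

# Example: [image, image, text, text, image, text] -> two local windows.
print(casa_window_mask(torch.tensor([1, 1, 0, 0, 1, 0])))
```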
We release CASA models for both image-based and video streaming settings, based on two backbones:
- Helium1-2B, a text-only LLM that we fully fine-tune alongside the CASA layers, producing a VLM that uses CASA fusion rather than token insertion.
- Qwen2.5-VL-3B, a pretrained VLM which originally handles visual inputs by directly adding image tokens to its token stream. In this setting we keep the backbone VLM frozen and adapt it to CASA by training only the additional CASA layers.
In both cases, images are embedded using the Qwen2.5-VL visual encoder, whose last four blocks are fine-tuned before feeding visual features into CASA.
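As a rough illustration of this adaptation setup (the frozen Qwen2.5-VL setting), the sketch below shows how trainable parameters could be selected; the attribute names `model.visual.blocks` and `model.casa_layers` are hypothetical and may not match the released code.

```python
# Illustrative sketch only: freeze the backbone VLM and train just the CASA
# layers plus the last four blocks of the visual encoder. Attribute names are
# hypothetical placeholders, not the actual module names of the released models.
def select_trainable_parameters(model, num_visual_blocks: int = 4):
    for param in model.parameters():
        param.requires_grad = False  # freeze everything by default
    for block in model.visual.blocks[-num_visual_blocks:]:
        for param in block.parameters():
            param.requires_grad = True  # unfreeze the last visual encoder blocks
    for param in model.casa_layers.parameters():
        param.requires_grad = True  # unfreeze the CASA fusion layers
    return [p for p in model.parameters() if p.requires_grad]
```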
All models we release are first trained on a combination of:
- the FineVision dataset, and
- a subset of LLaVA-OneVision-1.5.
These two datasets together cover a wide range of tasks including image captioning, document and chart understanding, and general visual question answering.
🔹 CASA models
We release `kyutai/CASA-Helium1-VL-2B` and `kyutai/CASA-Qwen2_5-VL-3B`, pretrained on this image-based training set.
🔹 Token insertion baseline
In addition to CASA-based models, we release `kyutai/Helium1-VL-2B`, a VLM trained from Helium1-2B with direct token insertion. `Helium1-VL-2B` achieves state-of-the-art performance among insertion-based models of comparable size trained on publicly available datasets.
For live video captioning, we further fine-tune our CASA-Qwen2_5-VL-3B model on the Live-WhisperX-526K dataset, an instruction-style video dataset for live captioning consisting of video frames sampled at 2 fps interleaved with the corresponding text transcripts of the original video audio.
🔹 LiveCC CASA models
We release `CASA-Qwen2_5-VL-3B-LiveCC`, further fine-tuned on Live-WhisperX-526K for live streaming.
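For intuition, an interleaved live-captioning sample can be thought of as alternating frame and text segments; the structure below is a purely hypothetical sketch (paths, field names, and layout are illustrative, not the actual Live-WhisperX-526K schema).

```python
# Hypothetical sketch of an interleaved live-captioning stream: frames sampled
# at 2 fps alternate with the transcript text spoken up to the next frame.
live_stream = [
    {"type": "image", "image": "frames/frame_0000.jpg"},  # t = 0.0 s
    {"type": "text", "text": "This video shows a fox in the Arctic."},
    {"type": "image", "image": "frames/frame_0001.jpg"},  # t = 0.5 s
    {"type": "text", "text": "The Arctic is covered by ice and snow year-round."},
]
```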
We recommend using uv to set up and run the code, as it will manage all Python dependencies for you transparently.
uv is provided as a lightweight binary which can be installed as follows:
curl -LsSf https://astral.sh/uv/install.sh | sh

We provide a pyproject.toml with the minimal dependencies required to run inference with CASA models.
Below is a short snippet to show you how to load our models, process inputs, and run inference, using a standard HuggingFace transformers pipeline and chat template.
import torch
from transformers.models.auto.modeling_auto import AutoModel
from transformers.models.auto.processing_auto import AutoProcessor
model_id = "kyutai/CASA-Helium1-VL-2B"
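# Load the CASA model in bfloat16 with FlashAttention-2 and move it to the GPU.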
model = AutoModel.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
trust_remote_code=True,
).cuda()
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
)
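# Build a chat-style conversation containing one image and a text prompt.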
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "assets/casa_model.png",
},
{
"type": "text",
"text": "Describe this image.",
},
],
},
]
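# Tokenize the conversation with the model's chat template and move the inputs to the model device.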
inputs = processor.tokenize_messages(messages=conversation)
inputs = inputs.to(model.device)
input_len = inputs["input_ids"].shape[1]
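# Generate a response conditioned on the image, keeping only the newly generated tokens.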
output_ids = model.generate_from_image(
**inputs,
max_new_tokens=512,
pre_image_tokens=processor.pre_image_tokens,
post_image_tokens=processor.post_image_tokens,
eos_token_id=model.generation_config.eos_token_id,
)[0, input_len:]
response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)

We provide a script to caption a video using our CASA-Qwen2_5-VL-3B-LiveCC model and generate the resulting video with subtitles embedded at the actual time they are generated.
Note that you will also need to install ffmpeg for this script to run. The Python dependencies are handled with uv.
# Script options
uv run scripts/gen_livecc_subtitles.py --help
# Generation with Qwen2.5VL+CASA
uv run scripts/gen_livecc_subtitles.py --sample_path path_to_video.mp4 --srt True --temp 0.0
# For long videos, you can also tweak the repetition penalty more precisely
uv run scripts/gen_livecc_subtitles.py --sample_path path_to_long_video.mp4 --repetition_penalty 1.15 --repetition_penalty_max_count 10 --repetition_penalty_decay 0.9

Additional qualitative samples are available on our associated project page.
casa_readme_sample.mp4
The input video is taken from the Animal Kingdom dataset, and the subtitles displayed are generated with CASA-Qwen2_5-VL-3B-LiveCC.
Specifically, video frames are extracted at 2 fps, and subtitles are displayed in real time at the timestamp at which they are generated.
Transcript: "This video shows a fox in the Arctic. The Arctic is an area of Earth that's covered by ice and snow year -round, and it gets very cold there. Foxes are adapted to live in this cold environment because they have a thick layer of fur to keep them warm when they're out in the snow. This fox is walking through the snow and looking around for food or maybe just for safety from predators like wolves or bears that might be around. Foxes are also known for their ability to jump really high and"
We also provide a script for reproducing our reported results on standard VLM benchmarks. We use lmms-eval as our main evaluation pipeline.
# Display command options
uv run scripts/inference.py --help
# Run inference on the ai2d dataset for the Helium1+CASA model
uv run scripts/inference.py CASA-Helium1-VL-2B --dataset_name ai2d
# Evaluate on all datasets sequentially
bash script/eval.sh CASA-Helium1-VL-2B

Using this pipeline, we evaluate our models CASA-Helium1-VL-2B, Helium1-VL-2B, and CASA-Qwen2_5-VL-3B
on a range of benchmarks covering document understanding (DocVQA), chart understanding (ChartQA, InfoVQA),
visual text reading (TextVQA, OCRBench), and general QA (RealWorldQA, AI2D, GQA, MME). Results are reported below. Please refer to our project page and arXiv preprint for additional evaluations.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Helium1-VL-2B | 81.6 | 89.1 | 61.8 | 728 | 75.5 | 59.9 | 67.7 | 55.5 | 1732 |
| CASA-Helium1-VL-2B | 73.4 | 83.7 | 48.6 | 723 | 71.0 | 58.3 | 63.3 | 54.6 | 1572 |
| mPLUG-Owl3 8B | 59.2† | 55.9† | 36.8† | 527† | 69.0 | 63.9† | 73.4 | 65.0 | 1940† |
| mPLUG-Owl3 2B | 48.5† | 48.2† | 28.1† | 450† | 62.6 | 56.9† | 62.6 | 61.0 | 1551† |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Helium1-VL-2B compared to a recent cross-attention baseline and to our token-insertion model Helium1-VL-2B trained in the same conditions. CASA outperforms current SoTA cross-attention-based VLMs, narrowing the gap to insertion-based approaches.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 84.0 | 93.6 | 77.1 | 797 | 79.3 | 62.2† | 81.6 | 61.0† | 2249† |
| CASA-Qwen2_5-VL-3B | 82.4 | 88.9 | 59.6 | 790 | 77.4 | 62.5 | 75.1 | 59.4 | 1918 |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Qwen2_5-VL-3B, adapted from frozen Qwen2.5-VL. CASA reaches performance close to the original insertion-based model while training only the CASA layers and the last blocks of the image encoder.
The present code is provided under the MIT license.
The weights for the models are released under the CC-BY-NC-SA 4.0 license.
Some of the model weights include weights from the Qwen2.5-VL-3B model (namely, the image encoder for CASA-Helium1-VL-2B and Helium1-VL-2B, as well as the VLM backbone for CASA-Qwen2_5-VL-3B and CASA-Qwen2_5-VL-3B-LiveCC). Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
If you use CASA in your research, please cite our work:
@article{kyutai2025casa,
author = {Moritz B\"ohle and Am\'elie Royer and Juliette Marrie and Edouard Grave and Patrick P\'erez},
year = {2025},
title = {CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion},
journal = {ArXiv},
url = {https://arxiv.org/abs/2512.19535}
}



