[Preprint] [Models on Hugging Face] [Project Page]
This repository contains inference code for pretrained CASA models; we plan to release training code in the future. Alongside the code, we release several CASA variants with 2B to 3B parameters, which can be found in the following Hugging Face collection. Below you can find example code to run the models, evaluate them, and use them for live captioning of videos.
For more technical details on CASA, see our project page and preprint.
Figure: (i) standard self-attention layer vs. (ii) CASA layer with local attention windows.
CASA is a vision-language fusion paradigm that aims to improve on cross-attention while preserving its practical benefits. Specifically, CASA layers inject visual tokens into the text stream using image-to-text cross-attention while additionally enabling text-to-text self-attention in the same layer, constrained to smaller local attention windows. This simple modification enables natural gating in the cross-attention mechanism, improving its performance and substantially closing the gap to standard token insertion methods.
CASA models process and fuse vision and text inputs through two mechanisms:
- (i) Standard self-attention layers process only the text tokens, over the full context of the current sequence (right).
- (ii) CASA layers process both text and image tokens, but within local attention windows (left). The windows are defined by the points at which images occur in the stream, e.g. between two video frames. To improve efficiency, CASA layers also leverage asymmetric block-wise attention implemented with FlashAttention (see the illustrative sketch below).
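To make the local windowing concrete, here is a minimal, illustrative sketch of how such a mask could be derived from the positions of image tokens in the stream. It is a dense approximation for intuition only, with assumed conventions (0 = text token, 1 = image token); it is not the released asymmetric block-wise FlashAttention implementation.

```python
import torch

def casa_window_mask(token_types: torch.Tensor) -> torch.Tensor:
    """Dense sketch of a CASA-style local attention mask (illustration only).

    token_types: 1D tensor with 0 for text tokens and 1 for image tokens.
    A new window starts at each image block, so every token attends causally
    to the image and text tokens of its own window only.
    """
    is_image = token_types == 1
    # True at the first token of each contiguous image block.
    starts_block = is_image & ~torch.cat([torch.tensor([False]), is_image[:-1]])
    # Tokens from the k-th image block onwards (until the next one) share window k.
    window_id = torch.cumsum(starts_block.long(), dim=0)
    same_window = window_id[:, None] == window_id[None, :]
    causal = torch.tril(torch.ones(len(token_types), len(token_types), dtype=torch.bool))
    return same_window & causal

# Example: [image, image, text, text, image, text] -> two local windows.
print(casa_window_mask(torch.tensor([1, 1, 0, 0, 1, 0])))
```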
We release CASA models for both image-based and video streaming settings, based on two backbones:
- Helium1-2B, a text-only LLM that we fully fine-tune alongside the CASA layers, producing a VLM that uses CASA fusion rather than token insertion.
- Qwen2.5-VL-3B, a pretrained VLM which originally handles visual inputs by directly adding image tokens to its token stream. In this setting we keep the backbone VLM frozen and adapt it to CASA by training only the additional CASA layers.
In both cases, images are embedded using the Qwen2.5-VL visual encoder, whose last four blocks are fine-tuned before feeding visual features into CASA.
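As a rough illustration of this adaptation setup (the frozen Qwen2.5-VL setting), the sketch below shows how trainable parameters could be selected; the attribute names `model.visual.blocks` and `model.casa_layers` are hypothetical and may not match the released code.

```python
# Illustrative sketch only: freeze the backbone VLM and train just the CASA
# layers plus the last four blocks of the visual encoder. Attribute names are
# hypothetical placeholders, not the actual module names of the released models.
def select_trainable_parameters(model, num_visual_blocks: int = 4):
    for param in model.parameters():
        param.requires_grad = False  # freeze everything by default
    for block in model.visual.blocks[-num_visual_blocks:]:
        for param in block.parameters():
            param.requires_grad = True  # unfreeze the last visual encoder blocks
    for param in model.casa_layers.parameters():
        param.requires_grad = True  # unfreeze the CASA fusion layers
    return [p for p in model.parameters() if p.requires_grad]
```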
All models we release are first trained on a combination of:
- the FineVision dataset, and
- a subset of LLaVA-OneVision-1.5.
These two datasets together cover a wide range of tasks including image captioning, document and chart understanding, and general visual question answering.
🔹 CASA models
We release `kyutai/CASA-Helium1-VL-2B` and `kyutai/CASA-Qwen2_5-VL-3B`, pretrained on this image-based training set.
🔹 Token insertion baseline
In addition to CASA-based models, we release `kyutai/Helium1-VL-2B`, a VLM trained from Helium1-2B with direct token insertion. `Helium1-VL-2B` achieves state-of-the-art performance among insertion-based models of comparable size trained on publicly available datasets.
For live video captioning, we further fine-tune our CASA-Qwen2_5-VL-3B model on the Live-WhisperX-526K dataset, an instruction-style video dataset for live captioning consisting of video frames sampled at 2 fps interleaved with the corresponding text transcripts of the original video audio.
🔹 LiveCC CASA models
We release `CASA-Qwen2_5-VL-3B-LiveCC`, further fine-tuned on Live-WhisperX-526K for live streaming.
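For intuition, an interleaved live-captioning sample can be thought of as alternating frame and text segments; the structure below is a purely hypothetical sketch (paths, field names, and layout are illustrative, not the actual Live-WhisperX-526K schema).

```python
# Hypothetical sketch of an interleaved live-captioning stream: frames sampled
# at 2 fps alternate with the transcript text spoken up to the next frame.
live_stream = [
    {"type": "image", "image": "frames/frame_0000.jpg"},  # t = 0.0 s
    {"type": "text", "text": "This video shows a fox in the Arctic."},
    {"type": "image", "image": "frames/frame_0001.jpg"},  # t = 0.5 s
    {"type": "text", "text": "The Arctic is covered by ice and snow year-round."},
]
```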
We recommend using uv to set up and run the code, as it will manage all Python dependencies for you transparently.
uv is provided as a lightweight binary which can be installed as follows:
curl -LsSf https://astral.sh/uv/install.sh | sh

We provide a pyproject.toml with the minimal dependencies required to run inference with CASA models.
Below is a short snippet to show you how to load our models, process inputs, and run inference, using a standard HuggingFace transformers pipeline and chat template.
import torch
from transformers.models.auto.modeling_auto import AutoModel
from transformers.models.auto.processing_auto import AutoProcessor
model_id = "kyutai/CASA-Helium1-VL-2B"
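# Load the CASA model in bfloat16 with FlashAttention-2 and move it to the GPU.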
model = AutoModel.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
trust_remote_code=True,
).cuda()
processor = AutoProcessor.from_pretrained(
model_id,
trust_remote_code=True,
)
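# Build a chat-style conversation containing one image and a text prompt.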
conversation = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "assets/casa_model.png",
},
{
"type": "text",
"text": "Describe this image.",
},
],
},
]
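# Tokenize the conversation with the model's chat template and move the inputs to the model device.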
inputs = processor.tokenize_messages(messages=conversation)
inputs = inputs.to(model.device)
input_len = inputs["input_ids"].shape[1]
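# Generate a response conditioned on the image, keeping only the newly generated tokens.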
output_ids = model.generate_from_image(
**inputs,
max_new_tokens=512,
pre_image_tokens=processor.pre_image_tokens,
post_image_tokens=processor.post_image_tokens,
eos_token_id=model.generation_config.eos_token_id,
)[0, input_len:]
response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)

We provide a script to caption a video using our CASA-Qwen2_5-VL-3B-LiveCC model and generate the resulting video with subtitles embedded at the actual time they are generated.
Note that you will also need to install ffmpeg for this script to run. The Python dependencies are handled with uv.
# Script options
uv run scripts/gen_livecc_subtitles.py --help
# Generation with Qwen2.5VL+CASA
uv run scripts/gen_livecc_subtitles.py --sample_path path_to_video.mp4 --srt True --temp 0.0
# For long videos, you can also tweak the repetition penalty more precisely
uv run scripts/gen_livecc_subtitles.py --sample_path path_to_long_video.mp4 --repetition_penalty 1.15 --repetition_penalty_max_count 10 --repetition_penalty_decay 0.9

Additional qualitative samples are available on our associated project page.
casa_readme_sample.mp4
The input video is taken from the Animal Kingdom dataset, and the subtitles displayed are generated with CASA-Qwen2_5-VL-3B-LiveCC.
Specifically, video frames are extracted at 2 fps, and subtitles are displayed in real time at the timestamp at which they are generated.
Transcript: "This video shows a fox in the Arctic. The Arctic is an area of Earth that's covered by ice and snow year -round, and it gets very cold there. Foxes are adapted to live in this cold environment because they have a thick layer of fur to keep them warm when they're out in the snow. This fox is walking through the snow and looking around for food or maybe just for safety from predators like wolves or bears that might be around. Foxes are also known for their ability to jump really high and"
We also provide a script for reproducing our reported results on standard VLM benchmarks. We use lmms-eval as our main evaluation pipeline.
# Display command options
uv run scripts/inference.py --help
# Run inference on the ai2d dataset for the Helium1+CASA model
uv run scripts/inference.py CASA-Helium1-VL-2B --dataset_name ai2d
# Evaluate on all datasets sequentially
bash script/eval.sh CASA-Helium1-VL-2B

Using this pipeline, we evaluate our models CASA-Helium1-VL-2B, Helium1-VL-2B, and CASA-Qwen2_5-VL-3B
on a range of benchmarks covering document understanding (DocVQA), chart understanding (ChartQA, InfoVQA),
visual text reading (TextVQA, OCRBench), and general QA (RealWorldQA, AI2D, GQA, MME). Results are reported below. Please refer to our project page and arXiv preprint for additional evaluations.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Helium1-VL-2B | 81.6 | 89.1 | 61.8 | 728 | 75.5 | 59.9 | 67.7 | 55.5 | 1732 |
| CASA-Helium1-VL-2B | 73.4 | 83.7 | 48.6 | 723 | 71.0 | 58.3 | 63.3 | 54.6 | 1572 |
| mPLUG-Owl3 8B | 59.2† | 55.9† | 36.8† | 527† | 69.0 | 63.9† | 73.4 | 65.0 | 1940† |
| mPLUG-Owl3 2B | 48.5† | 48.2† | 28.1† | 450† | 62.6 | 56.9† | 62.6 | 61.0 | 1551† |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Helium1-VL-2B compared to a recent cross-attention baseline and to our token-insertion model Helium1-VL-2B trained in the same conditions. CASA outperforms current SoTA cross-attention-based VLMs, narrowing the gap to insertion-based approaches.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 84.0 | 93.6 | 77.1 | 797 | 79.3 | 62.2† | 81.6 | 61.0† | 2249† |
| CASA-Qwen2_5-VL-3B | 82.4 | 88.9 | 59.6 | 790 | 77.4 | 62.5 | 75.1 | 59.4 | 1918 |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Qwen2_5-VL-3B, adapted from frozen Qwen2.5-VL. CASA reaches performance close to the original insertion-based model while training only the CASA layers and the last blocks of the image encoder.
The present code is provided under the MIT license.
The weights for the models are released under the CC-BY-NC-SA 4.0 license.
Some of the model weights include weights from the Qwen2.5-VL-3B model (namely, the image encoder for CASA-Helium1-VL-2B and Helium1-VL-2B, as well as the VLM backbone for CASA-Qwen2_5-VL-3B and CASA-Qwen2_5-VL-3B-LiveCC). Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
If you use CASA in your research, please cite our work:
@article{kyutai2025casa,
author = {Moritz B\"ohle and Am\'elie Royer and Juliette Marrie and Edouard Grave and Patrick P\'erez},
year = {2025},
title = {CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion},
journal = {ArXiv},
url = {https://arxiv.org/abs/2512.19535}
}



