
Llama-3.2-90B-Vision-Instruct

Advanced image-reasoning capabilities for visual-understanding and agentic apps.
Context: 128k input · 4k output
Training date: Undisclosed


The Llama 3.2-Vision collection of multimodal large language models (LLMs) comprises pretrained and instruction-tuned image-reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

Model Developer: Meta
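
Since the model takes text plus images in and returns text, a typical request pairs an image with a prompt. The following is a minimal sketch assuming the azure-ai-inference Python SDK and the GitHub Models inference endpoint; the endpoint URL, the GITHUB_TOKEN environment variable, the model identifier, and the example file name are assumptions based on common GitHub Models usage, not details confirmed by this page.

```python
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    ImageContentItem,
    ImageUrl,
    TextContentItem,
    UserMessage,
)
from azure.core.credentials import AzureKeyCredential

# Assumed GitHub Models endpoint; authenticates with a GitHub token.
client = ChatCompletionsClient(
    endpoint="https://models.inference.ai.azure.com",
    credential=AzureKeyCredential(os.environ["GITHUB_TOKEN"]),
)

response = client.complete(
    model="Llama-3.2-90B-Vision-Instruct",  # assumed model identifier
    messages=[
        UserMessage(
            content=[
                TextContentItem(text="Describe what is happening in this image."),
                # Load a local image file; the format must match the file contents.
                ImageContentItem(
                    image_url=ImageUrl.load(image_file="photo.jpg", image_format="jpeg")
                ),
            ]
        )
    ],
    max_tokens=1024,  # the model's output is capped at 4k tokens
)
print(response.choices[0].message.content)
```

Note that, per the supported-languages section below, prompts that include images should be in English.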

Model Architecture

Llama 3.2-Vision is built on top of the Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
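
To make the adapter design concrete, here is a minimal PyTorch sketch of a single gated cross-attention layer in which text hidden states attend to image-encoder features. The class name, dimensions, gating scheme, and layer placement are illustrative assumptions for exposition, not Meta's actual implementation.

```python
import torch
import torch.nn as nn


class VisionCrossAttentionLayer(nn.Module):
    """Illustrative sketch of one cross-attention adapter layer.

    Text hidden states act as queries over projected image-encoder
    features; the attended result is added back to the text stream
    through a learned gate. All names and dimensions are hypothetical.
    """

    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_vision: int = 1280):
        super().__init__()
        # Project image features into the LLM's hidden width.
        self.vision_proj = nn.Linear(d_vision, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate initialized to zero so the untrained adapter leaves
        # the pre-trained text model's behavior unchanged.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model)
        # image_feats: (batch, image_tokens, d_vision)
        kv = self.vision_proj(image_feats)
        attn_out, _ = self.cross_attn(query=text_hidden, key=kv, value=kv)
        return text_hidden + torch.tanh(self.gate) * attn_out
```

Keeping the cross-attention layers separate from the frozen language model is what lets the text-only capabilities of Llama 3.1 carry over unchanged.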

| Model | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.2-Vision 11B | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision 90B | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |

Supported Languages: For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note that for image+text applications, English is the only supported language.

Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.

Training Data

Overview: Llama 3.2-Vision was pretrained on 6B (image, text) pairs. The instruction tuning data includes publicly available vision instruction datasets, as well as over 3M synthetically generated examples.

Data Freshness: The pretraining data has a cutoff of December 2023.
