A collection of notebooks demonstrating the capabilities of NVIDIA Nemotron Nano 2 VL, a 12B-parameter model that unifies visual and textual understanding for advanced multimodal agentic workflows.
These notebooks show how to use NVIDIA Nemotron Nano 2 VL to build applications that can see, read, and reason across diverse media. The model can extract, understand, and act on information from text, images, and videos, making it a powerful tool for next-generation AI agents.
- VLM (NIM): `nvidia/nemotron-nano-2-vl` (available soon on NVIDIA AI Endpoints)
- VLM (Hugging Face): `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8`
- VLM (Hugging Face): `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16`
- Agentic Multimodal Reasoning: Unifies visual and textual understanding to extract, reason, and act on information.
- Versatile Inputs: Natively handles text prompts, image URLs, and video URLs in a single request.
- Controllable Reasoning: Use the `/think` system prompt to enable detailed reasoning steps and `/no_think` for direct answers.
- Multi-Image Understanding: Capable of reasoning across multiple images, such as different pages of a PDF, to answer complex questions.
- Advanced Video Analysis: Performs dense captioning and summarization of video content.
- Efficient Video Sampling (EVS): Automatically prunes redundant video frames to enable efficient long-context reasoning.
- Hybrid Mamba-Transformer Architecture: Delivers high accuracy with superior throughput and lower latency.
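The features above can be sketched as a single multimodal request. The snippet below builds an OpenAI-style chat payload that mixes a text prompt with an image URL and sets the reasoning toggle via the system prompt; the endpoint URL is omitted and the exact model identifier and message schema are assumptions based on common OpenAI-compatible VLM APIs, so check the NIM documentation before use.

```python
import json

def build_request(prompt: str, image_url: str, think: bool = False) -> dict:
    """Compose one multimodal chat request mixing text and an image URL.

    The "/think" vs "/no_think" system prompt follows the reasoning
    toggle described above; the model name is an assumed placeholder.
    """
    system = "/think" if think else "/no_think"  # reasoning on/off toggle
    return {
        "model": "nvidia/nemotron-nano-2-vl",  # assumed model identifier
        "messages": [
            {"role": "system", "content": system},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    }

# Example: a direct-answer (no chain-of-thought) request about an image.
payload = build_request("Describe this chart.", "https://example.com/chart.png")
print(json.dumps(payload, indent=2))
```

The same `content` list can carry several `image_url` entries (e.g., multiple PDF pages) or a video URL in a single request, per the versatile-input and multi-image features listed above.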
- NVIDIA API key (get one here)
- GPU recommended for local deployment (e.g., single H100)