This repository powers the build experience, showcasing a video search and summarization agent built with NVIDIA NIM microservices.
Insightful, accurate, and interactive video analytics AI agents enable a range of industries to make better decisions faster. These AI agents are given tasks through natural language and can perform complex operations like video summarization and visual question-answering, unlocking entirely new application possibilities. The NVIDIA AI Blueprint makes it easy to get started building and customizing video analytics AI agents for video search and summarization — all powered by generative AI, vision language models (VLMs) like Cosmos Nemotron VLMs, large language models (LLMs) like Llama Nemotron LLMs, NVIDIA NeMo Retriever, and NVIDIA NIM.
NIM microservices: The blueprint uses the VLM, LLM, and NVIDIA NeMo Retriever models described in the overview above.
Ingestion Pipeline:
The ingestion pipeline decodes the video segments (chunks) generated by the stream handler, selects frames, and uses a vision-language model (VLM) together with a caption prompt to generate a detailed caption for each chunk. These dense captions are then indexed into vector and graph databases for use in the Context-Aware Retrieval-Augmented Generation (CA-RAG) workflow.
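A minimal sketch of this flow is shown below. The `Chunk` structure, the `vlm`, `vector_db`, and `graph_db` clients, and the default caption prompt are hypothetical placeholders for illustration only, not the blueprint's actual interfaces:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    start_s: float       # chunk start time in the source video (seconds)
    end_s: float         # chunk end time (seconds)
    frames: List[bytes]  # frames selected from the decoded chunk

def ingest(video_chunks: List[Chunk], vlm, vector_db, graph_db,
           caption_prompt: str = "Describe the events in these frames in detail."):
    """Caption each chunk with a VLM and index the captions for CA-RAG.

    `vlm`, `vector_db`, and `graph_db` are hypothetical client objects;
    the real blueprint wires these steps together through its microservices.
    """
    for chunk in video_chunks:
        # Dense caption generated from the sampled frames plus the caption prompt.
        caption = vlm.generate(frames=chunk.frames, prompt=caption_prompt)

        metadata = {"start_s": chunk.start_s, "end_s": chunk.end_s}

        # Index into the vector database for semantic retrieval ...
        vector_db.add(text=caption, metadata=metadata)
        # ... and into the graph database for temporal / multi-hop reasoning.
        graph_db.add_event(caption=caption, **metadata)
```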
Context Manager:
The context manager efficiently incorporates tools, a vision-language model (VLM) and a large language model (LLM), using them as required. Its key functions include a summary generator, an answer generator, and an alert handler, which are used for summary generation, Q&A, and alert management. In addition, the context manager maintains its working context by making efficient use of both short-term memory, such as chat history, and long-term memory resources, such as vector and graph databases, as needed.
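The following is a minimal sketch of how such a context manager might route requests. The class, attribute, and method names are assumptions for illustration, not the blueprint's actual components:

```python
class ContextManager:
    """Routes summarization, Q&A, and alert requests to the right tool,
    drawing on short-term memory (chat history) and long-term memory
    (vector and graph databases) as needed. All attributes here are
    hypothetical stand-ins for the blueprint's internal components."""

    def __init__(self, vlm, llm, vector_db, graph_db):
        self.vlm = vlm
        self.llm = llm
        self.vector_db = vector_db
        self.graph_db = graph_db
        self.chat_history = []  # short-term memory

    def summarize(self, captions):
        # Summary generator: condense per-chunk captions with the LLM.
        return self.llm.generate(f"Summarize these video captions:\n{captions}")

    def answer(self, question):
        # Answer generator: pull long-term context, then answer with chat history.
        context = self.vector_db.search(question)
        self.chat_history.append({"role": "user", "content": question})
        reply = self.llm.generate(f"Context: {context}\nQuestion: {question}")
        self.chat_history.append({"role": "assistant", "content": reply})
        return reply

    def check_alerts(self, caption, alert_rules):
        # Alert handler: ask the LLM whether a caption matches any alert rule.
        verdict = self.llm.generate(
            f"Rules: {alert_rules}\nEvent: {caption}\n"
            "Does this event trigger any rule? Answer yes or no."
        )
        return "yes" in verdict.lower()
```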
CA-RAG module:
The Context-Aware Retrieval-Augmented Generation (CA-RAG) module leverages both Vector RAG and Graph-RAG as the primary sources for video understanding. During the Q&A workflow, the CA-RAG module extracts relevant context from the vector database and graph database to enhance temporal reasoning, anomaly detection, multi-hop reasoning, and scalability, thereby offering deeper contextual understanding and efficient management of extensive video data.
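A minimal sketch of the Q&A retrieval step follows, assuming hypothetical `vector_db`, `graph_db`, and `llm` clients; the real CA-RAG module performs the equivalent steps through its own configuration and services:

```python
def answer_question(question: str, vector_db, graph_db, llm, top_k: int = 5) -> str:
    """Combine Vector RAG and Graph-RAG context before calling the LLM.

    The retrieval calls below are placeholders, not the CA-RAG module's
    actual API.
    """
    # Vector RAG: captions semantically similar to the question.
    vector_hits = vector_db.search(question, top_k=top_k)

    # Graph-RAG: entities and temporal relations relevant to the question,
    # useful for multi-hop and temporal reasoning.
    graph_hits = graph_db.query(question)

    context = "\n".join([*vector_hits, *graph_hits])
    prompt = (
        "Answer the question using only the video context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```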
- Apply for the Early Access Program for the Video Search and Summarization NVIDIA AI Blueprint.
- Log in to the NGC Portal with the same account you used to apply for early access.
- Follow the steps here to obtain an NGC API Key.
This API key (NVIDIA_API_KEY) will be used to pull the blueprint container and the other models used as part of the blueprint.
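The key is typically provided as an environment variable before running the deployment steps. A minimal sketch of checking for it from Python (the variable name comes from above; everything else is illustrative):

```python
import os

# NVIDIA_API_KEY must be set so the deployment steps can authenticate
# against NGC to pull the blueprint container and model artifacts.
api_key = os.environ.get("NVIDIA_API_KEY")
if not api_key:
    raise RuntimeError("Set NVIDIA_API_KEY to your NGC API key before deploying.")
```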
The following NVIDIA GPUs are supported:
- 8 x H100 (80 GB)
- 4 x H100 (80 GB) (requires a Helm chart override)
- 8 x A100 (80 GB)
- 8 x L40S (48 GB)
500+ GB system memory
The following NVIDIA GPUs are supported if remote endpoints are being used:
- All models remote: A6000, L40S, A100 (40 GB)
- Local VLM: A100 (80 GB), H100, H200 (use remote deployment)
Follow the notebook in the deploy directory to complete all prerequisites and deploy the blueprint using a Brev Launchable on an 8 x L40S Crusoe instance.
- deploy/1_Deploy_VSS_docker_Crusoe.ipynb: This notebook is tailored specifically for the Crusoe CSP, which uses ephemeral storage.
There are also three Docker Compose deployment options, covering local and remote deployments.
- Ubuntu 22.04
- NVIDIA driver 535.161.08 (Recommended minimum version)
- CUDA 12.2+ (CUDA driver installed with NVIDIA driver)
- Docker Compose v2.32.4
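Once a Docker Compose deployment is up, you can exercise the blueprint through its HTTP API. The host, port, endpoint path, and payload below are purely illustrative assumptions; refer to the blueprint documentation for the actual API:

```python
import requests

# Illustrative only: the URL, endpoint path, and request/response shapes
# are hypothetical placeholders, not the blueprint's documented API.
VSS_URL = "http://localhost:8000"

def summarize_video(video_id: str) -> str:
    response = requests.post(
        f"{VSS_URL}/summarize",
        json={"video_id": video_id},
        timeout=600,
    )
    response.raise_for_status()
    return response.json().get("summary", "")
```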
Once approved for Early Access, the Members page will contain links to 'Download helm chart from NGC' and 'Documentation'. Follow the guide to deploy the blueprint with the Helm chart.
- Ubuntu 22.04
- NVIDIA driver 535.161.08 (Recommended minimum version)
- CUDA 12.2+ (CUDA driver installed with NVIDIA driver)
- Kubernetes v1.31.2
- NVIDIA GPU Operator v23.9
- Helm v3.x
| CVE | Description |
|---|---|
| CVE-2024-11393 | Impacts the transformers v4.47.0 Python package via Hugging Face Transformers MaskFormer model deserialization, allowing remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability: the target must visit a malicious page or open a malicious file. This does not affect VSS since the MaskFormer model is not used in VSS. |
| CVE-2024-11392 | Impacts the transformers v4.47.0 Python package via Hugging Face Transformers MobileViTV2 model deserialization, allowing remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability: the target must visit a malicious page or open a malicious file. This does not affect VSS since the MobileViTV2 model is not used in VSS. |
| CVE-2024-11394 | Impacts the transformers v4.47.0 Python package via Hugging Face Transformers Trax model deserialization, allowing remote attackers to execute arbitrary code. User interaction is required to exploit this vulnerability: the target must visit a malicious page or open a malicious file. This does not affect VSS since the Trax model is not used in VSS. |
The code in this repository is licensed under the Apache License, Version 2.0.
The software and materials through the Early Access program are governed by the NVIDIA Software and Model Evaluation License Agreement.