Problem
Currently, the `visual_language_chat` sample only supports static images loaded from a file path.
While this demonstrates the API, it does not showcase the real-time performance of OpenVINO GenAI on edge devices. Developers building visual agents or robots need a reference implementation for handling continuous video streams without blocking the inference loop.
Proposed Solution
I propose adding a new C++ sample: `live_vlm_chat`.
The sample integrates OpenCV to capture a live webcam feed and lets the user interact with the VLM (e.g., LLaVA/Mistral) in real time.
Key Features of the proposed sample:
- Multi-threaded Architecture: Decouples the UI/Camera loop (Main Thread) from the Inference loop (Worker Thread) to ensure the video feed never freezes while the LLM is "thinking."
- Thread Safety: Uses `std::mutex` and `std::condition_variable` to safely pass frames between threads (see the sketch after this list).
- MSVC Compatibility: Avoids the C3889 build error on Windows by explicitly typing the tensor argument (`std::vector<ov::Tensor>`) passed to `ov::genai::images`.
- Interactive UI: Allows users to "snap" a frame and chat about it while the camera continues running.
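To make the threading model concrete, here is a minimal sketch of the handoff between the two loops. It is written against OpenCV's `cv::VideoCapture`/`cv::imshow` and OpenVINO GenAI's `ov::genai::VLMPipeline` with the `ov::genai::images` property; the `mat_to_tensor` helper, the u8 NHWC tensor layout, the `"model_dir"` path, and the fixed prompt are illustrative assumptions, not the actual sample code:

```cpp
#include <condition_variable>
#include <cstring>
#include <iostream>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <vector>

#include <opencv2/opencv.hpp>
#include "openvino/genai/visual_language/pipeline.hpp"

// Shared state between the camera loop (main thread) and the inference worker.
std::mutex frame_mutex;
std::condition_variable frame_ready;
std::optional<cv::Mat> snapped_frame;  // set when the user "snaps" a frame
bool shutting_down = false;

// Hypothetical helper: packs a BGR frame into a u8 NHWC tensor {1, H, W, 3}.
// The exact layout the model preprocessor expects is an assumption here.
ov::Tensor mat_to_tensor(const cv::Mat& bgr) {
    cv::Mat rgb;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);
    ov::Tensor t(ov::element::u8,
                 {1, static_cast<size_t>(rgb.rows), static_cast<size_t>(rgb.cols), 3});
    std::memcpy(t.data<uint8_t>(), rgb.data, rgb.total() * rgb.elemSize());
    return t;
}

// Worker thread: blocks until a frame is snapped, then runs VLM inference.
void inference_loop(ov::genai::VLMPipeline& pipe) {
    for (;;) {
        cv::Mat frame;
        {
            std::unique_lock<std::mutex> lock(frame_mutex);
            frame_ready.wait(lock, [] { return snapped_frame.has_value() || shutting_down; });
            if (shutting_down)
                return;
            frame = std::move(*snapped_frame);
            snapped_frame.reset();
        }  // lock released: the camera loop keeps rendering while we infer

        // Explicitly typing the vector avoids brace-init ambiguity on MSVC.
        std::vector<ov::Tensor> images{mat_to_tensor(frame)};
        auto result = pipe.generate("Describe what you see.", ov::genai::images(images));
        std::cout << result.texts[0] << '\n';
    }
}

int main() {
    ov::genai::VLMPipeline pipe("model_dir", "CPU");  // path and device are placeholders
    std::thread worker(inference_loop, std::ref(pipe));

    cv::VideoCapture cap(0);
    cv::Mat frame;
    while (cap.read(frame)) {
        cv::imshow("live_vlm_chat", frame);
        int key = cv::waitKey(1);
        if (key == 's') {  // snap the current frame for the VLM
            std::lock_guard<std::mutex> lock(frame_mutex);
            snapped_frame = frame.clone();
            frame_ready.notify_one();
        } else if (key == 27) {  // Esc: quit
            break;
        }
    }
    {
        std::lock_guard<std::mutex> lock(frame_mutex);
        shutting_down = true;
    }
    frame_ready.notify_one();
    worker.join();
}
```

Keeping the critical section down to a single `std::move` of the frame means the UI loop's `cv::waitKey` cadence is unaffected by however long `generate` takes; the worker simply picks up the most recently snapped frame once it is free.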
Implementation Details
I have already implemented and tested this locally on Windows 11 (Intel CPU & iGPU).
- Dependencies: Adds `OpenCV` (core, highgui, videoio) as an optional dependency in `CMakeLists.txt` (a sketch of the wiring follows this list).
- File: `samples/cpp/visual_language_chat/live_vlm_chat.cpp`
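For reference, a sketch of how the optional dependency could be gated; the `openvino::genai` link target matches the existing samples, while the guard structure and message text are assumptions:

```cmake
# Hypothetical wiring: build the sample only when OpenCV is available.
find_package(OpenCV QUIET COMPONENTS core highgui videoio)
if(OpenCV_FOUND)
    add_executable(live_vlm_chat live_vlm_chat.cpp)
    target_link_libraries(live_vlm_chat PRIVATE openvino::genai ${OpenCV_LIBS})
else()
    message(STATUS "OpenCV not found: skipping the live_vlm_chat sample")
endif()
```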
I can submit a Pull Request immediately if this contribution aligns with the project's roadmap.