This sample demonstrates an edge AI alerting pipeline using Vision-Language Models (VLMs).
It shows how to:
- Download a VLM from Hugging Face
- Convert it to OpenVINO IR using `optimum-cli`
- Run inference inside a DL Streamer pipeline
- Generate structured JSON alerts per processed frame
- Produce MP4 output
VLMs can help accurately detect rare or contextual events using natural language prompts, for example a police car in a traffic video. This enables alerting on questions such as:
- Is there a police car?
- Is there smoke or fire?
- Is a person lying on the ground?
Any image-text-to-text model supported by `optimum-intel` can be used. Smaller models (1B–4B parameters) are recommended for edge deployment, for example `OpenGVLab/InternVL3_5-2B`.
The script runs:

```sh
optimum-cli export openvino \
  --model <model_id> \
  --task image-text-to-text \
  --trust-remote-code \
  <output_dir>
```
Exported artifacts are stored under models/<ModelName>/.
The export runs once and is cached. To skip export, pass --model-path directly.
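The run-once-and-cache behavior can be sketched as a small helper: skip the export when the output directory already exists, otherwise invoke `optimum-cli`. This is a hypothetical illustration (the function name and directory layout are assumptions), not the sample's actual code.

```python
import subprocess
from pathlib import Path

def ensure_exported(model_id: str, models_dir: str = "models") -> Path:
    """Export the model with optimum-cli unless a cached copy exists.

    Hypothetical helper sketching the caching behavior described above;
    the sample's real logic may differ.
    """
    out_dir = Path(models_dir) / model_id.split("/")[-1]
    if out_dir.exists():
        # Cached from a previous run: skip the (slow) export step.
        return out_dir
    cmd = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--task", "image-text-to-text",
        "--trust-remote-code",
        str(out_dir),
    ]
    subprocess.run(cmd, check=True)
    return out_dir
```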
As with the model, provide either:
- `--video-path` for a local file
- `--video-url` to download it automatically
Downloaded videos are cached under videos/.
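The download-and-cache step could look like the following sketch: derive the destination filename from the URL and only download when it is not already cached. The helper name is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def cached_video_path(video_url: str, videos_dir: str = "videos") -> Path:
    """Download a video once and reuse the cached copy on later runs.

    Hypothetical helper; assumes a direct-download URL whose final path
    segment is the filename.
    """
    dest = Path(videos_dir) / Path(urlparse(video_url).path).name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(video_url, dest)  # fetch only on the first run
    return dest
```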
The pipeline is built dynamically in Python using `Gst.parse_launch`.
```mermaid
graph LR
A[filesrc] --> B[decodebin3]
B --> C[gvagenai]
C --> D[gvametapublish]
D --> E[gvafpscounter]
E --> F[gvawatermark]
F --> G["encode (vah264enc + h264parse + mp4mux)"]
G --> H[filesink]
```
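A minimal sketch of how such a launch string could be assembled for `Gst.parse_launch`. The property names on `gvagenai` (`model-path`, `prompt`, `device`, `frame-rate`) are assumptions here, not verified; check `gst-inspect-1.0 gvagenai` for the element's real properties.

```python
def build_pipeline(video_path: str, model_path: str, prompt: str,
                   output_path: str, frame_rate: float = 1.0,
                   device: str = "GPU") -> str:
    """Assemble a GStreamer launch string matching the diagram above.

    gvagenai property names are illustrative assumptions; the element
    chain itself mirrors the documented pipeline.
    """
    return (
        f"filesrc location={video_path} ! decodebin3 ! "
        f'gvagenai model-path={model_path} prompt="{prompt}" '
        f"device={device} frame-rate={frame_rate} ! "
        f"gvametapublish ! gvafpscounter ! gvawatermark ! "
        f"vah264enc ! h264parse ! mp4mux ! "
        f"filesink location={output_path}"
    )
```

The resulting string would then be passed to `Gst.parse_launch` to instantiate the pipeline.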
- Create and activate a virtual environment:

  ```sh
  cd samples/gstreamer/python/vlm_alerts
  python3 -m venv .vlm-venv
  source .vlm-venv/bin/activate
  ```

- Install dependencies:

  ```sh
  curl -LO https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/releases/2026/0/samples/export-requirements.txt
  pip install -r export-requirements.txt PyGObject==3.50.0
  ```
A DL Streamer build that includes the `gvagenai` element is required.
Required arguments:
- `--prompt`
- `--video-path` or `--video-url`
- `--model-id` or `--model-path`
Example:

```sh
python3 vlm_alerts.py \
  --video-url https://videos.pexels.com/video-files/2103099/2103099-hd_1280_720_60fps.mp4 \
  --model-id OpenGVLab/InternVL3_5-2B \
  --prompt "Is there a police car? Answer yes or no."
```
Optional arguments:
| Argument | Default | Description |
|---|---|---|
| `--device` | `GPU` | Inference device |
| `--max-tokens` | `20` | Maximum tokens in the model response |
| `--frame-rate` | `1.0` | Frames per second passed to `gvagenai` |
| `--videos-dir` | `./videos` | Directory for downloaded videos |
| `--models-dir` | `./models` | Directory for exported models |
| `--results-dir` | `./results` | Directory for output files |
Outputs are written to:
- `results/<ModelName>-<video_stem>.jsonl`
- `results/<ModelName>-<video_stem>.mp4`
The .jsonl file contains one model response per processed frame and can be used to trigger downstream alerting logic.
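Downstream alerting logic could scan the `.jsonl` output for affirmative answers. The exact record schema comes from `gvametapublish`, so this sketch searches the whole serialized record rather than assuming a field name.

```python
import json

def frames_with_alert(jsonl_text: str, trigger: str = "yes") -> list[int]:
    """Return indices of JSONL records whose content contains the trigger word.

    Schema-agnostic on purpose: the field layout produced by
    gvametapublish is not assumed here.
    """
    hits = []
    for i, line in enumerate(jsonl_text.splitlines()):
        if not line.strip():
            continue
        record = json.loads(line)
        # Search the full serialized record, case-insensitively.
        if trigger in json.dumps(record).lower():
            hits.append(i)
    return hits
```

For the example prompt above ("Answer yes or no."), the default trigger flags frames where the model answered affirmatively.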
To display all available arguments and defaults:

```sh
python3 vlm_alerts.py --help
```