This sample demonstrates an edge AI alerting pipeline using Vision-Language Models (VLMs).
It shows how to:
- Download a VLM from Hugging Face
- Convert it to OpenVINO IR using `optimum-cli`
- Run inference inside a DL Streamer pipeline
- Generate structured JSON alerts per processed frame
- Produce MP4 output
VLMs can help accurately detect rare or contextual events using natural language prompts, for example a police car in a traffic video. This enables alerting on questions such as:
- Is there a police car?
- Is there smoke or fire?
- Is a person lying on the ground?
Any image-text-to-text model supported by `optimum-intel` can be used. Smaller models (1B–4B parameters) are recommended for edge deployment, for example `OpenGVLab/InternVL3_5-2B`.
The script runs:

```sh
optimum-cli export openvino \
  --model <model_id> \
  --task image-text-to-text \
  --trust-remote-code \
  <output_dir>
```
Exported artifacts are stored under models/<ModelName>/.
The export runs once and is cached. To skip export, pass --model-path directly.
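The run-once-and-cache behavior can be sketched as a small helper: skip the export when the output directory already exists, otherwise invoke `optimum-cli`. This is a hypothetical illustration (the function name and directory layout are assumptions), not the sample's actual code.

```python
import subprocess
from pathlib import Path

def ensure_exported(model_id: str, models_dir: str = "models") -> Path:
    """Export the model with optimum-cli unless a cached copy exists.

    Hypothetical helper sketching the caching behavior described above;
    the sample's real logic may differ.
    """
    out_dir = Path(models_dir) / model_id.split("/")[-1]
    if out_dir.exists():
        # Cached from a previous run: skip the (slow) export step.
        return out_dir
    cmd = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--task", "image-text-to-text",
        "--trust-remote-code",
        str(out_dir),
    ]
    subprocess.run(cmd, check=True)
    return out_dir
```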
As with the model, provide either:
- `--video-path` for a local file
- `--video-url` to download it automatically
Downloaded videos are cached under videos/.
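The download-and-cache step could look like the following sketch: derive the destination filename from the URL and only download when it is not already cached. The helper name is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

def cached_video_path(video_url: str, videos_dir: str = "videos") -> Path:
    """Download a video once and reuse the cached copy on later runs.

    Hypothetical helper; assumes a direct-download URL whose final path
    segment is the filename.
    """
    dest = Path(videos_dir) / Path(urlparse(video_url).path).name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(video_url, dest)  # fetch only on the first run
    return dest
```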
The pipeline is built dynamically in Python using `Gst.parse_launch`.
```mermaid
graph LR
A[filesrc] --> B[decodebin3]
B --> C[gvagenai]
C --> D[gvametapublish]
D --> E[gvafpscounter]
E --> F[gvawatermark]
F --> G["encode (vah264enc + h264parse + mp4mux)"]
G --> H[filesink]
```
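A minimal sketch of how such a launch string could be assembled for `Gst.parse_launch`. The property names on `gvagenai` (`model-path`, `prompt`, `device`, `frame-rate`) are assumptions here, not verified; check `gst-inspect-1.0 gvagenai` for the element's real properties.

```python
def build_pipeline(video_path: str, model_path: str, prompt: str,
                   output_path: str, frame_rate: float = 1.0,
                   device: str = "GPU") -> str:
    """Assemble a GStreamer launch string matching the diagram above.

    gvagenai property names are illustrative assumptions; the element
    chain itself mirrors the documented pipeline.
    """
    return (
        f"filesrc location={video_path} ! decodebin3 ! "
        f'gvagenai model-path={model_path} prompt="{prompt}" '
        f"device={device} frame-rate={frame_rate} ! "
        f"gvametapublish ! gvafpscounter ! gvawatermark ! "
        f"vah264enc ! h264parse ! mp4mux ! "
        f"filesink location={output_path}"
    )
```

The resulting string would then be passed to `Gst.parse_launch` to instantiate the pipeline.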
- Create and activate a virtual environment:

  ```sh
  cd samples/gstreamer/python/vlm_alerts
  python3 -m venv .vlm-venv
  source .vlm-venv/bin/activate
  ```

- Install dependencies:

  ```sh
  curl -LO https://raw.githubusercontent.com/openvinotoolkit/openvino.genai/refs/heads/releases/2026/0/samples/export-requirements.txt
  pip install -r export-requirements.txt PyGObject==3.50.0
  ```
A DL Streamer build that includes the `gvagenai` element is required.
Required arguments:
- `--prompt`
- `--video-path` or `--video-url`
- `--model-id` or `--model-path`
Example:

```sh
python3 vlm_alerts.py \
  --video-url https://videos.pexels.com/video-files/2103099/2103099-hd_1280_720_60fps.mp4 \
  --model-id OpenGVLab/InternVL3_5-2B \
  --prompt "Is there a police car? Answer yes or no."
```
Optional arguments:
| Argument | Default | Description |
|---|---|---|
| `--device` | `GPU` | Inference device |
| `--max-tokens` | `20` | Maximum tokens in the model response |
| `--frame-rate` | `1.0` | Frames per second passed to `gvagenai` |
| `--videos-dir` | `./videos` | Directory for downloaded videos |
| `--models-dir` | `./models` | Directory for exported models |
| `--results-dir` | `./results` | Directory for output files |
Outputs are written to:
- `results/<ModelName>-<video_stem>.jsonl`
- `results/<ModelName>-<video_stem>.mp4`
The .jsonl file contains one model response per processed frame and can be used to trigger downstream alerting logic.
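Downstream alerting logic could scan the `.jsonl` output for affirmative answers. The exact record schema comes from `gvametapublish`, so this sketch searches the whole serialized record rather than assuming a field name.

```python
import json

def frames_with_alert(jsonl_text: str, trigger: str = "yes") -> list[int]:
    """Return indices of JSONL records whose content contains the trigger word.

    Schema-agnostic on purpose: the field layout produced by
    gvametapublish is not assumed here.
    """
    hits = []
    for i, line in enumerate(jsonl_text.splitlines()):
        if not line.strip():
            continue
        record = json.loads(line)
        # Search the full serialized record, case-insensitively.
        if trigger in json.dumps(record).lower():
            hits.append(i)
    return hits
```

For the example prompt above ("Answer yes or no."), the default trigger flags frames where the model answered affirmatively.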
To display all available arguments and defaults:

```sh
python3 vlm_alerts.py --help
```