This sample demonstrates a Large Vision Model (CLIP) pipeline constructed with the `gst-launch-1.0` command-line utility. It extracts image embeddings (CLS tokens) for each frame using the Vision Transformer.
The sample utilizes the GStreamer command-line tool `gst-launch-1.0`, which can build and run a GStreamer pipeline described in a string format.
The string contains a list of GStreamer elements separated by an exclamation mark `!`; each element may have properties specified in the format `property=value`.
This sample builds a GStreamer pipeline of the following elements:
- `filesrc`, `urisourcebin`, or `v4l2src` for input from file/URL/web camera
- `decodebin3` for video decoding
- `videoconvert` for converting video frames into different color formats
- `videoscale` for scaling video frames
- `vapostproc` for post-processing (used in the GPU pipeline)
- `gvainference` for running inference using the CLIP Vision Transformer model
- `gvametaconvert` for converting metadata to JSON format
- `gvametapublish` for publishing metadata to a file
- `gvafpscounter` for measuring FPS (used in FPS mode)
- `fakesink` for discarding the output
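Chained together, the elements above form a pipeline string like the sketch below. This is illustrative only: the model path, input URI, and property values are assumptions, not the script's literal output.

```shell
# Sketch of a CPU pipeline string; the model XML path and URI are hypothetical.
MODELS_PATH="${MODELS_PATH:-$HOME/models}"
MODEL_XML="$MODELS_PATH/public/clip-vit-large-patch14/FP32/clip-vit-large-patch14.xml"

PIPELINE="urisourcebin uri=https://example.com/video.mp4 ! decodebin3 ! \
videoconvert ! videoscale ! \
gvainference model=$MODEL_XML device=CPU ! \
gvametaconvert format=json ! \
gvametapublish method=file file-path=output.json ! \
fakesink sync=false"

echo "gst-launch-1.0 $PIPELINE"
```

In the GPU pipeline, `vapostproc` would replace the software `videoconvert`/`videoscale` stages.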
The sample uses the `clip-vit-large-patch14`, `clip-vit-base-patch16`, or `clip-vit-base-patch32` model. The necessary conversion to the OpenVINO™ format is performed by the `download_public_models.sh` script located in the `samples` directory.
```sh
export MODELS_PATH="$HOME"/models
../../../download_public_models.sh clip-vit-large-patch14
```
```sh
./generate_frame_embeddings.sh [INPUT] [DEVICE] [OUTPUT] [MODEL]
```

The sample takes four optional command-line parameters:
- `[INPUT]` to specify the input source.
  The input could be:
  - a local video file
  - a web camera device (e.g., `/dev/video0`)
  - an RTSP camera (URL starting with `rtsp://`) or other streaming source (e.g., URL starting with `http://`)

  If the parameter is not specified, the sample by default streams a video example from an HTTPS link (utilizing the `urisourcebin` element), so it requires an internet connection.
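Mapping the `[INPUT]` value to the right source element can be sketched with a small helper. The `pick_source` function is hypothetical, not part of the sample script; it only illustrates the selection logic described above.

```shell
# Hypothetical helper: choose a GStreamer source element for the given input.
pick_source() {
  local input="$1"
  case "$input" in
    /dev/video*)                 echo "v4l2src device=$input" ;;   # web camera device
    rtsp://*|http://*|https://*) echo "urisourcebin uri=$input" ;; # network stream
    *)                           echo "filesrc location=$input" ;; # local video file
  esac
}

pick_source /dev/video0        # → v4l2src device=/dev/video0
pick_source rtsp://cam/stream  # → urisourcebin uri=rtsp://cam/stream
pick_source ./video.mp4        # → filesrc location=./video.mp4
```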
- `[DEVICE]` to specify the device for inference.
  You can choose either `CPU` or `GPU`.
- `[OUTPUT]` to choose between file output mode and FPS throughput mode:
  - `json` - output to a JSON file (default)
  - `fps` - FPS only
- `[MODEL]` to specify the model for inference:
  - `clip-vit-large-patch14` (default)
  - `clip-vit-base-patch16`
  - `clip-vit-base-patch32`
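Because all four parameters are positional and optional, defaults can be applied with standard shell parameter expansion. The sketch below illustrates the idea; the default `INPUT` URL and `DEVICE` value are assumptions, not the sample's exact code.

```shell
# Sketch: apply documented defaults to positional parameters.
apply_defaults() {
  INPUT="${1:-https://example.com/sample.mp4}"  # hypothetical default URL
  DEVICE="${2:-CPU}"                            # assumed default device
  OUTPUT="${3:-json}"                           # json output is the documented default
  MODEL="${4:-clip-vit-large-patch14}"          # documented default model
}

apply_defaults /dev/video0 GPU fps
echo "INPUT=$INPUT DEVICE=$DEVICE OUTPUT=$OUTPUT MODEL=$MODEL"
```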
The sample:

- prints the full `gst-launch-1.0` command line to the console
- starts the command and either publishes metadata to a file or prints out FPS if you set `OUTPUT=fps`
To run the sample with default values:

```sh
./generate_frame_embeddings.sh
```
To specify a source file, device, and output:

```sh
./generate_frame_embeddings.sh /path/to/video.mp4 GPU fps
```
To specify a URL, device, output, and model:

```sh
./generate_frame_embeddings.sh https://example.com/video.mp4 CPU json clip-vit-large-patch14
```
To specify a video device, device, output, and model:

```sh
./generate_frame_embeddings.sh /dev/video0 CPU fps clip-vit-base-patch16
```