The VLM submodule combines the abilities of vision and language models to handle both image and text input on **NXP i.MX9 applications processors**.

    # Clone repository
    git clone --single-branch -b release/v3.0 https://github.com/nxp-appcodehub/dm-eiq-genai-flow-demonstrator
    cd vlm
    ./install.sh

Installation Warning:
The "Transformers" Python package has a transitive dependency on the "Pygments 2.19.2" package, which has a known vulnerability (CVE-2026-4539) with no fix available at the time of this release. Please verify fix availability before integrating this dependency into your product.
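As a starting point for that verification, the following minimal sketch reports which Pygments version is installed in the current Python environment. It uses only the standard library; the helper name `installed_version` is hypothetical and not part of this repository, and the script only compares version strings, it does not query any vulnerability database.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    affected = "2.19.2"  # version named in the installation warning
    found = installed_version("Pygments")
    if found is None:
        print("Pygments is not installed in this environment")
    elif found == affected:
        print(f"Pygments {found} matches the version named in the warning")
    else:
        print(f"Pygments {found} is installed; check its advisory status")
```

Run it inside the same environment that `install.sh` set up, since a system-wide Python may report a different version.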
Command to run the VLM and GUI:

    # Run VLM
    ./launch.sh

This runs the chat_interface and the main vlm process. The first run takes longer because the models have to be downloaded.
- `-m`, `--model`: Specifies the VLM to use. Available models are:
  - `smolvlm-256M`
  - `smolvlm-500M`
- `-im`, `--input_image`: Path to the image to caption. The default images (delivery and industry) are in `test/data`.
- `-p`, `--precision`: Precision of the model:
  - `fp32`
  - `q8`

  You can choose which parts of the model run in fp32 or q8 by changing `config.py`.
- `-g`: Use GUI. Default True.
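To make the option set concrete, here is a minimal `argparse` sketch that mirrors the flags listed above. It is an illustration only: how `launch.sh` and the `vlm` module actually parse their arguments is not shown in this README, the function name `build_parser` is hypothetical, and the defaults for `--model`, `--input_image`, and `--precision` are left unset because they are not documented here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented launch options."""
    parser = argparse.ArgumentParser(prog="launch.sh")
    parser.add_argument(
        "-m", "--model",
        choices=["smolvlm-256M", "smolvlm-500M"],
        help="VLM to use",
    )
    parser.add_argument(
        "-im", "--input_image",
        help="Path to the image to caption",
    )
    parser.add_argument(
        "-p", "--precision",
        choices=["fp32", "q8"],
        help="Precision of the model",
    )
    # Assumption: -g is modelled as a boolean flag whose documented default
    # is True, so passing it does not change the value in this sketch. How
    # the real script turns the GUI off is not documented here.
    parser.add_argument(
        "-g", dest="gui", action="store_true", default=True,
        help="Use GUI (documented default: True)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-m", "smolvlm-500M", "-im", "path/to/your/image/image.png", "-p", "q8", "-g"]
    )
    print(args.model, args.input_image, args.precision, args.gui)
```

Restricting `--model` and `--precision` with `choices` makes a mistyped value fail immediately with a usage message instead of surfacing later, during model download.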
Example:

    ./launch.sh -m smolvlm-500M -im path/to/your/image/image.png -p q8 -g

Help:

    ./launch.sh --help

It is possible to run the code without the GUI:

    python3 -m vlm

| i.MX95 | Precision | Vision Encoder | Decoder (TTFT) | Decoder |
|---|---|---|---|---|
| SmolVLM2-256M | FP32 | 6.66s | 0.84s | 0.13s - 0.16s |
| SmolVLM2-256M | INT8 | 3.31s | 0.48s | 0.08s - 0.09s |
| SmolVLM2-500M | FP32 | 6.76s | 1.98s | 0.21s - 0.25s |
| SmolVLM2-500M | INT8 | 3.34s | 0.81s | 0.12s - 0.19s |
SmolVLM2-256M and SmolVLM2-500M share the same vision encoder, so their vision encoder timings are essentially the same.
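The gain from INT8 quantization can be read off the table as the ratio of FP32 time to INT8 time. The short sketch below computes that ratio for the vision encoder and time to first token (TTFT) columns. The timings are copied from the table; the names `TIMINGS` and `speedup` are illustrative and not part of the repository.

```python
# Timings in seconds, copied from the i.MX95 table.
TIMINGS = {
    "SmolVLM2-256M": {
        "FP32": {"vision_encoder": 6.66, "ttft": 0.84},
        "INT8": {"vision_encoder": 3.31, "ttft": 0.48},
    },
    "SmolVLM2-500M": {
        "FP32": {"vision_encoder": 6.76, "ttft": 1.98},
        "INT8": {"vision_encoder": 3.34, "ttft": 0.81},
    },
}


def speedup(model, stage):
    """Ratio of FP32 time to INT8 time for one model and stage."""
    timings = TIMINGS[model]
    return timings["FP32"][stage] / timings["INT8"][stage]


for model in TIMINGS:
    for stage in ("vision_encoder", "ttft"):
        print(f"{model} {stage}: {speedup(model, stage):.2f}x")
```

With these figures, INT8 roughly halves the vision encoder time for both models (about 2.0x) and reduces TTFT by about 1.75x for the 256M model and about 2.44x for the 500M model.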
