Vision Language Model

The VLM submodule combines vision and language models to handle both image and text input on NXP i.MX9 applications processors.


SmolVLM256M-fp32-delivery.gif


Installation

1. Clone the repository (make sure Git LFS is installed)

# Clone repository
git clone --single-branch -b release/v3.0 https://github.com/nxp-appcodehub/dm-eiq-genai-flow-demonstrator

2. Set up dependencies

cd vlm
./install.sh

Installation Warning:

The "Transformers" Python package has a transitive dependency on the "Pygments 2.19.2" package, which has a known vulnerability (CVE-2026-4539) with no fix available at the time of this release. Please verify whether a fix is available before integrating this dependency into your product.
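As a starting point for that verification, the sketch below reports which Pygments version is installed in the current Python environment and whether it matches the version named in the warning. The helper names are hypothetical and not part of this repository, and an exact version match is only a first check: it does not replace consulting the advisory itself.

```python
from importlib import metadata

# Version named in the installation warning above.
FLAGGED_PYGMENTS_VERSION = "2.19.2"


def installed_pygments_version():
    """Return the installed Pygments version string, or None if it is absent."""
    try:
        return metadata.version("Pygments")
    except metadata.PackageNotFoundError:
        return None


def is_flagged(version):
    """True only when the version equals the one named in the warning."""
    return version == FLAGGED_PYGMENTS_VERSION


if __name__ == "__main__":
    found = installed_pygments_version()
    if found is None:
        print("Pygments is not installed in this environment")
    elif is_flagged(found):
        print(f"Pygments {found} matches the version named in the warning")
    else:
        print(f"Pygments {found} differs from the flagged version; check the advisory")
```

Run it inside the same environment that install.sh populated, otherwise it inspects a different set of packages.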

Run VLM with Chat Interface GUI

Run the VLM together with the GUI:

# Run VLM
./launch.sh

This starts both the chat_interface and the main vlm process. The first run takes longer because the models have to be downloaded. launch.sh accepts the following options:

  • -m, --model
    Specifies the VLM to use. Available models are:

    • smolvlm-256M
    • smolvlm-500M
  • -im, --input_image
    Path to the image to caption.

Default images (delivery and industry) are provided in test/data.

  • -p, --precision
    Precision of the model. Available values are:
    • fp32
    • q8

You can choose which parts of the model run in fp32 and which in q8 by editing config.py.

  • -g
    Use the GUI. Default: True.

# Example
./launch.sh -m smolvlm-500M -im path/to/your/image/image.png -p q8 -g

# Help
./launch.sh --help
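When launch.sh is started from a script rather than by hand, it can help to validate the option values before invoking it. The following is a minimal sketch that assembles the same command line as the example above. The helper build_launch_command is hypothetical and not part of this repository; only the flags and values listed above are used.

```python
import shlex

# Choices copied from the option list above.
MODELS = ("smolvlm-256M", "smolvlm-500M")
PRECISIONS = ("fp32", "q8")


def build_launch_command(model, image, precision):
    """Return the argument list for ./launch.sh (hypothetical helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    # -g is passed as a bare flag, mirroring the documented example.
    return ["./launch.sh", "-m", model, "-im", image, "-p", precision, "-g"]


if __name__ == "__main__":
    cmd = build_launch_command("smolvlm-500M", "path/to/your/image/image.png", "q8")
    # Prints: ./launch.sh -m smolvlm-500M -im path/to/your/image/image.png -p q8 -g
    print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run from the vlm directory; keeping it as a list avoids shell quoting problems with image paths that contain spaces.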

Run without GUI

It is possible to run the code without the GUI:

python3 -m vlm

Performance on i.MX95 (CPU)

i.MX95         Precision  Vision Encoder  Decoder (TTFT)  Decoder
SmolVLM2-256M  FP32       6.66s           0.84s           0.13s - 0.16s
SmolVLM2-256M  INT8       3.31s           0.48s           0.08s - 0.09s
SmolVLM2-500M  FP32       6.76s           1.98s           0.21s - 0.25s
SmolVLM2-500M  INT8       3.34s           0.81s           0.12s - 0.19s

SmolVLM2-256M and SmolVLM2-500M share the same vision encoder, so their vision encoder times are nearly identical.
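To put the FP32 and INT8 rows side by side, the sketch below computes the INT8 speedup as the ratio of FP32 time to INT8 time for the two single-valued columns. The timings are copied from the table; the dictionary layout and the function name are illustrative choices, not part of the repository.

```python
# Timings in seconds, copied from the performance table above.
TIMINGS = {
    "SmolVLM2-256M": {
        "FP32": {"vision_encoder": 6.66, "decoder_ttft": 0.84},
        "INT8": {"vision_encoder": 3.31, "decoder_ttft": 0.48},
    },
    "SmolVLM2-500M": {
        "FP32": {"vision_encoder": 6.76, "decoder_ttft": 1.98},
        "INT8": {"vision_encoder": 3.34, "decoder_ttft": 0.81},
    },
}


def int8_speedup(model, stage):
    """Ratio of FP32 time to INT8 time for one stage of one model."""
    row = TIMINGS[model]
    return row["FP32"][stage] / row["INT8"][stage]


if __name__ == "__main__":
    for model in TIMINGS:
        for stage in ("vision_encoder", "decoder_ttft"):
            print(f"{model} {stage}: {int8_speedup(model, stage):.2f}x")
```

With these numbers the vision encoder runs roughly 2x faster in INT8 for both models, while the decoder TTFT gain is about 1.75x for the 256M model and about 2.44x for the 500M model.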