189 changes: 188 additions & 1 deletion src/arduino/app_bricks/vlm/README.md
@@ -1,3 +1,190 @@
# Vision Language Model (VLM) Brick

The Vision Language Model (VLM) Brick provides a simple Python® interface for asking a locally hosted multimodal AI model questions about images. It lets Arduino® App Lab applications send text prompts together with image file paths or image bytes, then receive generated text from the model running on the board.

## Overview

The VLM Brick is designed for applications that need visual understanding without sending camera frames to the cloud. It wraps the local model runner behind the same chat-style API used by the LLM Bricks, while adding image input support through the `images` parameter.

Use this Brick when your application needs to describe a camera frame, inspect a scene, extract visual details, or combine an image with a natural-language instruction. For example, the Smart Mirror example captures a USB camera frame, sends it to `VisionLanguageModel`, and displays a short styling response in a web UI.

The Brick uses the model configured for `arduino:vlm` in Arduino App Lab.

## Features

- **Local visual AI**: Sends prompts to a VLM running on the board through the local model service.
- **Text plus image prompts**: Accepts a text message and one or more images as local file paths or raw image bytes.
- **Synchronous responses**: Uses `chat()` when the application needs the full answer before continuing.
- **Streaming responses**: Uses `chat_stream()` when the application should display generated text as it arrives.
- **Conversation memory**: Keeps recent chat history with `with_memory()` and can persist that history with a `MessagePersistence` backend.
- **Configurable generation**: Supports system prompts, temperature, token limits, timeouts, and model overrides.
- **Advanced access**: Exposes the underlying LangChain chat model through `get_client()` for custom integrations.

## Prerequisites

- A supported board with a local VLM model runner available. The current Brick configuration supports `ventunoq`.
- A compatible VLM model downloaded and configured in Arduino App Lab.
- The `arduino:vlm` Brick added to the application from App Lab.
- Image input as a valid file path or bytes. Camera frames should be encoded as JPEG bytes before calling the Brick.

**Note:** The VLM runs locally through the board model service, so cloud inference and cloud API keys are not required for normal use.

## Code example and usage

### Analyze an Image File

This example sends a local image file and waits for the complete model response.

```python
from arduino.app_bricks.vlm import VisionLanguageModel

vlm = VisionLanguageModel(
    system_prompt="You are a concise visual assistant.",
    temperature=0.4,
    max_tokens=120,
)

response = vlm.chat(
    message="Describe the main object in this image.",
    images=["chair.jpg"],
)

print(response)
```

### Stream a Response

Use `chat_stream()` when a web UI or terminal interface should show text as the model generates it.

```python
from arduino.app_bricks.vlm import VisionLanguageModel

vlm = VisionLanguageModel()

for chunk in vlm.chat_stream(
    message="Describe the image in one short paragraph.",
    images=["chair.jpg"],
):
    print(chunk, end="", flush=True)
```

### Analyze a Camera Frame

The Smart Mirror example uses this pattern: it captures the latest camera frame as JPEG bytes and sends the bytes directly to the VLM.

```python
from arduino.app_bricks.vlm import VisionLanguageModel

vlm = VisionLanguageModel(
    system_prompt="You are a helpful visual assistant.",
    temperature=0.4,
    max_tokens=120,
)

def analyze_frame(frame_bytes: bytes) -> str:
    return vlm.chat(
        message="Describe the most important visual details in this frame.",
        images=[frame_bytes],
    ).strip()
```

### Enable Conversation Memory

Use `with_memory()` when follow-up prompts should keep recent context. For persistent memory across restarts, pass a `MessagePersistence` implementation such as `SQLMessagePersistence`.

```python
from arduino.app_bricks.cloud_llm import SQLMessagePersistence
from arduino.app_bricks.dbstorage_sqlstore import SQLStore
from arduino.app_bricks.vlm import VisionLanguageModel

db = SQLStore("vlm_memory.db")
db.start()

vlm = VisionLanguageModel(
    system_prompt="You remember relevant visual details.",
).with_memory(
    max_messages=10,
    persistence=SQLMessagePersistence(
        sql_store=db,
        thread_id="vlm-session",
    ),
)

print(vlm.chat("Remember what is in this image.", images=["chair.jpg"]))
print(vlm.chat("What object did I show you earlier?"))
```
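
To build intuition for what a `max_messages` limit means, the sketch below models a bounded history with `collections.deque`. It is an illustration only, not the Brick's implementation: the `BoundedHistory` class is hypothetical, and whether the Brick counts individual messages or user/assistant pairs is not stated here.

```python
from collections import deque


class BoundedHistory:
    """Hypothetical stand-in that keeps only the most recent messages."""

    def __init__(self, max_messages: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self._messages = deque(maxlen=max_messages)

    def add(self, role: str, text: str) -> None:
        self._messages.append((role, text))

    def as_list(self) -> list:
        return list(self._messages)


history = BoundedHistory(max_messages=4)
for i in range(6):
    history.add("user", f"message {i}")

# Only the four most recent messages remain.
print(history.as_list())
```

With a small limit, early context (such as the first image description) eventually falls out of the window, so follow-up questions about it may no longer be answerable.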

## Configuration

The Brick is initialized with the following parameters:

| Parameter | Type | Default | Description |
| :-- | :-- | :-- | :-- |
| `api_key` | `str` | `os.getenv("LOCAL_LLM_API_KEY", "api_key")` | API key value passed to the local OpenAI-compatible model service. The default placeholder is enough for normal local use. |
| `model` | `str` | App Lab configured model | Local model identifier configured for `arduino:vlm` in App Lab. |
| `system_prompt` | `str` | `""` | System-level instruction that defines the assistant behavior. |
| `temperature` | `float` \| `None` | `0.7` | Controls randomness. Lower values are more deterministic; higher values are more varied. |
| `max_tokens` | `int` | `512` | Maximum number of tokens to generate in the response. |
| `timeout` | `int` \| `None` | `None` | Maximum time in seconds to wait for a response. |
| `tools` | `list[Callable]` | `None` | Optional LangChain-compatible tool functions available to the model. |
| `**kwargs` | `dict` | `{}` | Additional keyword arguments passed to the underlying model constructor. |
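
The `api_key` default in the table is an environment lookup with a placeholder fallback. The sketch below shows how that expression resolves; `resolve_api_key` is a hypothetical helper written for illustration and is not part of the Brick.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    """Mirror the documented default: explicit value, then env var, then placeholder."""
    if explicit is not None:
        return explicit
    # Same expression as the documented default for `api_key`.
    return os.getenv("LOCAL_LLM_API_KEY", "api_key")


print(resolve_api_key())
```

If `LOCAL_LLM_API_KEY` is not set, the call returns the placeholder string `"api_key"`, which the table describes as sufficient for normal local use.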

The Brick configuration declares that `arduino:vlm` requires a local model service and a compatible model:

```yaml
id: arduino:vlm
name: Vision Language Model (VLM)
requires_services: ["arduino:genie"]
requires_model: true
model: genie:qwen2_5_vl_7b_instruct
supported_boards: ["ventunoq"]
```

## Methods

- **`chat(message, images=None)`**: Sends a prompt and optional images, then returns the complete generated response as a string.
- **`chat_stream(message, images=None)`**: Sends a prompt and optional images, then yields generated text chunks as they arrive.
- **`stop_stream()`**: Requests cancellation of the active streaming response.
- **`with_memory(max_messages=0, persistence=None)`**: Enables conversational memory for the instance. Pass `max_messages=0` to disable history.
- **`clear_memory()`**: Clears the active conversation history.
- **`get_client()`**: Returns the underlying LangChain `BaseChatModel` instance.
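
As a sketch of how `chat_stream()` and `stop_stream()` might be combined, the helper below collects chunks until a character budget is reached and then requests cancellation. It assumes that `chat_stream()` yields plain strings and that `stop_stream()` may be called from the consuming loop; neither detail is confirmed here. `stream_with_limit` and `max_chars` are hypothetical names.

```python
from typing import Any, Optional


def stream_with_limit(
    vlm: Any,
    message: str,
    images: Optional[list] = None,
    max_chars: int = 400,
) -> str:
    """Collect streamed text and stop once max_chars characters have arrived."""
    parts = []
    total = 0
    for chunk in vlm.chat_stream(message=message, images=images):
        parts.append(chunk)
        total += len(chunk)
        if total >= max_chars:
            # Ask the Brick to cancel generation, then stop consuming.
            vlm.stop_stream()
            break
    return "".join(parts)
```

The function accepts any object that provides these two methods, so it can be exercised with a simple stand-in before running it against the model on the board.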

## Image Inputs

The `images` argument accepts a list containing:

- File paths, such as `"chair.jpg"`.
- Raw image bytes, such as a JPEG frame captured from a camera.

When an image path is used, the file must exist in the application runtime environment. When bytes are used, make sure they represent an encoded image format that the model can interpret. For camera applications, convert frames to JPEG bytes before passing them to the Brick.
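
One quick way to catch raw pixel buffers before they reach the model is to check the leading bytes for a known file signature. The sketch below checks only the JPEG and PNG signatures; `looks_like_encoded_image` is a hypothetical helper, and which formats the configured model accepts beyond JPEG is not stated here.

```python
JPEG_MAGIC = b"\xff\xd8\xff"        # start-of-image marker plus next marker prefix
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"    # fixed 8-byte PNG signature


def looks_like_encoded_image(data: bytes) -> bool:
    """Return True if data starts with a JPEG or PNG file signature."""
    return data.startswith(JPEG_MAGIC) or data.startswith(PNG_MAGIC)


# A raw 640x480 RGB buffer of zeros has no file signature.
raw_pixels = bytes(640 * 480 * 3)
print(looks_like_encoded_image(raw_pixels))
```

A signature check does not prove the image decodes correctly, but it is cheap and rules out the common mistake of passing an unencoded frame buffer.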

## Example Application

The `app-bricks-examples` repository includes a Smart Mirror example for Arduino® VENTUNO™ Q that uses the VLM Brick with a web UI and USB camera:

```python
result = vlm.chat(
    message=prompt.build_user_prompt(USER_PROMPT_TEMPLATE),
    images=[frame],
).strip()
```

In that example, the application continuously captures camera frames, keeps the latest frame in memory, and sends one frame to the VLM when the user starts a scan from the browser interface.
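
The "keep the latest frame in memory" pattern can be sketched with a small lock-protected holder shared between a capture thread and the scan handler. This is a minimal illustration, not the Smart Mirror's actual code; the `LatestFrame` class is hypothetical.

```python
import threading
from typing import Optional


class LatestFrame:
    """Thread-safe holder for the most recent encoded camera frame."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[bytes] = None

    def update(self, frame: bytes) -> None:
        # Called by the capture loop; overwrites any previous frame.
        with self._lock:
            self._frame = frame

    def get(self) -> Optional[bytes]:
        # Called when a scan starts; returns None until a frame arrives.
        with self._lock:
            return self._frame


latest = LatestFrame()
latest.update(b"\xff\xd8\xff" + b"frame-1")
latest.update(b"\xff\xd8\xff" + b"frame-2")
print(latest.get())
```

Because only the newest frame is retained, the VLM is never asked to process a backlog: a scan request simply reads whatever frame is current at that moment.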

## Troubleshooting

### Model not found

**Fix:** Verify that the selected VLM model is downloaded and available in App Lab. If you override the model, make sure the model identifier matches a model exposed by the local `genie` service.

### Empty or generic responses

**Fix:** Use a more specific prompt, lower the temperature for more consistent output, and make sure the image is clear and well lit. For camera frames, verify that the bytes are encoded image data, not raw pixel data.

### Image file not found

**Fix:** Use a path that exists inside the application container, or pass image bytes instead of a file path.
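
When a relative path fails, it often helps to resolve it explicitly and report where the application actually looked. The sketch below reads a file into bytes, which can then be passed in the `images` list instead of the path; `read_image_bytes` is a hypothetical helper.

```python
from pathlib import Path


def read_image_bytes(path: str) -> bytes:
    """Read an image file, raising a descriptive error if it is missing."""
    image_path = Path(path).expanduser().resolve()
    if not image_path.is_file():
        raise FileNotFoundError(
            f"Image not found at {image_path} (working directory: {Path.cwd()})"
        )
    return image_path.read_bytes()
```

The error message includes both the resolved path and the current working directory, which makes a mismatch between the development machine and the application container easier to spot.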

### Response generation fails with a memory error

**Fix:** Reduce `max_tokens`, close other running applications, and restart the app. The VLM can require significant memory during model loading and inference.