GLaDOS can see and react to its environment using Apple's FastVLM running locally via ONNX Runtime.
Vision is a core input to the autonomy loop. When enabled:
- Camera captures frames at configured intervals
- Scene change detection identifies meaningful changes
- FastVLM generates descriptions of the current scene
- VisionUpdateEvent triggers the autonomy loop
- Main agent decides whether to act on what it sees
```mermaid
flowchart LR
    A[Camera<br>Capture] --> B[Scene Change<br>Detection]
    B --> C[FastVLM<br>Inference]
    C --> D[VisionUpdate<br>Event]
    D --> E[Autonomy Loop<br>Main Agent]
```
Vision takes priority over timer ticks - when vision is enabled, scene changes drive the autonomy loop instead of periodic timers.
The vision module is disabled by default. To enable it:
```bash
uv run glados start --config ./configs/glados_vision_config.yaml
```

Then download the FastVLM ONNX models:

```bash
huggingface-cli download onnx-community/FastVLM-0.5B-ONNX \
  --local-dir models/Vision \
  --include "onnx/vision_encoder_fp16.onnx" \
  --include "onnx/embed_tokens_int8.onnx" \
  --include "onnx/decoder_model_merged_q4f16.onnx" \
  --include "config.json" \
  --include "preprocessor_config.json" \
  --include "tokenizer.json" \
  --include "tokenizer_config.json" \
  --include "README.md" \
  --include "LICENSE"
```

Or using the newer command:
```bash
hf download onnx-community/FastVLM-0.5B-ONNX \
  --local-dir models/Vision \
  --include "onnx/vision_encoder_fp16.onnx" \
  --include "onnx/embed_tokens_int8.onnx" \
  --include "onnx/decoder_model_merged_q4f16.onnx" \
  --include "config.json" \
  --include "preprocessor_config.json" \
  --include "tokenizer.json" \
  --include "tokenizer_config.json" \
  --include "README.md" \
  --include "LICENSE"
```

This downloads the ONNX models (~640MB) to the default location.
The `vision` section of the config controls the module:

```yaml
vision:
  enabled: true
  model_dir: "models/Vision"
  camera_index: 0
  capture_interval_seconds: 5
  resolution: 384
  scene_change_threshold: 0.05
  max_tokens: 200
```

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable vision module |
| `model_dir` | string | `"models/Vision"` | Path to FastVLM ONNX models |
| `camera_index` | int | `0` | Camera device index |
| `capture_interval_seconds` | float | `5.0` | Time between frame captures |
| `resolution` | int | `384` | Scene-change detection resolution |
| `scene_change_threshold` | float | `0.05` | Minimum change to trigger inference (0 = always, 1 = never) |
| `max_tokens` | int | `200` | Maximum tokens in background description |
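For reference, the options above map onto a config model roughly like the following (an illustrative sketch with assumed field names mirroring the table; the actual definition is in vision_config.py):

```python
from pydantic import BaseModel

# Illustrative only: field names and defaults mirror the options table;
# the real model is defined in vision_config.py.
class VisionConfig(BaseModel):
    enabled: bool = False
    model_dir: str = "models/Vision"
    camera_index: int = 0
    capture_interval_seconds: float = 5.0
    resolution: int = 384
    scene_change_threshold: float = 0.05
    max_tokens: int = 200
```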
FastVLM provides 85x faster time-to-first-token compared to Ollama-based VLMs:
- Direct ONNX inference - no HTTP overhead
- Runs on CPU or CUDA - GPU acceleration when available
- Small footprint - ~640MB model files for 0.5B
- Frame differencing - skips unchanged scenes
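The frame-differencing step can be pictured as a normalized difference between consecutive downscaled frames (a minimal sketch using OpenCV and NumPy; the actual detector may use a different metric):

```python
import cv2
import numpy as np

def scene_changed(prev: np.ndarray, curr: np.ndarray,
                  resolution: int = 384, threshold: float = 0.05) -> bool:
    """Return True when the normalized difference between two BGR frames
    exceeds the threshold (0 = always trigger, 1 = never trigger)."""
    def prep(frame: np.ndarray) -> np.ndarray:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, (resolution, resolution))
        return small.astype(np.float32) / 255.0

    # Mean absolute difference is already in the 0..1 range, so it can be
    # compared against scene_change_threshold directly.
    return float(np.abs(prep(prev) - prep(curr)).mean()) > threshold
```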
The vision system maintains a single `[vision]` slot that's injected into the LLM context:

```
[vision] A person sitting at a wooden desk with a laptop. There is a coffee mug
to their left and a window showing daylight behind them.
```
This snapshot is updated whenever a new inference completes. The main agent sees the current scene in every request.
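Conceptually, injecting the slot amounts to appending the latest description to the context on every request, along these lines (an illustrative sketch; the function name and formatting are assumptions, not the project's actual API):

```python
def with_vision_slot(base_context: str, latest_scene: str | None) -> str:
    """Append the current [vision] snapshot to the LLM context, if any.

    Illustrative only: the real injection point and formatting live in the
    GLaDOS agent code.
    """
    if not latest_scene:
        return base_context
    return f"{base_context}\n\n[vision] {latest_scene}"
```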
For specific visual questions (e.g., "What color is my shirt?"), the LLM can call the `vision_look` tool:

```
vision_look(prompt="Describe the person's clothing in detail")
```
This triggers:
- Fresh camera capture
- Custom VLM prompt for the specific question
- Detailed response returned to the LLM
Requires an LLM backend that supports tool calling.
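For backends that accept OpenAI-style tool definitions, `vision_look` might be declared roughly like this (a sketch only; the schema GLaDOS actually registers may differ):

```python
# Illustrative OpenAI-style tool definition; the actual schema registered
# by GLaDOS may use different wording or constraints.
VISION_LOOK_TOOL = {
    "type": "function",
    "function": {
        "name": "vision_look",
        "description": "Capture a fresh camera frame and answer a specific "
                       "visual question about the current scene.",
        "parameters": {
            "type": "object",
            "properties": {
                "prompt": {
                    "type": "string",
                    "description": "The visual question for the VLM, e.g. "
                                   "'Describe the person's clothing in detail'.",
                },
            },
            "required": ["prompt"],
        },
    },
}
```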
Vision runs in a separate thread alongside other processors:
- Captures frames at `capture_interval_seconds`
- Compares frames using the configured threshold
- Runs VLM inference when scene changes detected
- Updates VisionState with latest description
- Emits VisionUpdateEvent to trigger autonomy
The thread is fully async and doesn't block voice or text processing.
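A stripped-down version of that loop, reusing the `scene_changed` sketch from earlier, might look like this (illustrative only; `vlm.describe`, `state.update`, and `emit_event` are assumed interfaces, not the project's actual API):

```python
import threading
import time

import cv2


def vision_loop(config, vlm, state, emit_event, stop: threading.Event) -> None:
    """Background capture loop: grab a frame, detect scene changes, run the
    VLM when needed, and notify the autonomy loop."""
    cam = cv2.VideoCapture(config.camera_index)
    prev = None
    while not stop.is_set():
        ok, frame = cam.read()
        if ok and (prev is None or scene_changed(prev, frame,
                                                 config.resolution,
                                                 config.scene_change_threshold)):
            description = vlm.describe(frame, max_tokens=config.max_tokens)
            state.update(description)          # refresh the [vision] slot
            emit_event("VisionUpdateEvent")    # wake the autonomy loop
            prev = frame
        time.sleep(config.capture_interval_seconds)
    cam.release()
```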
Camera not opening:

- Check `camera_index` in config (try 0, 1, 2...)
- Verify camera permissions
- Test with `ls /dev/video*` (Linux), check System Preferences (macOS), or use the snippet below
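To probe camera indices directly from Python, a quick check like this can help (a minimal sketch assuming OpenCV is installed):

```python
import cv2

# Try the first few camera indices and report which ones deliver frames.
for index in range(4):
    cam = cv2.VideoCapture(index)
    ok, _ = cam.read()
    print(f"camera_index={index}: {'ok' if ok else 'not available'}")
    cam.release()
```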
Models not found:

- Ensure models downloaded to `models/Vision/`
- Check for `vision_encoder_fp16.onnx`, `embed_tokens_int8.onnx`, `decoder_model_merged_q4f16.onnx`
Slow inference:

- Increase `capture_interval_seconds`
- Ensure CUDA is available (`CUDAExecutionProvider`)
- Raise `scene_change_threshold` (higher = fewer inferences)
Too many triggers:

- Increase `scene_change_threshold` (0.1 or higher)
- The threshold is a normalized difference score - adjust based on your environment
To use a model stored elsewhere, point `model_dir` at it:

```yaml
vision:
  model_dir: "/path/to/custom/fastvlm"
```

To disable vision, remove the entire `vision:` section from your config, or set:

```yaml
vision:
  enabled: false
```

| Aspect | Value |
|---|---|
| Model | Apple FastVLM-0.5B (ONNX) |
| Precision | fp16 + q4f16 mix |
| Architecture | Vision encoder + text decoder |
| Input | 1024x1024 RGB images (center-cropped) |
| Output | Natural language scene descriptions |
| Backend | ONNX Runtime (CPU/CUDA) |
| Integration | Same ONNX patterns as ASR/TTS |
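The 1024x1024 center-cropped input can be produced with a preprocessing step along these lines (an illustrative sketch; exact normalization and tensor layout should follow the model's preprocessor_config.json):

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray, size: int = 1024) -> np.ndarray:
    """Center-crop a BGR frame to a square, resize to size x size, and
    return RGB float32 pixels in [0, 1].

    Illustrative only: normalization constants and layout (HWC vs CHW)
    come from preprocessor_config.json shipped with the model.
    """
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]
    resized = cv2.resize(crop, (size, size))
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    return rgb.astype(np.float32) / 255.0
```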
- README - Full architecture diagram
- autonomy.md - How vision triggers the autonomy loop
- vision_config.py - Configuration source
- constants.py - Vision system prompts