pi-vision

A pi extension that provides a dedicated analyze_image tool for vision-based image analysis using configurable vision models.

Why?

Many models don't have vision capabilities (e.g., GLM text models). This extension provides an analyze_image tool that agents can call on-demand to analyze images using separate vision-capable models like glm-4.6v or Claude Sonnet.

Features

  • Dedicated analyze_image tool: Agents can call this tool on demand when they need to understand image content
  • Configurable vision models: Define multiple vision provider/model combinations in settings
  • Automatic model selection: Uses the configured default or first available model
  • Image preview for users: Shows image in TUI during analysis (display only, not stored in session)
  • Session-efficient: Returns text-only results to avoid flooding session with base64 image data
  • Image metadata: Displays path, size, and MIME type
  • Manual analysis command: /analyze-image <path> for interactive image analysis
  • Runtime model switching: /vision-model to view and switch between configured models

Installation

Install globally from git:

pi install git:github.com/kifirkin/pi-vision

Or install for a specific project (writes to .pi/settings.json):

pi install -l git:github.com/kifirkin/pi-vision

To try it without installing:

pi -e git:github.com/kifirkin/pi-vision

Configuration (Required)

You must configure vision models before using this extension. Add configuration to your settings file:

Global settings (~/.pi/agent/settings.json):

{
  "visionModels": [
    { "provider": "zai", "model": "glm-4.6v" },
    { "provider": "anthropic", "model": "claude-sonnet-4-5" },
    { "provider": "openai", "model": "gpt-4o" }
  ],
  "visionModel": "zai/glm-4.6v"
}

Project settings (.pi/settings.json):

Project settings override global settings:

{
  "visionModels": [
    { "provider": "zai", "model": "glm-4.6v" }
  ]
}
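One way to read "project settings override global settings" is a key-by-key override, sketched below. The `VisionSettings` type and `mergeSettings` function are hypothetical names used only for illustration, and the sketch assumes a project-level key replaces the global key wholesale (arrays are not concatenated); the extension's actual merge behavior may differ.

```typescript
// Hypothetical shape of the settings keys documented in this README.
interface VisionModelRef {
  provider: string;
  model: string;
}

interface VisionSettings {
  visionModels?: VisionModelRef[];
  visionModel?: string;
  maxImageSizeMB?: number;
}

// Assumption: any key present in the project settings replaces the
// global value entirely; keys missing from the project file fall
// through to the global settings.
function mergeSettings(
  globalSettings: VisionSettings,
  projectSettings: VisionSettings,
): VisionSettings {
  return { ...globalSettings, ...projectSettings };
}

const merged = mergeSettings(
  {
    visionModels: [
      { provider: "zai", model: "glm-4.6v" },
      { provider: "anthropic", model: "claude-sonnet-4-5" },
    ],
    visionModel: "zai/glm-4.6v",
  },
  { visionModels: [{ provider: "zai", model: "glm-4.6v" }] },
);
```

Under this assumption, `merged.visionModels` holds only the project's single entry, while `merged.visionModel` is inherited from the global file.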

Configuration options:

| Setting        | Type   | Required | Description |
|----------------|--------|----------|-------------|
| visionModels   | Array  | Yes      | List of {provider, model} objects for vision analysis |
| visionModel    | String | No       | Default model to use, format: "provider/model". If not set, uses first model in list. |
| maxImageSizeMB | Number | No       | Warn when images exceed this size in MB (default: 5). Analysis still proceeds, but may be slower. |
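The default-model rule above can be sketched roughly as follows. `parseModelRef` and `resolveVisionModel` are hypothetical helpers, not the extension's real functions, and falling back to the first entry when `visionModel` names a model that is not in the list is an assumption.

```typescript
interface VisionModelRef {
  provider: string;
  model: string;
}

// Split "provider/model" at the first slash so that model ids which
// themselves contain "/" are kept intact.
function parseModelRef(ref: string): VisionModelRef | undefined {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) return undefined;
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Pick the configured default if it is in the list, otherwise the first
// entry. Returns undefined when no vision models are configured.
function resolveVisionModel(
  models: VisionModelRef[],
  preferred?: string,
): VisionModelRef | undefined {
  const wanted = preferred ? parseModelRef(preferred) : undefined;
  if (wanted) {
    const match = models.find(
      (m) => m.provider === wanted.provider && m.model === wanted.model,
    );
    if (match) return match;
  }
  // Assumption: an unknown or missing default falls back to the first model.
  return models[0];
}
```

For example, `resolveVisionModel(models, "anthropic/claude-sonnet-4-5")` returns the Anthropic entry when it is listed, and `resolveVisionModel(models)` returns the first configured model.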

Image Size Considerations

Vision models typically resize images automatically, but large files still slow down analysis:

| Size   | Recommendation |
|--------|----------------|
| < 1MB  | ✅ Optimal - fast analysis |
| 1-5MB  | ✅ Good - standard for screenshots |
| 5-10MB | ⚠️ Slow - consider resizing first |
| > 10MB | ⚠️ Very slow - strongly recommend resizing |

Tips for large images:

  • The extension warns when images exceed maxImageSizeMB (default: 5MB)
  • Vision models typically resize to ~1024px or ~2000px on longest side anyway
  • For 4K screenshots, consider cropping to the relevant area
  • PNG images with photographic content can often be converted to JPEG to reduce file size; keep PNG for diagrams and code, where JPEG compression blurs fine detail
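As a rough illustration of the size guidance above, the following sketch classifies a file size and produces a warning once it exceeds maxImageSizeMB. The function names are hypothetical, 1 MB is assumed to be 1024 × 1024 bytes, and the handling of sizes that fall exactly on a boundary is a guess.

```typescript
type SizeAdvice = "optimal" | "good" | "slow" | "very-slow";

const BYTES_PER_MB = 1024 * 1024; // assumption: binary megabytes

// Map a file size onto the recommendation tiers listed above.
function classifyImageSize(bytes: number): SizeAdvice {
  const mb = bytes / BYTES_PER_MB;
  if (mb < 1) return "optimal";
  if (mb <= 5) return "good";
  if (mb <= 10) return "slow";
  return "very-slow";
}

// Return a warning string when the image exceeds the limit, or
// undefined when it does not. Analysis is never blocked.
function sizeWarning(bytes: number, maxImageSizeMB = 5): string | undefined {
  const mb = bytes / BYTES_PER_MB;
  if (mb <= maxImageSizeMB) return undefined;
  return `Image is ${mb.toFixed(1)} MB (limit ${maxImageSizeMB} MB); analysis will proceed but may be slow.`;
}
```

The warning is advisory only, matching the documented behavior that analysis still proceeds for oversized images.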

Usage

As an Agent Tool

The LLM can call the analyze_image tool when needed:

analyze_image({"path": "./screenshot.png"})

The tool will:

  1. Check if the file is a supported image format
  2. Show progress with image metadata (path, size, MIME type)
  3. Display image preview in TUI for user (during analysis only)
  4. Use the configured vision model to analyze the image
  5. Return text-only analysis result (stored in session)
  6. Include image path in tool details for traceability

Session Behavior

Important design decision for token efficiency:

  • During analysis: User sees image preview in TUI via onUpdate
  • After analysis: Session stores only text analysis + image path reference
  • Image data: NOT stored in session history to avoid token flooding (~4800 tokens per image)

This ensures:

  • ✅ Users can see images during analysis
  • ✅ Session has trace of what images were analyzed (via path in details)
  • ✅ No base64 image data in session = no token cost for subsequent LLM calls
  • ✅ Compaction not impacted by large image payloads

Example stored in session:

[Image analyzed: /path/to/screenshot.png]

{comprehensive analysis text...}

With tool details:

{
  "path": "/path/to/screenshot.png",
  "visionModel": { "provider": "zai", "model": "glm-4.6v" },
  "imageAnalyzed": true
}
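The stored text and tool details shown above could be assembled along these lines. The type names and `buildResult` are hypothetical and only mirror the example; they are not taken from the extension's source or from the pi extension API.

```typescript
interface VisionModelRef {
  provider: string;
  model: string;
}

// Mirrors the "tool details" example above.
interface AnalyzeImageDetails {
  path: string;
  visionModel: VisionModelRef;
  imageAnalyzed: boolean;
}

interface AnalyzeImageResult {
  text: string; // what ends up in the session
  details: AnalyzeImageDetails; // traceability metadata
}

// Build a text-only result: the image is referenced by path, and no
// image bytes or base64 data are included.
function buildResult(
  path: string,
  visionModel: VisionModelRef,
  analysis: string,
): AnalyzeImageResult {
  return {
    text: `[Image analyzed: ${path}]\n\n${analysis}`,
    details: { path, visionModel, imageAnalyzed: true },
  };
}
```

Keeping only the path in the result is what lets later LLM calls reuse the session without paying for the image payload again.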

Commands

/analyze-image <path>

Manually analyze an image:

/analyze-image ./screenshot.png

/vision-model [provider/model]

View or change the active vision model:

/vision-model                    # Show active and available models
/vision-model zai/glm-4.6v      # Switch to specific model
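A minimal sketch of how such a command might behave, assuming the argument is matched against the configured list as a literal "provider/model" string. `handleVisionModelCommand` and its message wording are hypothetical; the real command output may look different.

```typescript
interface VisionModelRef {
  provider: string;
  model: string;
}

interface CommandOutcome {
  active: VisionModelRef; // the model in effect after the command
  message: string; // text shown to the user
}

const formatModel = (m: VisionModelRef): string => `${m.provider}/${m.model}`;

// With no argument, report the active and available models.
// With an argument, switch only if the model is in the configured list.
function handleVisionModelCommand(
  arg: string,
  models: VisionModelRef[],
  active: VisionModelRef,
): CommandOutcome {
  const requested = arg.trim();
  if (requested === "") {
    const available = models.map(formatModel).join(", ");
    return { active, message: `Active: ${formatModel(active)}. Available: ${available}` };
  }
  const match = models.find((m) => formatModel(m) === requested);
  if (!match) {
    return { active, message: `Unknown vision model: ${requested}` };
  }
  return { active: match, message: `Switched to ${formatModel(match)}` };
}
```

Rejecting names outside the configured list keeps the active model consistent with the visionModels setting.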

Supported Image Formats

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif)
  • WebP (.webp)
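The format check can be sketched as an extension lookup that also yields the MIME type shown in the metadata. This assumes detection by file extension rather than by inspecting file contents, which the README does not specify; `mimeTypeFor` is a hypothetical helper.

```typescript
// Standard MIME types for the supported extensions listed above.
const MIME_BY_EXTENSION: Record<string, string> = {
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".png": "image/png",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Return the MIME type for a supported image path, or undefined when
// the extension is missing or unsupported. Matching is case-insensitive.
function mimeTypeFor(path: string): string | undefined {
  const dot = path.lastIndexOf(".");
  if (dot < 0) return undefined;
  return MIME_BY_EXTENSION[path.slice(dot).toLowerCase()];
}
```

A caller would treat an `undefined` result as "not a supported image format" and skip the analysis.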

License

MIT License - see LICENSE file
