An MCP (Model Context Protocol) server that provides unified multimodal image analysis using vision models. Just upload an image with context about what you want to know, and the AI figures out the best approach.
- Unified Tool: One `analyze_image_with_context` tool for all image analysis needs
- Smart Analysis: The AI automatically determines whether to extract text, analyze diagrams, describe images, or summarize content
- Flexible Context: Provide natural language context about what you want to know
- Multiple Input Formats: Support for base64, file paths, and data URLs
- No Local Processing: Pure MCP-to-LLM architecture with minimal dependencies
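Since `image_data` may arrive as raw base64, a file path, or a data URL, a server like this has to normalize inputs before calling the vision model. A minimal sketch of that normalization (the helper name `to_data_url` is hypothetical, not taken from this project's source):

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_data: str) -> str:
    """Normalize a file path, raw base64 string, or data URL into a data URL."""
    if image_data.startswith("data:"):
        return image_data  # already a data URL, pass through unchanged
    path = Path(image_data)
    if path.is_file():
        # Guess the MIME type from the extension; fall back to PNG
        mime = mimetypes.guess_type(path.name)[0] or "image/png"
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"data:{mime};base64,{encoded}"
    # Otherwise treat the string as raw base64 (assumed PNG when unknown)
    return f"data:image/png;base64,{image_data}"
```

The pass-through for existing data URLs means callers can mix input styles freely without the server re-encoding anything.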
```
visual-mcp/
├── src/visual_mcp/        # Source code
│   ├── __init__.py        # Package init
│   ├── main.py            # Entry point
│   └── server.py          # MCP server implementation
├── tests/                 # Test suite
│   ├── __init__.py
│   └── test_server.py
├── examples/              # Example scripts
│   └── example_usage.py
├── docs/                  # Documentation
│   ├── CLAUDE_DESKTOP_SETUP.md
│   ├── PROJECT_SUMMARY.md
│   └── CRUSH.md
├── .env.example           # Environment template
├── pyproject.toml         # Project config
└── .gitignore             # Git ignore rules
```
- Python 3.11 or later
- uv package manager
Using the Makefile (recommended):
```shell
# Install dependencies and set up the development environment
make install

# Quick setup (alias for install)
make setup-dev
```

Manual installation:
```shell
# Install dependencies
uv sync

# Install pre-commit hooks
uv run pre-commit install
```

The project includes a comprehensive Makefile to streamline development tasks:
```shell
# Install everything
make install

# Run the development server
make dev

# Run tests
make test

# Check code quality
make check-all
```

```shell
# Code Quality Checks
make lint        # Run linting with ruff
make format      # Format code with ruff
make type-check  # Run type checking with mypy
make check-all   # Run all checks (lint + format + type-check)

# Testing
make test        # Run tests with coverage
make test-watch  # Run tests in watch mode (continuous testing)

# Server Operations
make dev         # Run development MCP server
make run-server  # Run production MCP server

# Build & Release
make build       # Clean and build distribution package
make wheel       # Quick wheel build for MCP testing
make clean       # Remove all build artifacts and caches

# Examples
make example     # Run example usage script
```
```shell
# Run development server
uv run mcp dev src/visual_mcp/server.py

# Run tests with coverage
uv run pytest tests/ -v --cov=src/visual_mcp

# Code quality
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run mypy src/ tests/

# Build distribution
uv build

# Run example
uv run python examples/example_usage.py
```

For detailed instructions on integrating with Claude Desktop, see docs/CLAUDE_DESKTOP_SETUP.md.
Add this configuration to Claude Desktop:
```json
{
  "mcpServers": {
    "visual-mcp": {
      "command": "uvx",
      "args": ["visual-mcp"],
      "env": {
        "GLM_API_KEY": "your-api-key-here",
        "GLM_MODEL_NAME": "glm-4.5v"
      }
    }
  }
}
```

For development, or to use a specific version, build the wheel first:
```shell
# Build the wheel package
uv build
```

Then add this configuration to Claude Desktop:
```json
{
  "mcpServers": {
    "visual-mcp": {
      "command": "uvx",
      "args": ["--from", "dist/visual_mcp-0.1.0-py3-none-any.whl", "visual-mcp"],
      "env": {
        "GLM_API_KEY": "your-api-key-here",
        "GLM_MODEL_NAME": "glm-4.5v"
      }
    }
  }
}
```

The `--from` flag tells uvx to use the specified wheel file instead of downloading from PyPI. This is useful for:
- Testing local builds
- Using specific versions
- Development workflows
- Offline installations
- `GLM_API_KEY` (Required): Your GLM API key from https://z.ai/model-api
- `GLM_MODEL_NAME` (Optional): Model name to use (default: `glm-4.5v`)
- Any OpenAI-compatible vision model is supported
  - GLM example: `glm-4.5v` (the only GLM model with vision support)
  - Other examples: `gpt-4-vision-preview`, `gpt-4-turbo`, `claude-3-5-sonnet-20241022`, etc.
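These models all accept the OpenAI-style chat-completions request shape for vision input. A hedged sketch of the request body such a server would send (field names follow the OpenAI vision convention, not this project's actual source):

```python
def build_vision_request(model: str, image_url: str, user_context: str,
                         max_tokens: int = 3000) -> dict:
    """Build an OpenAI-compatible chat-completions body with one image attachment."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    # The user's natural-language context travels as a text part...
                    {"type": "text", "text": user_context},
                    # ...and the image as an image_url part (data URL or HTTP URL)
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
```

Because the body is plain JSON, switching between GLM and other OpenAI-compatible providers only changes the `model` name and the API base URL, not the request structure.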
Instead of multiple specialized tools, simply use one tool with natural language context:
Parameters:
- `image_data`: Base64-encoded image data, a file path, or a data URL
- `user_context`: What you want to know; be specific about your needs
- `max_tokens`: Maximum tokens in the response (default: 3000)
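An MCP client passes these fields as the tool's arguments. A small sketch of assembling them from a local image file (the helper name is hypothetical):

```python
import base64
from pathlib import Path


def build_tool_arguments(image_path: str, user_context: str,
                         max_tokens: int = 3000) -> dict:
    """Assemble analyze_image_with_context arguments from a local image file."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "image_data": encoded,
        "user_context": user_context,
        "max_tokens": max_tokens,
    }
```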
Text Extraction:
- "Extract and summarize all text in this document"
- "What does this contract say about termination clauses?"
- "Transcribe all handwritten text in this image"
Diagram Analysis:
- "Analyze this architecture diagram and explain the system flow"
- "Explain this UML diagram focusing on class relationships"
- "What's the logic shown in this flowchart?"
General Description:
- "Describe this photo focusing on people and setting"
- "What colors and composition do you see in this painting?"
- "Identify the main objects in this image"
Problem Solving:
- "What's wrong with this code screenshot?"
- "Identify safety issues in this workplace photo"
- "Find errors in this mathematical diagram"
Educational Content:
- "Explain this scientific diagram step by step"
- "Teach me about the components shown in this image"
- "Break down this complex visual for a beginner"
```mermaid
graph TB
    subgraph "Client Applications"
        A[Client App 1]
        B[Client App 2]
        C[Client App 3]
    end

    subgraph "Visual MCP Server"
        D[MCP Server]
        E[Image Upload Handler]
        F[Vision Model]
        G[Analysis Tools]
    end

    subgraph "External Services"
        H[Vision Model API]
    end

    A --> D
    B --> D
    C --> D
    D --> E
    E --> F
    F --> G
    F --> H
    G --> A
    G --> B
    G --> C
```
Set your vision model API key in environment variables:
```shell
# For GLM models, get your API key from https://z.ai/model-api
export GLM_API_KEY="your-api-key-here"

# For other OpenAI-compatible models, use the appropriate key name
# export OPENAI_API_KEY="your-openai-api-key-here"

# Optional: Set a custom API base URL
# export GLM_API_BASE="https://open.bigmodel.cn/api/paas/v4"
# export OPENAI_API_BASE="https://api.openai.com/v1"
```
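On startup, a server following this configuration would resolve its settings from the environment along these lines (a sketch; the function name and fallback order are assumptions, and the defaults mirror the values shown above):

```python
import os


def load_config() -> dict:
    """Resolve vision-model settings from the environment (defaults assumed)."""
    # Prefer the GLM key; fall back to an OpenAI-style key if present
    api_key = os.environ.get("GLM_API_KEY") or os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set GLM_API_KEY (or OPENAI_API_KEY) before starting the server")
    return {
        "api_key": api_key,
        # Defaults taken from this README's examples
        "model": os.environ.get("GLM_MODEL_NAME", "glm-4.5v"),
        "api_base": os.environ.get("GLM_API_BASE", "https://open.bigmodel.cn/api/paas/v4"),
    }
```

Failing fast on a missing key surfaces misconfiguration at startup rather than on the first tool call.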