This guide covers different VLM backend options to power your Live VLM WebUI.
Best for: Quick start, easy model management, beginners
# Install ollama from https://ollama.ai/download
# Pull a vision model
ollama pull llama3.2-vision:11b
# Start ollama server
ollama serveRecommended Models:
llama3.2-vision:11b- Good balance of quality and speedllava:7b- Faster, lighter modelllava:13b- Higher quality
Best for: Production deployments, high throughput, GPU optimization
# Install vLLM
pip install vllm
# Start vLLM server with a vision model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-11B-Vision-Instruct \
--port 8000For NVIDIA Jetson Thor, use the official vLLM container:
# Pull and run vLLM container
docker run --rm -it \
--network host \
--shm-size=16g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--runtime=nvidia \
--name=vllm \
-v $HOME/data/models/huggingface:/root/.cache/huggingface \
-v $HOME/data/vllm_cache:/root/.cache/vllm \
nvcr.io/nvidia/vllm:25.10-py3
# Inside the container:
vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-codeRecommended Vision Models:
# Smaller, faster (4B parameters)
vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code
# Llama 3.2 Vision (11B parameters, higher quality)
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --trust-remote-code
# Qwen2-VL (7B parameters, good balance)
vllm serve Qwen/Qwen2-VL-7B-Instruct --trust-remote-codeBest for: Complex prompts, structured outputs, research
# Install SGLang
pip install "sglang[all]"
# Start SGLang server
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
--port 30000Best for: No local GPU, cloud-based inference, instant access
1. Get your API Key:
- Visit NVIDIA API Catalog
- Sign in with your NVIDIA account (free)
- Navigate to a vision model (e.g., Llama 3.2 Vision)
- Click "Get API Key"
2. Test with curl:
curl -X POST "https://ai.api.nvidia.com/v1/gr/meta/llama-3.2-90b-vision-instruct/chat/completions" \
-H "Authorization: Bearer nvapi-YOUR_KEY_HERE" \
-H "Content-Type: application/json" \
-d '{
"model": "meta/llama-3.2-90b-vision-instruct",
"messages": [
{
"role": "user",
"content": "What is in this image? <img src=\"data:image/png;base64,...\" />"
}
],
"max_tokens": 512
}'3. Configure in WebUI:
- API Base URL:
https://ai.api.nvidia.com/v1/gr - API Key:
nvapi-YOUR_KEY_HERE - Model:
meta/llama-3.2-90b-vision-instruct
Available Vision Models:
meta/llama-3.2-90b-vision-instruct- Highest quality, 90B parametersmeta/llama-3.2-11b-vision-instruct- Good balancemicrosoft/phi-3-vision-128k-instruct- Fast, 4.2B parametersnvidia/neva-22b- NVIDIA's vision model
Notes:
- Free tier available with rate limits
- No local GPU required
- Keep your API key secure!
| Backend | Setup Difficulty | Speed | Quality | GPU Required |
|---|---|---|---|---|
| Ollama | ⭐ Easy | 🟢 Fast | 🟢 Good | Yes (local) |
| vLLM | ⭐⭐ Medium | 🟢🟢 Fastest | 🟢🟢 Excellent | Yes (local) |
| SGLang | ⭐⭐ Medium | 🟢 Fast | 🟢🟢 Excellent | Yes (local) |
| NVIDIA API | ⭐ Easy | 🟡 Medium | 🟢🟢🟢 Best | No |
| Backend | Default Port | API Base URL |
|---|---|---|
| Ollama | 11434 | http://localhost:11434/v1 |
| vLLM | 8000 | http://localhost:8000/v1 |
| SGLang | 30000 | http://localhost:30000/v1 |
| NVIDIA API | - | https://ai.api.nvidia.com/v1/gr |