Date: 2025-07-07
Context: Migrating TUI Chat system from Ollama to vLLM for better multimodal performance
- ModelChecker Enhancement: Added vLLM support with proper loading status detection
- Provider Integration: Successfully implemented vLLM provider in VLMClient
- Configuration Migration: Updated config.json from Ollama to vLLM endpoints
- Performance Research: Documented 3.2x performance gains vs Ollama
- Learning: Even "small" multimodal models require significant VRAM
- Reality: Qwen2.5-VL-72B-AWQ (supposedly optimized) still failed on 2x24GB GPUs
- Takeaway: Always check actual memory requirements vs marketing claims
- Discovery: vLLM requires specific arguments for different model architectures
- Example: DeepSeek-VL2 needs `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
- Lesson: docker-compose makes iterating on configs much easier than manual commands
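A compose service built around that flag might look like the sketch below. Only the `--hf_overrides` argument comes from the notes above; the service name, image tag, port, and model id are illustrative assumptions to fill in a runnable shape.

```yaml
# Illustrative vLLM service; only --hf_overrides is taken from the notes.
services:
  vllm:
    image: vllm/vllm-openai:v0.6.3   # pin a tag known to support the model
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model deepseek-ai/deepseek-vl2
      --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
```

Keeping the architecture override in the compose file (rather than a shell history entry) is what makes the config iteration loop cheap.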
- AWQ: Faster inference, better for production, limited model availability
- GPTQ: More models available, slightly slower inference
- Confusion: Why some models work with AWQ and others don't - appears to be architecture-dependent
- Problem: vLLM container shows "running" but API isn't ready until model loads
- Solution: Implemented loading status checks in ModelChecker
- Learning: Health endpoints often lie - check `/v1/models` for actual readiness
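A minimal readiness probe along these lines could look like the sketch below. `parse_models` and `model_ready` are hypothetical names, not the actual ModelChecker API; the idea is simply that anything short of a clean `/v1/models` listing counts as "not ready".

```python
import json
import urllib.request
from typing import Optional

def parse_models(payload: dict, model_id: Optional[str] = None) -> bool:
    """True when a /v1/models JSON payload actually lists a loaded model."""
    models = [m.get("id") for m in payload.get("data", [])]
    return (model_id in models) if model_id is not None else bool(models)

def model_ready(base_url: str, model_id: Optional[str] = None) -> bool:
    """Poll the OpenAI-compatible /v1/models endpoint.

    The container can report "running" (and even answer a health route)
    long before weights finish loading, so the model listing is treated
    as the source of truth here.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return parse_models(json.load(resp), model_id)
    except Exception:
        return False
```

The pure `parse_models` helper keeps the network call separate, so the readiness logic itself stays easy to test.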
- Confusion: Search results mention "16B" variant but official repo shows 3B/27B
- Question: Are there unofficial quantizations we should consider?
- Impact: Affects model selection strategy
- Observation: Same model available in AWQ, GPTQ, GGUF, EXL2 formats
- Question: Is it worth maintaining our own quantization pipeline?
- Trade-off: Time investment vs waiting for community quants
- Issue: Different vLLM versions support different model architectures
- Example: DeepSeek-VL2 support seems recent, not in all versions
- Strategy: Pin vLLM version in docker-compose for stability
- Phi-3.5-vision-instruct (4.2B): ~2-3 minutes
- Qwen2.5-VL-7B-AWQ: ~5-10 minutes
- Qwen2.5-VL-72B-AWQ: Failed (OOM)
- Idle: ~1-7GB across GPUs
- Loading: Memory gradually increases
- Serving: Stable higher usage (varies by model)
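Watching that idle → loading → serving curve is easy to script against `nvidia-smi`'s CSV query mode. A sketch (function names are ours, not part of the project):

```python
import subprocess
from typing import List

def parse_memory_csv(text: str) -> List[int]:
    """Parse nvidia-smi CSV output: one integer MiB value per line,
    e.g. '1024\n6900\n' -> [1024, 6900]."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]

def query_gpu_memory() -> List[int]:
    """Used memory (MiB) per GPU via nvidia-smi's machine-readable output."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_memory_csv(out)
```

Polling this in a loop during model startup makes the "gradually increases, then plateaus" pattern visible without staring at `nvidia-smi` manually.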
Decision: Start with working smaller models, scale up later
Rationale: Better to have working system than perfect model
Result: Phi-3.5 vision loads reliably vs Qwen failures
Decision: Use docker-compose for vLLM deployment
Rationale: Easier configuration iteration and environment reproduction
Result: Much faster debugging of model loading issues
Decision: Use 65% GPU memory utilization instead of 85%
Rationale: AWQ quantization + high utilization caused crashes
Result: More stable model loading
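In compose terms, the decision above is one flag on the serve command. Only `--gpu-memory-utilization 0.65` reflects the decision; the model id and quantization flag are illustrative:

```yaml
    # Inside the vLLM service definition (illustrative model name):
    command: >
      --model Qwen/Qwen2.5-VL-7B-Instruct-AWQ
      --quantization awq
      --gpu-memory-utilization 0.65
```

vLLM defaults this value higher, which is what made the AWQ loads crash; headroom matters more than squeezing out the last few GB.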
- Problem: Not all models work with all quantization methods
- Example: AWQ + bfloat16 incompatibility with some models
- Solution: Model-specific docker configurations
- Problem: Container "running" != API ready != model loaded
- Impact: TUI startup hangs without clear feedback
- Solution: Multi-layer status checking in ModelChecker
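The multi-layer idea can be sketched as collapsing three independent signals into one status for the TUI. The names below (`BackendStatus`, `classify`) are hypothetical, not the real ModelChecker interface:

```python
from enum import Enum
from typing import List

class BackendStatus(Enum):
    CONTAINER_DOWN = "container_down"
    API_UNREACHABLE = "api_unreachable"  # container up, port not answering yet
    MODEL_LOADING = "model_loading"      # API answers but target model absent
    READY = "ready"                      # /v1/models lists the target model

def classify(container_running: bool, api_responding: bool,
             loaded_models: List[str], target: str) -> BackendStatus:
    """Order matters: each layer only means anything if the one below holds."""
    if not container_running:
        return BackendStatus.CONTAINER_DOWN
    if not api_responding:
        return BackendStatus.API_UNREACHABLE
    if target not in loaded_models:
        return BackendStatus.MODEL_LOADING
    return BackendStatus.READY
```

Surfacing the intermediate states (instead of a boolean) is what stops the TUI startup from hanging with no feedback.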
- Problem: Official docs often lag behind actual capabilities
- Example: DeepSeek-VL2 support exists but not well documented
- Strategy: Community resources (GitHub issues, HF discussions) more reliable
- Complete Phi-3.5 testing: Verify full camera tool integration
- Document working configurations: Create reliable deployment recipes
- Test DeepSeek-VL2: Try the promising MoE architecture
- Quantization Strategy: Build our own vs wait for community?
- Model Selection Criteria: Performance vs reliability vs memory usage?
- Multi-GPU Utilization: Are we properly leveraging both GPUs?
- "Running Multimodal AI Locally: Ollama vs vLLM Performance Comparison"
- "Docker-First Approach to Local AI Model Deployment"
- "The Hidden Costs of Large Vision-Language Models"
- Start Small, Scale Up: Working system beats perfect system
- Infrastructure First: Docker/compose saves massive debugging time
- Monitor Reality: Logs > status > documentation
- Community Knowledge: GitHub issues often more current than docs
- Memory Planning: Always allocate 30%+ buffer for model loading
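The 30% buffer rule and the 72B-AWQ OOM can be sanity-checked with back-of-envelope arithmetic. The constants here are rough assumptions (4-bit AWQ ≈ 0.5 bytes/param; buffer covers activations, KV cache, and loading overhead), not measured values:

```python
def estimated_vram_gb(params_billion: float, bits_per_param: float = 4.0,
                      buffer_frac: float = 0.30) -> float:
    """Very rough VRAM estimate: quantized weight size plus a safety buffer.

    Assumes ~1 GB per billion params per byte of precision; ignores
    per-GPU fragmentation, so treat the result as a lower bound.
    """
    weights_gb = params_billion * bits_per_param / 8
    return weights_gb * (1 + buffer_frac)

# 72B at 4-bit: ~36 GB of weights, ~46.8 GB with buffer -- already tight on
# 2x24 GB before the KV cache and per-GPU fragmentation are counted,
# consistent with the Qwen2.5-VL-72B-AWQ failure above.
```

Even this crude estimate would have predicted the 72B failure before any download started.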
Status: Migration 80% complete - API verified, integration testing pending
Next Session: Complete end-to-end camera tool testing with vLLM backend