vLLM Migration Journey: Learnings & Confusions

Date: 2025-07-07
Context: Migrating TUI Chat system from Ollama to vLLM for better multimodal performance

🎯 Core Objectives Achieved

  1. ModelChecker Enhancement: Added vLLM support with proper loading status detection
  2. Provider Integration: Successfully implemented vLLM provider in VLMClient
  3. Configuration Migration: Updated config.json from Ollama to vLLM endpoints
  4. Performance Research: Documented 3.2x performance gains vs Ollama

🧠 Key Learnings

1. Model Size vs GPU Memory Reality Check

  • Learning: Even "small" multimodal models require significant VRAM
  • Reality: Qwen2.5-VL-72B-AWQ (supposedly optimized) still failed on 2x24GB GPUs
  • Takeaway: Always check actual memory requirements vs marketing claims

2. vLLM Configuration Complexity

  • Discovery: vLLM requires specific arguments for different model architectures
  • Example: DeepSeek-VL2 needs --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
  • Lesson: Docker-compose makes iterating on configs much easier than manual commands
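A docker-compose service for this kind of setup might look roughly like the sketch below. The image tag, model ID, and port are illustrative assumptions, not our actual config; the `--hf_overrides` flag is the one noted above, and the 0.65 memory utilization matches the conservative allocation decision described later in this doc:

```yaml
# Hypothetical compose sketch -- tag, model, and port are placeholders.
services:
  vllm:
    image: vllm/vllm-openai:v0.7.2        # pin a specific version for stability
    command: >
      --model deepseek-ai/deepseek-vl2
      --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.65
    ports:
      - "8000:8000"
    ipc: host                             # vLLM needs shared memory for tensor parallel
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Editing this file and re-running `docker compose up` is the iteration loop that made config debugging tolerable.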

3. AWQ vs GPTQ Trade-offs

  • AWQ: Faster inference, better for production, limited model availability
  • GPTQ: More models available, slightly slower inference
  • Confusion: Why some models work with AWQ and others don't; it appears to be architecture-dependent

4. Model Loading vs API Readiness

  • Problem: vLLM container shows "running" but API isn't ready until model loads
  • Solution: Implemented loading status checks in ModelChecker
  • Learning: Health endpoints often lie - check /v1/models for actual readiness
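The readiness check can be sketched roughly as follows. The endpoint path is vLLM's OpenAI-compatible `/v1/models`; the helper names are illustrative, not the actual ModelChecker code:

```python
import json
import urllib.error
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def served_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Return model IDs reported by /v1/models, or [] if the API isn't up yet."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, json.JSONDecodeError):
        return []
    return parse_model_ids(payload)


def is_ready(base_url: str, model: str) -> bool:
    """Ready means the model actually appears in /v1/models,
    not merely that the container is running."""
    return model in served_models(base_url)
```

Polling `is_ready()` in a loop during startup is what replaces the naive "container is up" check.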

🤔 Ongoing Confusions

1. DeepSeek-VL2 Variant Naming

  • Confusion: Search results mention "16B" variant but official repo shows 3B/27B
  • Question: Are there unofficial quantizations we should consider?
  • Impact: Affects model selection strategy

2. Quantization Format Proliferation

  • Observation: Same model available in AWQ, GPTQ, GGUF, EXL2 formats
  • Question: Is it worth maintaining our own quantization pipeline?
  • Trade-off: Time investment vs waiting for community quants

3. vLLM Version Compatibility

  • Issue: Different vLLM versions support different model architectures
  • Example: DeepSeek-VL2 support seems recent, not in all versions
  • Strategy: Pin vLLM version in docker-compose for stability

📊 Performance Insights

Model Loading Times (Observed)

  • Phi-3.5-vision-instruct (4.2B): ~2-3 minutes
  • Qwen2.5-VL-7B-AWQ: ~5-10 minutes
  • Qwen2.5-VL-72B-AWQ: Failed (OOM)

Memory Usage Patterns

  • Idle: ~1-7GB across GPUs
  • Loading: Memory gradually increases
  • Serving: Stable higher usage (varies by model)
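We mostly eyeballed these patterns in `nvidia-smi`; a small parser over its CSV query output is enough to log the loading curve over time. The function names are illustrative, but the `nvidia-smi` flags are real:

```python
import subprocess


def parse_memory_csv(text: str) -> list[int]:
    """Parse nvidia-smi CSV output: one MiB value per line, one line per GPU."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def gpu_memory_used_mib() -> list[int]:
    """Query per-GPU memory usage (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_memory_csv(out)
```

Sampling this every few seconds during model load makes the "gradually increases, then plateaus" pattern visible in logs.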

🛠️ Technical Decisions Made

1. Simplified Model Progression

Decision: Start with working smaller models, scale up later
Rationale: Better to have a working system than a perfect model
Result: Phi-3.5-vision loads reliably, where the Qwen-72B attempts failed

2. Docker-Compose Over Manual Commands

Decision: Use docker-compose for vLLM deployment
Rationale: Easier configuration iteration and environment reproduction
Result: Much faster debugging of model loading issues

3. Conservative GPU Memory Allocation

Decision: Use 65% GPU memory utilization instead of 85%
Rationale: AWQ quantization + high utilization caused crashes
Result: More stable model loading

🚨 Critical Issues Encountered

1. Model Architecture Mismatches

  • Problem: Not all models work with all quantization methods
  • Example: AWQ + bfloat16 incompatibility with some models
  • Solution: Model-specific docker configurations

2. Loading Status Detection Gaps

  • Problem: Container "running" != API ready != model loaded
  • Impact: TUI startup hangs without clear feedback
  • Solution: Multi-layer status checking in ModelChecker
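What "multi-layer" means here can be sketched as collapsing the three observed states into one status for the TUI. The stage names are illustrative, not the actual ModelChecker API:

```python
from enum import Enum


class Stage(Enum):
    DOWN = 0          # container/API not reachable
    API_UP = 1        # HTTP responds, but the model isn't listed yet
    MODEL_LOADED = 2  # model appears in /v1/models -> safe to send requests


def classify(api_reachable: bool, model_listed: bool) -> Stage:
    """Collapse the three observed layers into a single status."""
    if not api_reachable:
        return Stage.DOWN
    return Stage.MODEL_LOADED if model_listed else Stage.API_UP
```

The TUI can then show "loading model..." for `API_UP` instead of hanging with no feedback.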

3. Documentation vs Reality

  • Problem: Official docs often lag behind actual capabilities
  • Example: DeepSeek-VL2 support exists but not well documented
  • Strategy: Community resources (GitHub issues, HF discussions) are often more reliable

🎯 Next Steps & Open Questions

Immediate Actions

  1. Complete Phi-3.5 testing: Verify full camera tool integration
  2. Document working configurations: Create reliable deployment recipes
  3. Test DeepSeek-VL2: Try the promising MoE architecture

Research Questions

  1. Quantization Strategy: Build our own vs wait for community?
  2. Model Selection Criteria: Performance vs reliability vs memory usage?
  3. Multi-GPU Utilization: Are we properly leveraging both GPUs?

Blog Post Opportunities

  1. "Running Multimodal AI Locally: Ollama vs vLLM Performance Comparison"
  2. "Docker-First Approach to Local AI Model Deployment"
  3. "The Hidden Costs of Large Vision-Language Models"

💡 Key Takeaways for Future Development

  1. Start Small, Scale Up: Working system beats perfect system
  2. Infrastructure First: Docker/compose saves massive debugging time
  3. Monitor Reality: Logs > status > documentation
  4. Community Knowledge: GitHub issues often more current than docs
  5. Memory Planning: Always allocate 30%+ buffer for model loading
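The 30%+ buffer rule can be made concrete with a back-of-envelope check. The bytes-per-parameter figures are rough assumptions (~0.5 for 4-bit AWQ, 2.0 for fp16) and the estimate ignores KV cache and activations, so treat it as a floor, not a promise:

```python
def vram_needed_gb(params_billion: float, bytes_per_param: float,
                   buffer: float = 0.30) -> float:
    """Rough weight-memory estimate with a loading buffer.
    Ignores KV cache and activations, so real usage is higher."""
    weights_gb = params_billion * bytes_per_param  # 1B params * 1 byte ~= 1 GB
    return weights_gb * (1.0 + buffer)
```

For example, a 72B model at ~0.5 bytes/param still wants roughly 47 GB with the buffer, while 2x24 GB at 65% utilization exposes only ~31 GB, which is consistent with the Qwen-72B OOM above.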

Status: Migration 80% complete - API verified, integration testing pending
Next Session: Complete end-to-end camera tool testing with vLLM backend