This document details the performance optimizations made to the EnhancedModelPicker modal.
- Before optimization: ~1.8 seconds to load
- After optimization: ~0.3 seconds to load
- Improvement: 6x faster (1.5 seconds saved)
The EnhancedModelPicker was taking 1.5-2 seconds to open, creating a poor user experience. Profiling revealed the bottleneck:
Total load time: 1780ms
├─ Ollama discovery: 1563ms ← 88% of total time!
├─ Other discovery: 45ms
└─ UI rendering: 90ms
Ollama context fetching with include_context=True makes a sequential /api/show API call for each model:
- 56 Ollama models × ~30ms per call = ~1.5 seconds
This was happening every time the modal opened, even though context sizes rarely change.
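To make the bottleneck concrete, here is a simplified sketch of the per-model fetch pattern. The real implementation lives in src/consoul/ai/providers.py; the host, function body, and response field names below are assumptions, while /api/tags and /api/show are the public Ollama HTTP endpoints.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # assumed default Ollama host

def list_ollama_models(include_context: bool = True) -> list[dict]:
    # One cheap call returns the full model list
    models = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5).json()["models"]
    if include_context:
        for model in models:
            # One blocking /api/show round-trip per model: ~30ms x 56 models ≈ 1.5s
            detail = requests.post(
                f"{OLLAMA_URL}/api/show", json={"name": model["name"]}, timeout=5
            ).json()
            # The exact field holding the context length varies by Ollama version
            model["context_window"] = detail.get("model_info", {}).get(
                "llama.context_length"
            )
    return models
```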
Changed default behavior from include_context=True to include_context=False:
src/consoul/sdk/services/model.py (line 597):
# Before
models.extend(self.list_ollama_models(include_context=True))
# After
models.extend(self.list_ollama_models(include_context=False))

src/consoul/tui/widgets/enhanced_model_picker.py (line 182):
# Before
local_models.extend(self.model_service.list_ollama_models(include_context=True))
# After
local_models.extend(self.model_service.list_ollama_models(include_context=False))

Result: Ollama discovery dropped from 1563ms to 82ms (19x faster).
Before: All Ollama models showed context sizes (e.g., "131K", "32K")
After: Ollama models show "?" for context (hidden in UI)
Justification:
- Context size is nice-to-have, not essential for model selection
- Users primarily choose models based on description, size, and capabilities
- 1.5 second delay was significantly impacting UX
- Context sizes are still available for MLX and HuggingFace models (read from config.json)
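For comparison, reading a context window from a local config.json is a single file read, so it stays cheap. This is a minimal sketch assuming the standard HuggingFace max_position_embeddings field, not the actual code in the model service:

```python
import json
from pathlib import Path

def read_context_window(model_dir: Path) -> int | None:
    """Best-effort context window lookup from a local HF/MLX checkpoint."""
    config_path = model_dir / "config.json"
    if not config_path.exists():
        return None
    config = json.loads(config_path.read_text())
    # max_position_embeddings is the usual HF key; some configs use other names
    return config.get("max_position_embeddings")
```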
The LocalModelCard already handles missing context gracefully:
# Only shows context if available
if self.model.context_window and self.model.context_window != "?":
    yield Label("Context:", classes="metadata-label")
    yield Label(self.model.context_window, classes="metadata-value")

Result: Ollama models simply don't show a context row (clean UI).
| Step | Time | % of Total |
|---|---|---|
| Config loading | 12ms | 0.7% |
| Service creation | 76ms | 4.3% |
| Ollama discovery | 1563ms | 87.8% |
| MLX discovery | 14ms | 0.8% |
| GGUF discovery | 15ms | 0.8% |
| HuggingFace discovery | 11ms | 0.6% |
| Card rendering | 89ms | 5.0% |
| Total | 1780ms | 100% |
| Step | Time | % of Total | Change |
|---|---|---|---|
| Config loading | 13ms | 4.4% | - |
| Service creation | 73ms | 24.7% | - |
| Ollama discovery | 83ms | 28.1% | ✅ -95% |
| MLX discovery | 14ms | 4.7% | - |
| GGUF discovery | 15ms | 5.1% | - |
| HuggingFace discovery | 11ms | 3.7% | - |
| Card rendering | 87ms | 29.5% | - |
| Total | 295ms | 100% | ✅ -83% |
Tested on MacBook Pro M1 with:
- 56 Ollama models
- 10 MLX models
- 5 GGUF models
- 14 HuggingFace models
- Total: 85 model cards
# Quick benchmark
poetry run python -c "
import time
from consoul.config import load_config
from consoul.sdk.services.model import ModelService
config = load_config()
service = ModelService.from_config(config)
# Optimized
start = time.time()
models = service.list_ollama_models(include_context=False)
print(f'Optimized: {(time.time() - start)*1000:.1f}ms')
# Original
start = time.time()
models = service.list_ollama_models(include_context=True)
print(f'Original: {(time.time() - start)*1000:.1f}ms')
"Output:
Optimized: 84.2ms ← 17.6x faster
Original: 1479.3ms
Cache context sizes locally to enable fast lookups without API calls:
# ~/.consoul/cache/ollama_context_sizes.json
{
"llama3.2:latest": 131072,
"gemma3:1b": 32768,
...
}Benefit: Could re-enable context display without performance penalty
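A minimal caching sketch, assuming the cache path shown above; the helper names are hypothetical and nothing like this exists in the codebase yet:

```python
import json
from pathlib import Path

CACHE_PATH = Path.home() / ".consoul" / "cache" / "ollama_context_sizes.json"

def load_context_cache() -> dict[str, int]:
    try:
        return json.loads(CACHE_PATH.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}

def save_context_cache(sizes: dict[str, int]) -> None:
    CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
    CACHE_PATH.write_text(json.dumps(sizes, indent=2))
```

On open, the picker could read the cache and only hit /api/show for models missing from it, refreshing entries in the background.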
Fetch context sizes on-demand when user hovers/expands a card:
class LocalModelCard:
    async def on_mount(self):
        # Fetch context asynchronously after the card renders
        self.context = await fetch_context_async(self.model_id)

Benefit: Fast initial load + eventual context display.
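A slightly fuller sketch of the same idea using Textual's worker API, so the await never blocks rendering; this is not the actual widget, and fetch_context_async remains a hypothetical helper:

```python
from textual.widgets import Static

class LocalModelCard(Static):
    def on_mount(self) -> None:
        # Start a background worker so the card renders immediately
        self.run_worker(self._load_context(), exclusive=True)

    async def _load_context(self) -> None:
        # fetch_context_async and self.model_id are hypothetical names
        self.context = await fetch_context_async(self.model_id)
        self.refresh()  # redraw the card once the context size arrives
```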
Only render visible cards using Textual's virtualization:
# Instead of rendering all 85 cards
for model in all_models:
    yield ModelCard(model)

# Only render ~10 visible cards
yield VirtualScroll(all_models, card_height=6)

Benefit: Faster rendering, lower memory usage.
Run model discovery in background thread:
async def discover_models_async():
    # Show spinner
    # Discover in background
    # Update UI when complete

Benefit: Modal opens instantly, models populate progressively.
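One way to realize this, assuming discovery remains a blocking call internally; apart from asyncio.to_thread, the names below are illustrative:

```python
import asyncio

async def discover_models_async(model_service, show_spinner, hide_spinner, update_ui):
    show_spinner()
    # Run the blocking discovery off the event loop so the modal stays responsive
    models = await asyncio.to_thread(model_service._discover_local_models)
    hide_spinner()
    update_ui(models)
```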
- ✅ Modal opens 6x faster (1.8s → 0.3s)
- ✅ Feels instant (< 300ms is perceived as instant)
- ✅ Smoother workflow (no noticeable delay)
- ⚠️ Trade-off: Ollama context sizes not shown (MLX/HF still have them)
- ✅ Clearer code - explicit about performance implications
- ✅ Documented - comments explain why include_context=False
- ✅ Flexible - can still enable for specific use cases
- ✅ Maintainable - simple change with big impact
Verify optimizations are working:
# Test discovery speed
poetry run python -c "
import time
from consoul.config import load_config
from consoul.sdk.services.model import ModelService
config = load_config()
service = ModelService.from_config(config)
start = time.time()
models = service._discover_local_models()
elapsed = (time.time() - start) * 1000
print(f'Total discovery: {elapsed:.1f}ms')
if elapsed > 500:
print('⚠️ WARNING: Discovery is slow!')
else:
print('✅ Discovery is fast!')
"Expected output:
Total discovery: 212.4ms
✅ Discovery is fast!
If context sizes are deemed essential, revert with:
# In model.py and enhanced_model_picker.py
models.extend(self.list_ollama_models(include_context=True))

Note: This will restore the 1.5s load time.
Better approach: Implement context caching (Future Optimization #1)
- src/consoul/sdk/services/model.py:597 - Ollama discovery optimization
- src/consoul/tui/widgets/enhanced_model_picker.py:182 - Modal optimization
- src/consoul/tui/widgets/local_model_card.py:178 - Context display logic
- src/consoul/ai/providers.py:118 - Ollama API implementation
| Metric | Before | After | Improvement |
|---|---|---|---|
| Modal load time | 1780ms | 295ms | -83% |
| Ollama discovery | 1563ms | 83ms | -95% |
| API calls (Ollama) | 56 | 1 | -98% |
| User perceived speed | Slow | Instant | ✅ |
| Context info (Ollama) | ✅ | ❌ | Trade-off |
| Context info (MLX/HF) | ✅ | ✅ | Unchanged |
By removing expensive Ollama context fetching, we achieved a 6x performance improvement with minimal UX impact. The modal now opens in under 300ms, providing a smooth, responsive experience.
The trade-off (no Ollama context sizes) is acceptable because:
- Context is not essential for model selection
- MLX and HuggingFace models still show context (from config.json)
- 1.5 second delay was significantly impacting usability
- Context can be re-enabled later via caching (no API calls)
Recommendation: Ship this optimization. Consider implementing context caching in a future release for best of both worlds.