Version: 1.0 MVP
Date: December 28, 2024
Author: Christian
Status: Planning Phase - Ready for Implementation
License: MIT
Buddy is a proactive AI desktop assistant for Windows 11 that enables natural voice interaction with your computer while maintaining awareness of your work context through intelligent screenshot analysis. Unlike reactive chatbots, Buddy observes your workflow and offers help proactively when patterns suggest you're stuck or could benefit from assistance.
- Local-first architecture with optional cloud enhancement
- Proactive assistance rather than reactive queries
- Voice-controlled for minimal workflow interruption
- Privacy-configurable to user preference
- Provider-agnostic - not locked to any AI service
- Talk to your desktop naturally while working
- Never lose context when switching tasks or interrupted
- Get help before you realize you need it (proactive)
- Control computer hands-free while typing
- Privacy-respecting with local-first processing
Desktop Computing Limitations:
- No natural language interface for desktop
- Constant context switching to AI tools breaks flow
- Must manually provide context every time
Voice Assistant Disconnect:
- Alexa/Google Home have no screen awareness
- Not useful for actual work tasks
- Disconnected from desktop applications
Context Loss:
- Mental context lost when switching apps
- Interruptions cause complete context loss
- Hard to resume where you left off
- No system tracks your work state
Reactive vs Proactive:
- All AI tools wait for explicit queries
- No pattern detection (stuck, repetitive actions)
- Missed opportunities for proactive help
Knowledge workers who spend 6+ hours a day at a desktop: developers, writers, and researchers who value efficiency and want AI help without breaking flow.
Two-Tier Architecture:
Frontend (WinUI3/C#):
- System tray integration
- Toast notifications
- Chat panel
- Settings interface
Backend (Python):
- Flask REST API (localhost:5000)
- Screenshot capture service
- AI provider management
- Voice input/output
- Context management
- Pattern detection
- System control
Communication: REST API (JSON over HTTP localhost only)
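To make the communication boundary concrete, here is a minimal sketch of the localhost-only Flask API, assuming the endpoint shapes listed in the API reference below; the handler bodies are illustrative stubs, not the actual implementation.

```python
# Minimal sketch of the localhost-only REST boundary (Flask).
# Endpoint paths match the API reference; handler logic is illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/status", methods=["GET"])
def status():
    # Health check polled by the WinUI3 frontend.
    return jsonify({"status": "ok"})

@app.route("/api/query", methods=["POST"])
def query():
    payload = request.get_json(force=True)
    # Route the user's question to the AI provider layer (not shown).
    return jsonify({"answer": f"Received: {payload.get('query', '')}"})

if __name__ == "__main__":
    # Bind to 127.0.0.1 only, per the localhost-only design.
    app.run(host="127.0.0.1", port=5000)
```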
Local-First: Primary processing on-device, cloud optional
Modular Providers: Abstract interfaces, easy to add new AI services
Privacy by Design: User controls everything
Resilient: Graceful degradation, no single point of failure
Performance-Conscious: Minimal idle resource usage
Purpose: Capture visual context of desktop activity
Key Features:
- Multi-monitor support with per-monitor config
- Configurable frequency (default 30 seconds)
- JPEG compression (target <500KB per image)
- Active window detection
- Privacy filtering (app blacklist, monitor exclusion)
Capture Modes:
- Active monitor only (default)
- All monitors
- Selective monitors
- Window-specific
Privacy Filtering:
- Application blacklist (KeePass, banking apps, etc.)
- Monitor exclusion list
- Time-based pausing
- Manual hotkey pause (Win+Shift+P)
Technical:
- Library: `mss` (Python) for screenshots
- Format: JPEG, quality 85%
- Max resolution: 1920x1080 per monitor
- Change detection via perceptual hashing
- Background thread capture
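The capture loop can be sketched with the libraries named above (`mss`, `Pillow`, `imagehash`). The hash-distance threshold below is an assumed tuning value, not a spec requirement:

```python
# Sketch of one capture cycle: grab the primary monitor with mss, downscale,
# and skip AI analysis when the perceptual hash barely changed.
import io

import imagehash
import mss
from PIL import Image

HASH_DISTANCE_THRESHOLD = 5  # assumed: hamming distance below this = "no change"

def capture_and_compare(previous_hash):
    with mss.mss() as sct:
        raw = sct.grab(sct.monitors[1])  # primary monitor
    img = Image.frombytes("RGB", raw.size, raw.rgb)
    img.thumbnail((1920, 1080))  # cap resolution per spec

    current_hash = imagehash.phash(img)
    if previous_hash is not None and current_hash - previous_hash < HASH_DISTANCE_THRESHOLD:
        return None, current_hash  # screen effectively unchanged; skip analysis

    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=85)  # target <500KB per image
    return buf.getvalue(), current_hash
```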
Purpose: Maintain intelligent compressed representation of work history
Three-Tier Storage:
Immediate Context (High Fidelity):
- Last 2-3 screenshots with full images
- Timespan: ~5 minutes
- For: Real-time queries, proactive notifications
Recent Context (Medium Fidelity):
- Last 10-20 screenshots as summaries
- Timespan: 30-60 minutes
- For: Session continuity, task resumption
Session Context (Low Fidelity):
- High-level narrative of work session
- Timespan: Hours to days
- For: Long-term patterns, summaries
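A minimal sketch of the three tiers as plain data structures; the field names and demotion logic are assumptions chosen to illustrate how full images decay into summaries:

```python
# Sketch of the three-tier context store. When a screenshot ages out of the
# immediate tier it is summarized and demoted; pixels are dropped, text kept.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    timestamp: float
    jpeg_bytes: bytes      # full image, immediate tier only
    summary: str = ""      # filled in when demoted to the recent tier

@dataclass
class ContextStore:
    immediate: deque = field(default_factory=lambda: deque(maxlen=3))  # full images, ~5 min
    recent: deque = field(default_factory=lambda: deque(maxlen=20))    # summaries, 30-60 min
    session_narrative: str = ""                                        # hours to days

    def add(self, snap: Snapshot, summarize) -> None:
        if len(self.immediate) == self.immediate.maxlen:
            oldest = self.immediate[0]
            oldest.summary = summarize(oldest.jpeg_bytes)  # compress before demotion
            oldest.jpeg_bytes = b""
            self.recent.append(oldest)
        self.immediate.append(snap)  # deque evicts the demoted entry automatically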
Pattern Detection:
- Stuck detection (same screen 15+ min)
- Context switch detection (app/task changes)
- Repetitive action detection
- Research pattern detection
- Error pattern detection
Purpose: Abstract interface to multiple AI services with fallback chains
Supported Providers:
Vision Analysis:
- DeepSeek-VL (local, free, good quality)
- Claude Sonnet 4 (cloud, excellent quality, ~$0.015/image)
- OpenAI GPT-4o Vision (cloud alternative)
- LLaVA (local alternative)
Text-to-Speech:
- Piper (local, free, good quality)
- Coqui TTS (local alternative)
- ElevenLabs (cloud, excellent, $5-22/month)
- Azure Neural TTS (cloud, cost-effective)
Speech-to-Text:
- Whisper (local, free, excellent)
- Azure Speech (cloud, real-time capable)
- Deepgram (cloud, best cost/performance)
Provider Selection:
- Priority-based with fallback chains
- Quota tracking and automatic switching
- Quality vs cost optimization
- User-configurable preferences
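A sketch of the priority-based fallback chain, assuming every provider exposes a common callable interface; the error class and quota handling are illustrative, not the project's actual types:

```python
# Sketch of priority-based provider selection with fallback. Providers are
# sorted so lower priority numbers (e.g. free/local options) are tried first.
class ProviderError(Exception):
    pass

class ProviderRegistry:
    def __init__(self, providers):
        # providers: list of (name, priority, callable) built from config
        self.providers = sorted(providers, key=lambda p: p[1])

    def run(self, *args, **kwargs):
        errors = []
        for name, _, call in self.providers:
            try:
                return call(*args, **kwargs)  # first success wins
            except ProviderError as exc:      # missing key, quota, rate limit...
                errors.append(f"{name}: {exc}")
        raise ProviderError("All providers failed: " + "; ".join(errors))
```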
Purpose: Hands-free voice command capture and transcription
Features:
- Wake word detection (optional, using Porcupine)
- Push-to-talk mode (hotkey: Ctrl+Space)
- Speech-to-text via Whisper (local)
- Audio source filtering (avoid YouTube triggers)
- Noise reduction and enhancement
Voice Activity Detection:
- Trim silence from recordings
- Detect when actually speaking
- Improve transcription accuracy
Performance:
- Whisper base model: 1-2s latency
- 16kHz sample rate
- GPU acceleration if available
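A push-to-talk sketch using `sounddevice` and local Whisper; the fixed recording duration and model size are assumed defaults, and the real service would use voice activity detection to trim silence instead:

```python
# Record a few seconds at 16 kHz and transcribe with a local Whisper model.
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000  # matches Whisper's expected input rate

def record_and_transcribe(seconds: float = 5.0, model_name: str = "base") -> str:
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    model = whisper.load_model(model_name)  # uses GPU automatically if available
    result = model.transcribe(audio.flatten())
    return result["text"].strip()
```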
Purpose: Natural-sounding voice responses
Features:
- Multiple TTS provider support
- Provider fallback chain
- Voice personality configuration
- Response queuing (prevent overlap)
- Volume and speech rate control
Primary: Piper (local, fast, unlimited)
Fallback: ElevenLabs or Azure for higher quality
Response Queuing:
- High priority (user queries) can interrupt
- Medium/low priority queued
- Prevent overlapping audio
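The queuing behavior can be sketched with a priority heap; `speak` stands in for the Piper/ElevenLabs synthesizer call, and the interrupt mechanism is an assumption about how the playback loop would cooperate:

```python
# Sketch of the response queue: lower number = higher priority, and a
# high-priority user reply signals the playback loop to stop current audio.
import heapq
import itertools
import threading

HIGH, MEDIUM, LOW = 0, 1, 2

class ResponseQueue:
    def __init__(self, speak):
        self._heap, self._lock = [], threading.Lock()
        self._counter = itertools.count()  # stable FIFO order within a priority
        self._speak = speak
        self.interrupt_requested = threading.Event()

    def enqueue(self, text: str, priority: int = MEDIUM) -> None:
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), text))
        if priority == HIGH:
            self.interrupt_requested.set()  # playback loop checks this flag

    def pop(self):
        with self._lock:
            return heapq.heappop(self._heap)[2] if self._heap else None
```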
Purpose: Decide when to speak up vs stay quiet
Detection Patterns:
Stuck Detection:
- Same screen >15 minutes
- Same error message repeatedly
- No progress visible
- Notification: "You've been stuck on that error for 20 minutes. Want help?"
Context Switch Detection:
- Major app/task change
- Offer to remember previous context
- Remind when returning
Repetitive Action Detection:
- Same action 3+ times
- Suggest automation
- Offer shortcuts
Research Pattern Detection:
- Multiple sources on same topic
- Offer to synthesize findings
- Create reference document
Rate Limiting:
- Max 2 notifications per hour
- Min 15 minutes between notifications
- Learn from user feedback
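A sketch combining stuck detection with the rate limiter; the thresholds (15 minutes stuck, max 2 notifications/hour, 15 minutes between) come from the spec, while the state tracking is illustrative:

```python
# Gate that fires a "stuck" notification only when both the stuck threshold
# and the rate limits allow it.
import time

class ProactivityGate:
    def __init__(self, max_per_hour=2, min_gap_minutes=15, stuck_minutes=15):
        self.sent = []  # timestamps of past notifications
        self.max_per_hour = max_per_hour
        self.min_gap = min_gap_minutes * 60
        self.stuck_threshold = stuck_minutes * 60
        self.same_screen_since = None

    def observe(self, screen_changed: bool) -> bool:
        now = time.time()
        if screen_changed or self.same_screen_since is None:
            self.same_screen_since = now
            return False
        stuck = now - self.same_screen_since >= self.stuck_threshold
        return stuck and self._allowed(now)

    def _allowed(self, now: float) -> bool:
        self.sent = [t for t in self.sent if now - t < 3600]
        if len(self.sent) >= self.max_per_hour:
            return False
        if self.sent and now - self.sent[-1] < self.min_gap:
            return False
        self.sent.append(now)
        return True
```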
Purpose: Voice-controlled system operations
Commands:
- Window management: maximize, minimize, close
- App launching: "open Chrome", "open VS Code"
- Web navigation: "open Reddit", "search for X"
- System functions: mute, volume, shutdown, sleep
- Screen control: screenshot, brightness
Safety:
- Confirmations for destructive operations
- Rate limiting on repeated commands
- Command logging for audit
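A minimal dispatch sketch showing the confirmation gate for destructive operations; the command names mirror the list above, but the handler set and confirmation flow are assumptions:

```python
# Command dispatch with a confirmation gate for destructive operations.
DESTRUCTIVE = {"shutdown", "sleep", "close_window"}

def execute(command: str, target: str = "", confirm=input) -> str:
    if command in DESTRUCTIVE:
        answer = confirm(f"Really run '{command}'? (y/n) ")
        if answer.strip().lower() != "y":
            return "cancelled"
    handlers = {
        "open_application": lambda: f"launching {target}",  # e.g. via os.startfile
        "mute": lambda: "muted",
        "screenshot": lambda: "captured",
    }
    handler = handlers.get(command)
    return handler() if handler else f"unknown command: {command}"
```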
Complete system configuration in single JSON file:
{
  "privacy": {
    "mode": "balanced",
    "screenshot": {
      "capture_active_only": true,
      "exclude_monitors": [],
      "excluded_apps": ["KeePass.exe", "*Banking*"],
      "capture_frequency_seconds": 30
    },
    "retention": {
      "screenshot_retention_hours": 24,
      "summary_retention_days": 30
    }
  },
  "vision": {
    "providers": [
      {"name": "deepseek_local", "priority": 1, "enabled": true},
      {"name": "claude", "priority": 2, "enabled": true}
    ]
  },
  "voice_input": {
    "mode": "wake_word",
    "wake_word": {"enabled": true, "keyword": "buddy"}
  },
  "voice_output": {
    "default_voice": {
      "provider": "piper_local",
      "voice": "en_US-amy-medium",
      "style": "neutral"
    },
    "providers": [
      {"name": "piper_local", "priority": 1}
    ]
  },
  "proactivity": {
    "stuck_detection": {"threshold_minutes": 15},
    "rate_limiting": {"max_notifications_per_hour": 2}
  }
}
- Voice wake word – the backend ignores transcripts that do not start with “Buddy…”, ensuring hands-free commands remain intentional.
- Custom voice output – adjust `voice_output.default_voice` to point at Piper, ElevenLabs, or other providers (voice/styling) once the TTS integrations are added; the configuration already propagates to the placeholder synthesizer.
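A minimal sketch of the wake-word transcript gate described in the first bullet, assuming the check runs on raw Whisper output before command parsing; the helper name is hypothetical:

```python
# Drop any transcript not addressed to "buddy" before command parsing.
def extract_command(transcript: str) -> str | None:
    text = transcript.strip()
    if not text.lower().startswith("buddy"):
        return None  # not addressed to the assistant; ignore
    return text[len("buddy"):].lstrip(" ,.!?")
```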
Buddy automatically sorts LLM providers so zero-cost/local options (e.g., DeepSeek-VL, Piper) are tried first. If a provider fails (missing API key, payment required, rate limits, etc.) the registry seamlessly falls back to the next available option without dropping the user request.
- Use `python backend/download_models.py --list` to view required local models (DeepSeek-VL for screenshots, Whisper Tiny for STT, etc.).
- Run `python backend/download_models.py deepseek-vl-lite` before enabling fully-local vision analysis. Add `--offline` to create placeholder folders when downloading manually.
- Run `./install.sh` (macOS/Linux/WSL) to provision the Python virtualenv, install backend deps, restore local models, and pre-restore the WinUI3 frontend project. The script will remind you to install the Windows .NET tooling if it is missing.
- Windows developers can run the same script inside WSL or execute the individual steps manually (`pip install -r backend/requirements.txt`, `python backend/download_models.py ...`, `dotnet restore frontend/BuddyApp/BuddyApp.csproj`).
Output Target: Buddy ships as a WinUI3 desktop application for Windows 11. The backend runs locally (Python) while the frontend builds into a packaged WinUI3 `.msix` app. Keep this in mind when planning deployment or installer work.
- Run `pytest backend/tests tests` from an activated virtual environment to execute backend unit tests plus the new high-level wake-word/provider tests under `tests/`.
Paranoid: Everything local, nothing to cloud
Balanced: Local first, cloud for complex queries (default)
Permissive: Cloud first for quality, local fallback
- Framework: WinUI3 (C# / .NET 8)
- UI: Native Windows 11 controls
- Notifications: Windows Toast API
- Language: Python 3.11+
- Web Framework: Flask
- Database: SQLite
- Async: Threading for background tasks
- Vision: DeepSeek-VL (local), Claude API (cloud)
- STT: Whisper (local)
- TTS: Piper (local), ElevenLabs (cloud)
- Wake Word: Picovoice Porcupine
- Screenshot: `mss`
- Audio: `sounddevice`
- Image: `Pillow`
- Windows APIs: `pywin32`
- Perceptual hash: `imagehash`
Core Features:
- Screenshot capture (multi-monitor configurable)
- Basic context (last 3-5 screenshots)
- Vision analysis (DeepSeek OR Claude)
- Voice input (Whisper, push-to-talk)
- Voice output (Piper)
- Simple proactive notifications (stuck detection)
- Basic system commands
- WinUI3 interface (tray, toasts, chat)
- JSON configuration
- Privacy controls
Deliverables:
- Working application
- 2-minute demo video
- GitHub README
Success:
- Runs stably 1+ hour
- Voice commands work
- Can detect stuck pattern
Enhancements:
- Multiple provider support with fallback
- Improved context compression
- Wake word detection
- More proactive patterns
- Extended system commands
- Settings UI
- Usage dashboard
- Installer
Advanced:
- Pattern learning from feedback
- Research synthesis
- Custom commands
- Browser extension integration
- Productivity analytics
Major Features:
- Vector database for semantic search
- Long-term memory (months/years)
- VS Code extension
- Multi-language support
- Team features (optional)
- Mobile companion app
Idle:
- RAM: <100MB
- CPU: <1%
Active:
- RAM: <500MB
- CPU: 10-25%
- Voice command to action: <3 seconds
- Wake word to listening: <500ms
- Screenshot to analysis: <5 seconds
- UI interaction: <100ms
- Installation: ~2GB (with models)
- Daily usage: ~50MB
- Monthly: ~1.5GB
- Local-first processing (default)
- User controls all data transmission
- No telemetry without consent
- Clear visibility into behavior
- Easy pause/disable
- API keys in Windows Credential Manager
- Database encryption (optional)
- HTTPS for all cloud calls
- Local API localhost-only
- Screenshot secure deletion
- GDPR: Export/delete all data
- Clear retention policies
- Audit trail of API calls
- Opt-in for cloud processing
GET /api/status – System health check
POST /api/query – Send user query
{
  "query": "What was I working on?",
  "include_screenshot": true
}
GET /api/notifications – Poll for proactive notifications
POST /api/control – Execute system command
{
  "command": "open_application",
  "target": "chrome"
}
GET /api/context – Retrieve context summary
POST /api/feedback – Submit feedback on notification
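An example client call against the local API, assuming the backend is running on localhost:5000 as configured:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/query",
    json={"query": "What was I working on?", "include_screenshot": True},
    timeout=30,
)
print(resp.json())
```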
- Screenshot capture
- Context compression
- Pattern detection
- Provider selection
- Command parsing
- End-to-end voice query
- Proactive notification flow
- Provider fallback
- Context persistence
- Response latency
- Memory usage over time
- Database query performance
- 10-20 beta users
- 2-week testing period
- Weekly feedback surveys
Minimum:
- Windows 11 (64-bit)
- 8GB RAM
- 5GB disk space
- Microphone, speakers
Recommended:
- 16GB RAM
- NVIDIA GPU (4GB+ VRAM)
- 10GB disk space
Installer Method:
- Download BuddySetup.exe
- Run installer
- Configure privacy mode
- Launch Buddy
From Source:
- Clone repository
- Install dependencies: `pip install -r requirements.txt`
- Download models: `python download_models.py`
- Build frontend: `cd frontend && dotnet build`
- Run backend: `python backend/main.py`
- Run frontend: `dotnet run --project frontend`
Transient: Network timeouts, rate limits
- Retry with backoff, fall back to alternative
Configuration: Invalid API keys, malformed config
- Fall back to defaults, notify user
Resource: Out of disk, out of memory
- Graceful degradation, cleanup old data
User: Unrecognized command, ambiguous request
- Clear error message, suggest corrections
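A sketch of the transient-error policy (retry with backoff, then let the caller fall back to an alternative provider); the retry count and delays are assumed defaults, not spec values:

```python
# Exponential backoff for transient failures such as timeouts and rate limits.
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # caller falls back to an alternative provider
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s...
```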
- Automatic service restart on crash
- Database backup before operations
- Resume from last known state
- Export data as last resort
- Runs stably 1+ hour
- <5s response latency
- <500MB RAM
- 90%+ voice accuracy
- Runs stably 8+ hours
- <3s response latency
- Provider fallback 100%
- No crashes in normal use
- 1000+ active users (Year 1)
- 70%+ weekly retention
- Users save 30+ min/day
- NPS >7
Buddy fundamentally changes how knowledge workers interact with desktops. By combining proactive AI, voice interaction, and context awareness, Buddy eliminates context-switching friction and enables hands-free productivity.
The local-first architecture respects privacy while provider-agnostic design prevents vendor lock-in. Users optimize for their priorities: privacy, quality, or cost.
This specification provides a complete blueprint for building Buddy, with sufficient detail to guide implementation while remaining flexible for learnings during development.
- Set up development environment
- Build MVP core features (weeks 1-2)
- Test with real usage
- Iterate based on feedback
- Launch beta program
- Polish for V1.0 release
The goal is shipping a useful product that solves a real problem. Buddy will evolve based on real user needs, starting with the core vision: natural conversation with your desktop while maintaining work context awareness.
Document Status: Planning Complete - Ready for Implementation
Last Updated: December 28, 2024