
Buddy is a local-first, WinUI 3 desktop assistant for Windows 11 that listens for “Buddy…” voice commands, watches your workflow via privacy-aware screenshots, and offers proactive help before you even ask. The Python backend captures context and routes commands through a smart provider stack (free on-device models first, falling back to paid cloud providers when needed).


cschladetsch/PyAiBuddyWin11


Buddy

Version: 1.0 MVP
Date: December 28, 2024
Author: Christian
Status: Planning Phase - Ready for Implementation
License: MIT


Executive Summary

Buddy is a proactive AI desktop assistant for Windows 11 that enables natural voice interaction with your computer while maintaining awareness of your work context through intelligent screenshot analysis. Unlike reactive chatbots, Buddy observes your workflow and offers help proactively when patterns suggest you're stuck or could benefit from assistance.

Core Philosophy

  • Local-first architecture with optional cloud enhancement
  • Proactive assistance rather than reactive queries
  • Voice-controlled for minimal workflow interruption
  • Privacy-configurable to user preference
  • Provider-agnostic - not locked to any AI service

Key Value Propositions

  1. Talk to your desktop naturally while working
  2. Never lose context when switching tasks or interrupted
  3. Get help before you realize you need it (proactive)
  4. Control computer hands-free while typing
  5. Privacy-respecting with local-first processing

Problem Statement

Current Pain Points

Desktop Computing Limitations:

  • No natural language interface for desktop
  • Constant context switching to AI tools breaks flow
  • Must manually provide context every time

Voice Assistant Disconnect:

  • Alexa/Google Home have no screen awareness
  • Not useful for actual work tasks
  • Disconnected from desktop applications

Context Loss:

  • Mental context lost when switching apps
  • Interruptions cause complete context loss
  • Hard to resume where you left off
  • No system tracks your work state

Reactive vs Proactive:

  • All AI tools wait for explicit queries
  • No pattern detection (stuck, repetitive actions)
  • Missed opportunities for proactive help

Target User

Knowledge workers spending 6+ hours daily at desktop. Developers, writers, researchers who value efficiency and want AI help without breaking flow.


Solution Architecture

System Design

Two-Tier Architecture:

Frontend (WinUI3/C#):

  • System tray integration
  • Toast notifications
  • Chat panel
  • Settings interface

Backend (Python):

  • Flask REST API (localhost:5000)
  • Screenshot capture service
  • AI provider management
  • Voice input/output
  • Context management
  • Pattern detection
  • System control

Communication: REST API (JSON over HTTP localhost only)

Architectural Principles

  • Local-First: Primary processing on-device, cloud optional
  • Modular Providers: Abstract interfaces, easy to add new AI services
  • Privacy by Design: User controls everything
  • Resilient: Graceful degradation, no single point of failure
  • Performance-Conscious: Minimal idle resource usage


Core Components

1. Screenshot Service

Purpose: Capture visual context of desktop activity

Key Features:

  • Multi-monitor support with per-monitor config
  • Configurable frequency (default 30 seconds)
  • JPEG compression (target <500KB per image)
  • Active window detection
  • Privacy filtering (app blacklist, monitor exclusion)

Capture Modes:

  • Active monitor only (default)
  • All monitors
  • Selective monitors
  • Window-specific

Privacy Filtering:

  • Application blacklist (KeePass, banking apps, etc.)
  • Monitor exclusion list
  • Time-based pausing
  • Manual hotkey pause (Win+Shift+P)
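The application blacklist accepts wildcard patterns such as *Banking*. A minimal sketch of the matching step, assuming Python's fnmatch for pattern matching (the actual implementation may differ):

```python
import fnmatch

def is_capture_blocked(process_name: str, excluded_apps: list) -> bool:
    """True if the foreground process matches any blacklist pattern,
    in which case the screenshot service skips this capture."""
    name = process_name.lower()
    return any(fnmatch.fnmatch(name, pattern.lower()) for pattern in excluded_apps)

# Patterns mirror the excluded_apps entries in config.json
blacklist = ["KeePass.exe", "*Banking*"]
```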

Technical:

  • Library: mss (Python) for screenshots
  • Format: JPEG quality 85%
  • Max resolution: 1920x1080 per monitor
  • Change detection via perceptual hashing
  • Background thread capture
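Change detection compares consecutive perceptual hashes and skips frames that barely differ from the previous one. A sketch of the comparison, treating hash values (e.g. as produced by imagehash.phash) as plain integers; the threshold is illustrative:

```python
def hamming_distance(h1: int, h2: int) -> int:
    """Number of differing bits between two perceptual hash values."""
    return bin(h1 ^ h2).count("1")

def frame_changed(prev_hash, new_hash, threshold: int = 5) -> bool:
    """True if the new frame differs enough from the previous one
    to be worth storing and analyzing."""
    return prev_hash is None or hamming_distance(prev_hash, new_hash) > threshold
```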

2. Context Manager

Purpose: Maintain intelligent compressed representation of work history

Three-Tier Storage:

Immediate Context (High Fidelity):

  • Last 2-3 screenshots with full images
  • Timespan: ~5 minutes
  • For: Real-time queries, proactive notifications

Recent Context (Medium Fidelity):

  • Last 10-20 screenshots as summaries
  • Timespan: 30-60 minutes
  • For: Session continuity, task resumption

Session Context (Low Fidelity):

  • High-level narrative of work session
  • Timespan: Hours to days
  • For: Long-term patterns, summaries
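The tier rollover above can be sketched as follows, assuming a summarize callback that condenses a full screenshot into text (class name and sizes are illustrative):

```python
from collections import deque

class ContextTiers:
    """Screenshots roll from the immediate tier into text summaries,
    mirroring the three fidelity levels described above."""
    def __init__(self, summarize):
        self.summarize = summarize
        self.immediate = deque(maxlen=3)   # full screenshots, ~5 min
        self.recent = deque(maxlen=20)     # text summaries, 30-60 min
        self.session = []                  # long-term narrative entries

    def add_screenshot(self, shot):
        if len(self.immediate) == self.immediate.maxlen:
            # Demote the oldest full screenshot to a summary before it rolls off
            self.recent.append(self.summarize(self.immediate[0]))
        self.immediate.append(shot)
```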

Pattern Detection:

  • Stuck detection (same screen 15+ min)
  • Context switch detection (app/task changes)
  • Repetitive action detection
  • Research pattern detection
  • Error pattern detection

3. AI Provider Manager

Purpose: Abstract interface to multiple AI services with fallback chains

Supported Providers:

Vision Analysis:

  • DeepSeek-VL (local, free, good quality)
  • Claude Sonnet 4 (cloud, excellent quality, ~$0.015/image)
  • OpenAI GPT-4o Vision (cloud alternative)
  • LLaVA (local alternative)

Text-to-Speech:

  • Piper (local, free, good quality)
  • Coqui TTS (local alternative)
  • ElevenLabs (cloud, excellent, $5-22/month)
  • Azure Neural TTS (cloud, cost-effective)

Speech-to-Text:

  • Whisper (local, free, excellent)
  • Azure Speech (cloud, real-time capable)
  • Deepgram (cloud, best cost/performance)

Provider Selection:

  • Priority-based with fallback chains
  • Quota tracking and automatic switching
  • Quality vs cost optimization
  • User-configurable preferences
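The priority-based fallback chain can be sketched like this; provider entries follow the config.json shape, and the call field is a hypothetical invocation hook:

```python
def run_with_fallback(providers, request):
    """Try enabled providers in priority order; any failure (missing API
    key, quota exhausted, network error) falls through to the next one."""
    enabled = [p for p in providers if p.get("enabled", True)]
    for provider in sorted(enabled, key=lambda p: p["priority"]):
        try:
            return provider["call"](request)
        except Exception:
            continue  # log and try the next provider in the chain
    raise RuntimeError("all providers failed for request")
```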

4. Voice Input Service

Purpose: Hands-free voice command capture and transcription

Features:

  • Wake word detection (optional, using Porcupine)
  • Push-to-talk mode (hotkey: Ctrl+Space)
  • Speech-to-text via Whisper (local)
  • Audio source filtering (avoid YouTube triggers)
  • Noise reduction and enhancement

Voice Activity Detection:

  • Trim silence from recordings
  • Detect when actually speaking
  • Improve transcription accuracy
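Silence trimming can be sketched as dropping leading and trailing samples below an amplitude threshold (real voice activity detection is more sophisticated; samples here are normalized floats):

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples quieter than the threshold,
    so the transcriber only sees the spoken portion of the recording."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```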

Performance:

  • Whisper base model: 1-2s latency
  • 16kHz sample rate
  • GPU acceleration if available

5. Voice Output Service

Purpose: Natural-sounding voice responses

Features:

  • Multiple TTS provider support
  • Provider fallback chain
  • Voice personality configuration
  • Response queuing (prevent overlap)
  • Volume and speech rate control

Primary: Piper (local, fast, unlimited)
Fallback: ElevenLabs or Azure for higher quality

Response Queuing:

  • High priority (user queries) can interrupt
  • Medium/low priority queued
  • Prevent overlapping audio
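The queue behind this can be sketched as a priority heap; a real implementation would also stop in-progress playback when a high-priority response arrives:

```python
import heapq
import itertools

HIGH, MEDIUM, LOW = 0, 1, 2

class ResponseQueue:
    """Queued TTS responses, popped highest-priority first; equal
    priorities play in arrival order, preventing overlapping audio."""
    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO within a priority

    def push(self, priority, text):
        heapq.heappush(self._heap, (priority, next(self._order), text))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```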

6. Proactivity Engine

Purpose: Decide when to speak up vs stay quiet

Detection Patterns:

Stuck Detection:

  • Same screen >15 minutes
  • Same error message repeatedly
  • No progress visible
  • Notification: "You've been stuck on that error for 20 minutes. Want help?"

Context Switch Detection:

  • Major app/task change
  • Offer to remember previous context
  • Remind when returning

Repetitive Action Detection:

  • Same action 3+ times
  • Suggest automation
  • Offer shortcuts

Research Pattern Detection:

  • Multiple sources on same topic
  • Offer to synthesize findings
  • Create reference document

Rate Limiting:

  • Max 2 notifications per hour
  • Min 15 minutes between notifications
  • Learn from user feedback
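The two limits combine as in this sketch; timestamps are passed in explicitly to keep it testable, whereas a real service would read the clock:

```python
class NotificationLimiter:
    """Enforces both limits above: a per-hour cap and a minimum
    gap between consecutive proactive notifications."""
    def __init__(self, max_per_hour=2, min_gap_minutes=15):
        self.max_per_hour = max_per_hour
        self.min_gap_seconds = min_gap_minutes * 60
        self.sent = []  # timestamps (seconds) of delivered notifications

    def allow(self, now: float) -> bool:
        self.sent = [t for t in self.sent if now - t < 3600]
        if len(self.sent) >= self.max_per_hour:
            return False
        if self.sent and now - self.sent[-1] < self.min_gap_seconds:
            return False
        self.sent.append(now)
        return True
```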

7. System Control Service

Purpose: Voice-controlled system operations

Commands:

  • Window management: maximize, minimize, close
  • App launching: "open Chrome", "open VS Code"
  • Web navigation: "open Reddit", "search for X"
  • System functions: mute, volume, shutdown, sleep
  • Screen control: screenshot, brightness

Safety:

  • Confirmations for destructive operations
  • Rate limiting on repeated commands
  • Command logging for audit

Configuration System

config.json Structure

Complete system configuration in single JSON file:

{
  "privacy": {
    "mode": "balanced",
    "screenshot": {
      "capture_active_only": true,
      "exclude_monitors": [],
      "excluded_apps": ["KeePass.exe", "*Banking*"],
      "capture_frequency_seconds": 30
    },
    "retention": {
      "screenshot_retention_hours": 24,
      "summary_retention_days": 30
    }
  },
  "vision": {
    "providers": [
      {"name": "deepseek_local", "priority": 1, "enabled": true},
      {"name": "claude", "priority": 2, "enabled": true}
    ]
  },
  "voice_input": {
    "mode": "wake_word",
    "wake_word": {"enabled": true, "keyword": "buddy"}
  },
  "voice_output": {
    "default_voice": {
      "provider": "piper_local",
      "voice": "en_US-amy-medium",
      "style": "neutral"
    },
    "providers": [
      {"name": "piper_local", "priority": 1}
    ]
  },
  "proactivity": {
    "stuck_detection": {"threshold_minutes": 15},
    "rate_limiting": {"max_notifications_per_hour": 2}
  }
}
  • Voice wake word – the backend ignores transcripts that do not start with “Buddy…”, ensuring hands-free commands remain intentional.
  • Custom voice output – adjust voice_output.default_voice to point at Piper, ElevenLabs, or other providers (voice/styling) once the TTS integrations are added; the configuration already propagates to the placeholder synthesizer.
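The wake-word gate on transcripts can be sketched as follows; the punctuation handling is illustrative, not the backend's exact logic:

```python
def extract_command(transcript: str):
    """Return the command portion of a transcript that starts with the
    wake word 'Buddy', or None so the backend ignores the transcript."""
    head, _, rest = transcript.strip().partition(" ")
    if head.lower().rstrip(",.!?…:") != "buddy":
        return None
    return rest.strip()
```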

Buddy automatically sorts LLM providers so zero-cost/local options (e.g., DeepSeek-VL, Piper) are tried first. If a provider fails (missing API key, payment required, rate limits, etc.) the registry seamlessly falls back to the next available option without dropping the user request.

Local Model Assets

  • Use python backend/download_models.py --list to view required local models (DeepSeek-VL for screenshots, Whisper Tiny for STT, etc.).
  • Run python backend/download_models.py deepseek-vl-lite before enabling fully-local vision analysis. Add --offline to create placeholder folders when downloading manually.

Installation

  • Run ./install.sh (macOS/Linux/WSL) to provision the Python virtualenv, install backend deps, restore local models, and pre-restore the WinUI3 frontend project. The script will remind you to install the Windows .NET tooling if it is missing.
  • Windows developers can run the same script inside WSL or execute the individual steps manually (pip install -r backend/requirements.txt, python backend/download_models.py ..., dotnet restore frontend/BuddyApp/BuddyApp.csproj).

Output Target: Buddy ships as a WinUI3 desktop application for Windows 11. The backend runs locally (Python) while the frontend builds into a WinUI3 .msix/packaged app. Keep this in mind when planning deployment or installer work.

Testing

  • Run pytest backend/tests tests from an activated virtual environment to execute backend unit tests plus the new high-level wake-word/provider tests under tests/.

Privacy Modes

  • Paranoid: Everything local, nothing to cloud
  • Balanced: Local first, cloud for complex queries (default)
  • Permissive: Cloud first for quality, local fallback


Technology Stack

Frontend

  • Framework: WinUI3 (C# / .NET 8)
  • UI: Native Windows 11 controls
  • Notifications: Windows Toast API

Backend

  • Language: Python 3.11+
  • Web Framework: Flask
  • Database: SQLite
  • Async: Threading for background tasks

AI/ML

  • Vision: DeepSeek-VL (local), Claude API (cloud)
  • STT: Whisper (local)
  • TTS: Piper (local), ElevenLabs (cloud)
  • Wake Word: Picovoice Porcupine

Libraries

  • Screenshot: mss
  • Audio: sounddevice
  • Image: Pillow
  • Windows APIs: pywin32
  • Perceptual hash: imagehash

Development Roadmap

MVP (Weeks 1-2)

Core Features:

  • Screenshot capture (multi-monitor configurable)
  • Basic context (last 3-5 screenshots)
  • Vision analysis (DeepSeek OR Claude)
  • Voice input (Whisper, push-to-talk)
  • Voice output (Piper)
  • Simple proactive notifications (stuck detection)
  • Basic system commands
  • WinUI3 interface (tray, toasts, chat)
  • JSON configuration
  • Privacy controls

Deliverables:

  • Working application
  • 2-minute demo video
  • GitHub README

Success:

  • Runs stably 1+ hour
  • Voice commands work
  • Can detect stuck pattern

V1.0 (Weeks 3-4)

Enhancements:

  • Multiple provider support with fallback
  • Improved context compression
  • Wake word detection
  • More proactive patterns
  • Extended system commands
  • Settings UI
  • Usage dashboard
  • Installer

V1.1 (Month 2)

Advanced:

  • Pattern learning from feedback
  • Research synthesis
  • Custom commands
  • Browser extension integration
  • Productivity analytics

V2.0 (Month 3+)

Major Features:

  • Vector database for semantic search
  • Long-term memory (months/years)
  • VS Code extension
  • Multi-language support
  • Team features (optional)
  • Mobile companion app

Performance Targets

Resource Usage

Idle:

  • RAM: <100MB
  • CPU: <1%

Active:

  • RAM: <500MB
  • CPU: 10-25%

Latency

  • Voice command to action: <3 seconds
  • Wake word to listening: <500ms
  • Screenshot to analysis: <5 seconds
  • UI interaction: <100ms

Storage

  • Installation: ~2GB (with models)
  • Daily usage: ~50MB
  • Monthly: ~1.5GB

Privacy & Security

Privacy Principles

  1. Local-first processing (default)
  2. User controls all data transmission
  3. No telemetry without consent
  4. Clear visibility into behavior
  5. Easy pause/disable

Security Measures

  • API keys in Windows Credential Manager
  • Database encryption (optional)
  • HTTPS for all cloud calls
  • Local API localhost-only
  • Screenshot secure deletion

Compliance

  • GDPR: Export/delete all data
  • Clear retention policies
  • Audit trail of API calls
  • Opt-in for cloud processing

API Specifications

REST Endpoints

GET /api/status
System health check

POST /api/query
Send user query

{
  "query": "What was I working on?",
  "include_screenshot": true
}

GET /api/notifications
Poll for proactive notifications

POST /api/control
Execute system command

{
  "command": "open_application",
  "target": "chrome"
}

GET /api/context
Retrieve context summary

POST /api/feedback
Submit feedback on notification
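A client for these endpoints can be sketched with the standard library alone; the backend must be running on localhost:5000, and the response shape is assumed:

```python
import json
import urllib.request

BASE = "http://localhost:5000"  # Flask backend is localhost-only

def build_query_request(query: str, include_screenshot: bool = True):
    """Build the POST /api/query request; send it with urllib.request.urlopen()."""
    body = json.dumps({"query": query, "include_screenshot": include_screenshot})
    return urllib.request.Request(
        f"{BASE}/api/query",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```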


Testing Strategy

Unit Tests

  • Screenshot capture
  • Context compression
  • Pattern detection
  • Provider selection
  • Command parsing

Integration Tests

  • End-to-end voice query
  • Proactive notification flow
  • Provider fallback
  • Context persistence

Performance Tests

  • Response latency
  • Memory usage over time
  • Database query performance

User Acceptance

  • 10-20 beta users
  • 2-week testing period
  • Weekly feedback surveys

Installation & Deployment

Requirements

Minimum:

  • Windows 11 (64-bit)
  • 8GB RAM
  • 5GB disk space
  • Microphone, speakers

Recommended:

  • 16GB RAM
  • NVIDIA GPU (4GB+ VRAM)
  • 10GB disk space

Installation

Installer Method:

  1. Download BuddySetup.exe
  2. Run installer
  3. Configure privacy mode
  4. Launch Buddy

From Source:

  1. Clone repository
  2. Install dependencies: pip install -r requirements.txt
  3. Download models: python download_models.py
  4. Build frontend: cd frontend && dotnet build
  5. Run backend: python backend/main.py
  6. Run frontend: dotnet run --project frontend

Error Handling

Error Categories

Transient: Network timeouts, rate limits

  • Retry with backoff, fall back to alternative

Configuration: Invalid API keys, malformed config

  • Fall back to defaults, notify user

Resource: Out of disk, out of memory

  • Graceful degradation, cleanup old data

User: Unrecognized command, ambiguous request

  • Clear error message, suggest corrections

Recovery

  • Automatic service restart on crash
  • Database backup before operations
  • Resume from last known state
  • Export data as last resort

Success Metrics

MVP Success

  • Runs stably 1+ hour
  • <5s response latency
  • <500MB RAM
  • 90%+ voice accuracy

V1.0 Success

  • Runs stably 8+ hours
  • <3s response latency
  • Provider fallback 100%
  • No crashes in normal use

Product Success

  • 1000+ active users (Year 1)
  • 70%+ weekly retention
  • Users save 30+ min/day
  • NPS >7

Conclusion

Buddy fundamentally changes how knowledge workers interact with desktops. By combining proactive AI, voice interaction, and context awareness, Buddy eliminates context-switching friction and enables hands-free productivity.

The local-first architecture respects privacy while provider-agnostic design prevents vendor lock-in. Users optimize for their priorities: privacy, quality, or cost.

This specification provides a complete blueprint for building Buddy, with sufficient detail to guide implementation while remaining flexible for learnings during development.

Next Steps

  1. Set up development environment
  2. Build MVP core features (weeks 1-2)
  3. Test with real usage
  4. Iterate based on feedback
  5. Launch beta program
  6. Polish for V1.0 release

The goal is shipping a useful product that solves a real problem. Buddy will evolve based on real user needs, starting with the core vision: natural conversation with your desktop while maintaining work context awareness.


Document Status: Planning Complete - Ready for Implementation
Last Updated: December 28, 2024
