Version: 1.0 MVP
Date: December 28, 2024
Author: Christian
Status: Planning Phase - Ready for Implementation
License: MIT
Buddy is a proactive AI desktop assistant for Windows 11 that enables natural voice interaction with your computer while maintaining awareness of your work context through intelligent screenshot analysis. Unlike reactive chatbots, Buddy observes your workflow and offers help proactively when patterns suggest you're stuck or could benefit from assistance.
- Local-first architecture with optional cloud enhancement
- Proactive assistance rather than reactive queries
- Voice-controlled for minimal workflow interruption
- Privacy-configurable to user preference
- Provider-agnostic - not locked to any AI service
- Talk to your desktop naturally while working
- Never lose context when switching tasks or interrupted
- Get help before you realize you need it (proactive)
- Control computer hands-free while typing
- Privacy-respecting with local-first processing
Desktop Computing Limitations:
- No natural language interface for desktop
- Constant context switching to AI tools breaks flow
- Must manually provide context every time
Voice Assistant Disconnect:
- Alexa/Google Home have no screen awareness
- Not useful for actual work tasks
- Disconnected from desktop applications
Context Loss:
- Mental context lost when switching apps
- Interruptions cause complete context loss
- Hard to resume where you left off
- No system tracks your work state
Reactive vs Proactive:
- All AI tools wait for explicit queries
- No pattern detection (stuck, repetitive actions)
- Missed opportunities for proactive help
Knowledge workers who spend 6+ hours a day at a desktop: developers, writers, and researchers who value efficiency and want AI help without breaking flow.
Two-Tier Architecture:
Frontend (WinUI3/C#):
- System tray integration
- Toast notifications
- Chat panel
- Settings interface
Backend (Python):
- Flask REST API (localhost:5000)
- Screenshot capture service
- AI provider management
- Voice input/output
- Context management
- Pattern detection
- System control
Communication: REST API (JSON over HTTP localhost only)
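To make the communication boundary concrete, here is a minimal sketch of the localhost-only Flask API, assuming the endpoint shapes listed in the API reference below; the handler bodies are illustrative stubs, not the actual implementation.

```python
# Minimal sketch of the localhost-only REST boundary (Flask).
# Endpoint paths match the API reference; handler logic is illustrative.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/status", methods=["GET"])
def status():
    # Health check polled by the WinUI3 frontend.
    return jsonify({"status": "ok"})

@app.route("/api/query", methods=["POST"])
def query():
    payload = request.get_json(force=True)
    # Route the user's question to the AI provider layer (not shown).
    return jsonify({"answer": f"Received: {payload.get('query', '')}"})

if __name__ == "__main__":
    # Bind to 127.0.0.1 only, per the localhost-only design.
    app.run(host="127.0.0.1", port=5000)
```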
Local-First: Primary processing on-device, cloud optional
Modular Providers: Abstract interfaces, easy to add new AI services
Privacy by Design: User controls everything
Resilient: Graceful degradation, no single point of failure
Performance-Conscious: Minimal idle resource usage
Purpose: Capture visual context of desktop activity
Key Features:
- Multi-monitor support with per-monitor config
- Configurable frequency (default 30 seconds)
- JPEG compression (target <500KB per image)
- Active window detection
- Privacy filtering (app blacklist, monitor exclusion)
Capture Modes:
- Active monitor only (default)
- All monitors
- Selective monitors
- Window-specific
Privacy Filtering:
- Application blacklist (KeePass, banking apps, etc.)
- Monitor exclusion list
- Time-based pausing
- Manual hotkey pause (Win+Shift+P)
Technical:
- Library: `mss` (Python) for screenshots
- Format: JPEG, quality 85%
- Max resolution: 1920x1080 per monitor
- Change detection via perceptual hashing
- Background thread capture
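The capture loop can be sketched with the libraries named above (`mss`, `Pillow`, `imagehash`). The hash-distance threshold below is an assumed tuning value, not a spec requirement:

```python
# Sketch of one capture cycle: grab the primary monitor with mss, downscale,
# and skip AI analysis when the perceptual hash barely changed.
import io

import imagehash
import mss
from PIL import Image

HASH_DISTANCE_THRESHOLD = 5  # assumed: hamming distance below this = "no change"

def capture_and_compare(previous_hash):
    with mss.mss() as sct:
        raw = sct.grab(sct.monitors[1])  # primary monitor
    img = Image.frombytes("RGB", raw.size, raw.rgb)
    img.thumbnail((1920, 1080))  # cap resolution per spec

    current_hash = imagehash.phash(img)
    if previous_hash is not None and current_hash - previous_hash < HASH_DISTANCE_THRESHOLD:
        return None, current_hash  # screen effectively unchanged; skip analysis

    buf = io.BytesIO()
    img.save(buf, "JPEG", quality=85)  # target <500KB per image
    return buf.getvalue(), current_hash
```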
Purpose: Maintain intelligent compressed representation of work history
Three-Tier Storage:
Immediate Context (High Fidelity):
- Last 2-3 screenshots with full images
- Timespan: ~5 minutes
- For: Real-time queries, proactive notifications
Recent Context (Medium Fidelity):
- Last 10-20 screenshots as summaries
- Timespan: 30-60 minutes
- For: Session continuity, task resumption
Session Context (Low Fidelity):
- High-level narrative of work session
- Timespan: Hours to days
- For: Long-term patterns, summaries
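A minimal sketch of the three tiers as plain data structures; the field names and demotion logic are assumptions chosen to illustrate how full images decay into summaries:

```python
# Sketch of the three-tier context store. When a screenshot ages out of the
# immediate tier it is summarized and demoted; pixels are dropped, text kept.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    timestamp: float
    jpeg_bytes: bytes      # full image, immediate tier only
    summary: str = ""      # filled in when demoted to the recent tier

@dataclass
class ContextStore:
    immediate: deque = field(default_factory=lambda: deque(maxlen=3))  # full images, ~5 min
    recent: deque = field(default_factory=lambda: deque(maxlen=20))    # summaries, 30-60 min
    session_narrative: str = ""                                        # hours to days

    def add(self, snap: Snapshot, summarize) -> None:
        if len(self.immediate) == self.immediate.maxlen:
            oldest = self.immediate[0]
            oldest.summary = summarize(oldest.jpeg_bytes)  # compress before demotion
            oldest.jpeg_bytes = b""
            self.recent.append(oldest)
        self.immediate.append(snap)  # deque evicts the demoted entry automatically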
Pattern Detection:
- Stuck detection (same screen 15+ min)
- Context switch detection (app/task changes)
- Repetitive action detection
- Research pattern detection
- Error pattern detection
Purpose: Abstract interface to multiple AI services with fallback chains
Supported Providers:
Vision Analysis:
- DeepSeek-VL (local, free, good quality)
- Claude Sonnet 4 (cloud, excellent quality, ~$0.015/image)
- OpenAI GPT-4o Vision (cloud alternative)
- LLaVA (local alternative)
Text-to-Speech:
- Piper (local, free, good quality)
- Coqui TTS (local alternative)
- ElevenLabs (cloud, excellent, $5-22/month)
- Azure Neural TTS (cloud, cost-effective)
Speech-to-Text:
- Whisper (local, free, excellent)
- Azure Speech (cloud, real-time capable)
- Deepgram (cloud, best cost/performance)
Provider Selection:
- Priority-based with fallback chains
- Quota tracking and automatic switching
- Quality vs cost optimization
- User-configurable preferences
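A sketch of the priority-based fallback chain, assuming every provider exposes a common callable interface; the error class and quota handling are illustrative, not the project's actual types:

```python
# Sketch of priority-based provider selection with fallback. Providers are
# sorted so lower priority numbers (e.g. free/local options) are tried first.
class ProviderError(Exception):
    pass

class ProviderRegistry:
    def __init__(self, providers):
        # providers: list of (name, priority, callable) built from config
        self.providers = sorted(providers, key=lambda p: p[1])

    def run(self, *args, **kwargs):
        errors = []
        for name, _, call in self.providers:
            try:
                return call(*args, **kwargs)  # first success wins
            except ProviderError as exc:      # missing key, quota, rate limit...
                errors.append(f"{name}: {exc}")
        raise ProviderError("All providers failed: " + "; ".join(errors))
```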
Purpose: Hands-free voice command capture and transcription
Features:
- Wake word detection (optional, using Porcupine)
- Push-to-talk mode (hotkey: Ctrl+Space)
- Speech-to-text via Whisper (local)
- Audio source filtering (avoid YouTube triggers)
- Noise reduction and enhancement
Voice Activity Detection:
- Trim silence from recordings
- Detect when actually speaking
- Improve transcription accuracy
Performance:
- Whisper base model: 1-2s latency
- 16kHz sample rate
- GPU acceleration if available
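A push-to-talk sketch using `sounddevice` and local Whisper; the fixed recording duration and model size are assumed defaults, and the real service would use voice activity detection to trim silence instead:

```python
# Record a few seconds at 16 kHz and transcribe with a local Whisper model.
import sounddevice as sd
import whisper

SAMPLE_RATE = 16_000  # matches Whisper's expected input rate

def record_and_transcribe(seconds: float = 5.0, model_name: str = "base") -> str:
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()  # block until recording finishes
    model = whisper.load_model(model_name)  # uses GPU automatically if available
    result = model.transcribe(audio.flatten())
    return result["text"].strip()
```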
Purpose: Natural-sounding voice responses
Features:
- Multiple TTS provider support
- Provider fallback chain
- Voice personality configuration
- Response queuing (prevent overlap)
- Volume and speech rate control
Primary: Piper (local, fast, unlimited)
Fallback: ElevenLabs or Azure for higher quality
Response Queuing:
- High priority (user queries) can interrupt
- Medium/low priority queued
- Prevent overlapping audio
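The queuing behavior can be sketched with a priority heap; `speak` stands in for the Piper/ElevenLabs synthesizer call, and the interrupt mechanism is an assumption about how the playback loop would cooperate:

```python
# Sketch of the response queue: lower number = higher priority, and a
# high-priority user reply signals the playback loop to stop current audio.
import heapq
import itertools
import threading

HIGH, MEDIUM, LOW = 0, 1, 2

class ResponseQueue:
    def __init__(self, speak):
        self._heap, self._lock = [], threading.Lock()
        self._counter = itertools.count()  # stable FIFO order within a priority
        self._speak = speak
        self.interrupt_requested = threading.Event()

    def enqueue(self, text: str, priority: int = MEDIUM) -> None:
        with self._lock:
            heapq.heappush(self._heap, (priority, next(self._counter), text))
        if priority == HIGH:
            self.interrupt_requested.set()  # playback loop checks this flag

    def pop(self):
        with self._lock:
            return heapq.heappop(self._heap)[2] if self._heap else None
```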
Purpose: Decide when to speak up vs stay quiet
Detection Patterns:
Stuck Detection:
- Same screen >15 minutes
- Same error message repeatedly
- No progress visible
- Notification: "You've been stuck on that error for 20 minutes. Want help?"
Context Switch Detection:
- Major app/task change
- Offer to remember previous context
- Remind when returning
Repetitive Action Detection:
- Same action 3+ times
- Suggest automation
- Offer shortcuts
Research Pattern Detection:
- Multiple sources on same topic
- Offer to synthesize findings
- Create reference document
Rate Limiting:
- Max 2 notifications per hour
- Min 15 minutes between notifications
- Learn from user feedback
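A sketch combining stuck detection with the rate limiter; the thresholds (15 minutes stuck, max 2 notifications/hour, 15 minutes between) come from the spec, while the state tracking is illustrative:

```python
# Gate that fires a "stuck" notification only when both the stuck threshold
# and the rate limits allow it.
import time

class ProactivityGate:
    def __init__(self, max_per_hour=2, min_gap_minutes=15, stuck_minutes=15):
        self.sent = []  # timestamps of past notifications
        self.max_per_hour = max_per_hour
        self.min_gap = min_gap_minutes * 60
        self.stuck_threshold = stuck_minutes * 60
        self.same_screen_since = None

    def observe(self, screen_changed: bool) -> bool:
        now = time.time()
        if screen_changed or self.same_screen_since is None:
            self.same_screen_since = now
            return False
        stuck = now - self.same_screen_since >= self.stuck_threshold
        return stuck and self._allowed(now)

    def _allowed(self, now: float) -> bool:
        self.sent = [t for t in self.sent if now - t < 3600]
        if len(self.sent) >= self.max_per_hour:
            return False
        if self.sent and now - self.sent[-1] < self.min_gap:
            return False
        self.sent.append(now)
        return True
```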
Purpose: Voice-controlled system operations
Commands:
- Window management: maximize, minimize, close
- App launching: "open Chrome", "open VS Code"
- Web navigation: "open Reddit", "search for X"
- System functions: mute, volume, shutdown, sleep
- Screen control: screenshot, brightness
Safety:
- Confirmations for destructive operations
- Rate limiting on repeated commands
- Command logging for audit
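A minimal dispatch sketch showing the confirmation gate for destructive operations; the command names mirror the list above, but the handler set and confirmation flow are assumptions:

```python
# Command dispatch with a confirmation gate for destructive operations.
DESTRUCTIVE = {"shutdown", "sleep", "close_window"}

def execute(command: str, target: str = "", confirm=input) -> str:
    if command in DESTRUCTIVE:
        answer = confirm(f"Really run '{command}'? (y/n) ")
        if answer.strip().lower() != "y":
            return "cancelled"
    handlers = {
        "open_application": lambda: f"launching {target}",  # e.g. via os.startfile
        "mute": lambda: "muted",
        "screenshot": lambda: "captured",
    }
    handler = handlers.get(command)
    return handler() if handler else f"unknown command: {command}"
```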
Complete system configuration in single JSON file:
{
  "privacy": {
    "mode": "balanced",
    "screenshot": {
      "capture_active_only": true,
      "exclude_monitors": [],
      "excluded_apps": ["KeePass.exe", "*Banking*"],
      "capture_frequency_seconds": 30
    },
    "retention": {
      "screenshot_retention_hours": 24,
      "summary_retention_days": 30
    }
  },
  "vision": {
    "providers": [
      {"name": "deepseek_local", "priority": 1, "enabled": true},
      {"name": "claude", "priority": 2, "enabled": true}
    ]
  },
  "voice_input": {
    "mode": "wake_word",
    "wake_word": {"enabled": true, "keyword": "buddy"}
  },
  "voice_output": {
    "default_voice": {
      "provider": "piper_local",
      "voice": "en_US-amy-medium",
      "style": "neutral"
    },
    "providers": [
      {"name": "piper_local", "priority": 1}
    ]
  },
  "proactivity": {
    "stuck_detection": {"threshold_minutes": 15},
    "rate_limiting": {"max_notifications_per_hour": 2}
  }
}
- Voice wake word – the backend ignores transcripts that do not start with “Buddy…”, ensuring hands-free commands remain intentional.
- Custom voice output – adjust `voice_output.default_voice` to point at Piper, ElevenLabs, or other providers (voice/styling) once the TTS integrations are added; the configuration already propagates to the placeholder synthesizer.
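A minimal sketch of the wake-word transcript gate described in the first bullet, assuming the check runs on raw Whisper output before command parsing; the helper name is hypothetical:

```python
# Drop any transcript not addressed to "buddy" before command parsing.
def extract_command(transcript: str) -> str | None:
    text = transcript.strip()
    if not text.lower().startswith("buddy"):
        return None  # not addressed to the assistant; ignore
    return text[len("buddy"):].lstrip(" ,.!?")
```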
Buddy automatically sorts LLM providers so zero-cost/local options (e.g., DeepSeek-VL, Piper) are tried first. If a provider fails (missing API key, payment required, rate limits, etc.) the registry seamlessly falls back to the next available option without dropping the user request.
- Use `python backend/download_models.py --list` to view required local models (DeepSeek-VL for screenshots, Whisper Tiny for STT, etc.).
- Run `python backend/download_models.py deepseek-vl-lite` before enabling fully-local vision analysis. Add `--offline` to create placeholder folders when downloading manually.
- Run `./install.sh` (macOS/Linux/WSL) to provision the Python virtualenv, install backend deps, restore local models, and pre-restore the WinUI3 frontend project. The script will remind you to install the Windows .NET tooling if it is missing.
- Windows developers can run the same script inside WSL or execute the individual steps manually (`pip install -r backend/requirements.txt`, `python backend/download_models.py ...`, `dotnet restore frontend/BuddyApp/BuddyApp.csproj`).
Output Target: Buddy ships as a WinUI3 desktop application for Windows 11. The backend runs locally (Python) while the frontend builds into a packaged WinUI3 `.msix` app. Keep this in mind when planning deployment or installer work.
- Run `pytest backend/tests tests` from an activated virtual environment to execute backend unit tests plus the new high-level wake-word/provider tests under `tests/`.
Paranoid: Everything local, nothing to cloud
Balanced: Local first, cloud for complex queries (default)
Permissive: Cloud first for quality, local fallback
- Framework: WinUI3 (C# / .NET 8)
- UI: Native Windows 11 controls
- Notifications: Windows Toast API
- Language: Python 3.11+
- Web Framework: Flask
- Database: SQLite
- Async: Threading for background tasks
- Vision: DeepSeek-VL (local), Claude API (cloud)
- STT: Whisper (local)
- TTS: Piper (local), ElevenLabs (cloud)
- Wake Word: Picovoice Porcupine
- Screenshot: `mss`
- Audio: `sounddevice`
- Image: `Pillow`
- Windows APIs: `pywin32`
- Perceptual hash: `imagehash`
Core Features:
- Screenshot capture (multi-monitor configurable)
- Basic context (last 3-5 screenshots)
- Vision analysis (DeepSeek OR Claude)
- Voice input (Whisper, push-to-talk)
- Voice output (Piper)
- Simple proactive notifications (stuck detection)
- Basic system commands
- WinUI3 interface (tray, toasts, chat)
- JSON configuration
- Privacy controls
Deliverables:
- Working application
- 2-minute demo video
- GitHub README
Success:
- Runs stably 1+ hour
- Voice commands work
- Can detect stuck pattern
Enhancements:
- Multiple provider support with fallback
- Improved context compression
- Wake word detection
- More proactive patterns
- Extended system commands
- Settings UI
- Usage dashboard
- Installer
Advanced:
- Pattern learning from feedback
- Research synthesis
- Custom commands
- Browser extension integration
- Productivity analytics
Major Features:
- Vector database for semantic search
- Long-term memory (months/years)
- VS Code extension
- Multi-language support
- Team features (optional)
- Mobile companion app
Idle:
- RAM: <100MB
- CPU: <1%
Active:
- RAM: <500MB
- CPU: 10-25%
- Voice command to action: <3 seconds
- Wake word to listening: <500ms
- Screenshot to analysis: <5 seconds
- UI interaction: <100ms
- Installation: ~2GB (with models)
- Daily usage: ~50MB
- Monthly: ~1.5GB
- Local-first processing (default)
- User controls all data transmission
- No telemetry without consent
- Clear visibility into behavior
- Easy pause/disable
- API keys in Windows Credential Manager
- Database encryption (optional)
- HTTPS for all cloud calls
- Local API localhost-only
- Screenshot secure deletion
- GDPR: Export/delete all data
- Clear retention policies
- Audit trail of API calls
- Opt-in for cloud processing
GET /api/status – System health check
POST /api/query – Send user query
{
  "query": "What was I working on?",
  "include_screenshot": true
}
GET /api/notifications – Poll for proactive notifications
POST /api/control – Execute system command
{
  "command": "open_application",
  "target": "chrome"
}
GET /api/context – Retrieve context summary
POST /api/feedback – Submit feedback on notification
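An example client call against the local API, assuming the backend is running on localhost:5000 as configured:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/api/query",
    json={"query": "What was I working on?", "include_screenshot": True},
    timeout=30,
)
print(resp.json())
```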
- Screenshot capture
- Context compression
- Pattern detection
- Provider selection
- Command parsing
- End-to-end voice query
- Proactive notification flow
- Provider fallback
- Context persistence
- Response latency
- Memory usage over time
- Database query performance
- 10-20 beta users
- 2-week testing period
- Weekly feedback surveys
Minimum:
- Windows 11 (64-bit)
- 8GB RAM
- 5GB disk space
- Microphone, speakers
Recommended:
- 16GB RAM
- NVIDIA GPU (4GB+ VRAM)
- 10GB disk space
Installer Method:
- Download BuddySetup.exe
- Run installer
- Configure privacy mode
- Launch Buddy
From Source:
- Clone repository
- Install dependencies: `pip install -r requirements.txt`
- Download models: `python download_models.py`
- Build frontend: `cd frontend && dotnet build`
- Run backend: `python backend/main.py`
- Run frontend: `dotnet run --project frontend`
Transient: Network timeouts, rate limits
- Retry with backoff, fall back to alternative
Configuration: Invalid API keys, malformed config
- Fall back to defaults, notify user
Resource: Out of disk, out of memory
- Graceful degradation, cleanup old data
User: Unrecognized command, ambiguous request
- Clear error message, suggest corrections
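A sketch of the transient-error policy (retry with backoff, then let the caller fall back to an alternative provider); the retry count and delays are assumed defaults, not spec values:

```python
# Exponential backoff for transient failures such as timeouts and rate limits.
import time

def with_retries(call, attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(attempts):
        try:
            return call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # caller falls back to an alternative provider
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s...
```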
- Automatic service restart on crash
- Database backup before operations
- Resume from last known state
- Export data as last resort
- Runs stably 1+ hour
- <5s response latency
- <500MB RAM
- 90%+ voice accuracy
- Runs stably 8+ hours
- <3s response latency
- Provider fallback 100%
- No crashes in normal use
- 1000+ active users (Year 1)
- 70%+ weekly retention
- Users save 30+ min/day
- NPS >7
Buddy fundamentally changes how knowledge workers interact with desktops. By combining proactive AI, voice interaction, and context awareness, Buddy eliminates context-switching friction and enables hands-free productivity.
The local-first architecture respects privacy while provider-agnostic design prevents vendor lock-in. Users optimize for their priorities: privacy, quality, or cost.
This specification provides a complete blueprint for building Buddy, with sufficient detail to guide implementation while remaining flexible for learnings during development.
- Set up development environment
- Build MVP core features (weeks 1-2)
- Test with real usage
- Iterate based on feedback
- Launch beta program
- Polish for V1.0 release
The goal is shipping a useful product that solves a real problem. Buddy will evolve based on real user needs, starting with the core vision: natural conversation with your desktop while maintaining work context awareness.
Document Status: Planning Complete - Ready for Implementation
Last Updated: December 28, 2024