diff --git a/README.md b/README.md index dea6b875e..ce4e3da46 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Android Framework +# Lighthouse AI **Operations toolkit for persistent LLM agents — process watchdog, session cleanup, memory reset, API cost monitoring, and tool call proxy.** @@ -31,6 +31,9 @@ what you're building. | [Memory Shepherd](#memory-shepherd) | Periodic memory reset to prevent agent drift | No (any markdown-based agent memory) | Linux | | [Golden Configs](#golden-configs) | Working config templates for OpenClaw + vLLM | Yes | Any | | [Workspace Templates](#workspace-templates) | Agent personality/identity starter files | Yes | Any | +| [LLM Cold Storage](#llm-cold-storage) | Archive idle HuggingFace models to free disk | No | Linux | +| [Docker Compose Stacks](#docker-compose-stacks) | One-command deployment (nano/pro tiers) | No | Any | +| [Cookbook Recipes](#cookbook-recipes) | Step-by-step guides: voice, RAG, code, privacy, multi-GPU, swarms | No | Linux | --- @@ -48,6 +51,9 @@ any framework. 
| [MULTI-AGENT-PATTERNS.md](docs/MULTI-AGENT-PATTERNS.md) | Coordination protocols, reliability math, sub-agent spawning, echo chamber prevention, supervisor pattern | | [OPERATIONAL-LESSONS.md](docs/OPERATIONAL-LESSONS.md) | Silent failures, memory management, tool calling reliability, production safety, background GPU automation | | [GUARDIAN.md](docs/GUARDIAN.md) | Infrastructure protection, autonomy tiers, immutable watchdogs, defense in depth | +| [Cookbook Recipes](docs/cookbook/) | **Practical step-by-step guides** — voice agents, RAG, code assistant, privacy proxy, multi-GPU, swarms, n8n | +| [Research](docs/research/) | Hardware buying guide, GPU TTS benchmarks, open-source model landscape | +| [Token Monitor Scope](docs/TOKEN-MONITOR-PRODUCT-SCOPE.md) | Token Spy product roadmap, competitive analysis, pricing strategy | ### The Reference Implementation (OpenClaw + vLLM) @@ -128,8 +134,8 @@ For the rationale behind every design choice: **[docs/DESIGN-DECISIONS.md](docs/ ### Option 1: Full Install (Session Cleanup + Proxy) ```bash -git clone https://github.com/Light-Heart-Labs/Android-Framework.git -cd Android-Framework +git clone https://github.com/Light-Heart-Labs/Lighthouse-AI.git +cd Lighthouse-AI # Edit config for your setup nano config.yaml @@ -202,6 +208,41 @@ nano memory-shepherd.conf # Define your agents and baselines sudo ./install.sh # Installs as systemd timer ``` +### Option 6: Docker Compose (Full Stack) + +Deploy a complete local AI stack with one command. Choose your tier: + +```bash +cd compose +cp .env.example .env +nano .env # Set your secrets + +# Pro tier (24GB+ VRAM — vLLM, Whisper, TTS, voice agent, dashboard) +docker compose -f docker-compose.pro.yml up -d + +# Nano tier (CPU only — llama.cpp, dashboard, no voice) +docker compose -f docker-compose.nano.yml up -d +``` + +### Option 7: LLM Cold Storage + +Archive HuggingFace models idle for 7+ days to free disk space. Models stay resolvable via symlink. 
+ +```bash +# Dry run (shows what would be archived) +./scripts/llm-cold-storage.sh + +# Execute for real +./scripts/llm-cold-storage.sh --execute + +# Check status +./scripts/llm-cold-storage.sh --status + +# Install as daily systemd timer +cp systemd/llm-cold-storage.service systemd/llm-cold-storage.timer ~/.config/systemd/user/ +systemctl --user enable --now llm-cold-storage.timer +``` + --- ## Configuration @@ -332,7 +373,7 @@ See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full deep dive. ## Project Structure ``` -Android-Framework/ +Lighthouse-AI/ ├── config.yaml # Configuration (edit this first) ├── install.sh # Linux installer ├── install.ps1 # Windows installer @@ -343,8 +384,13 @@ Android-Framework/ ├── scripts/ │ ├── session-cleanup.sh # Session watchdog script │ ├── vllm-tool-proxy.py # vLLM tool call proxy (v4) +│ ├── llm-cold-storage.sh # Archive idle HuggingFace models │ ├── start-vllm.sh # Start vLLM via Docker │ └── start-proxy.sh # Start the tool call proxy +├── compose/ +│ ├── docker-compose.pro.yml # Full GPU stack (vLLM + voice + dashboard) +│ ├── docker-compose.nano.yml # CPU-only minimal stack +│ └── .env.example # Environment template ├── token-spy/ # API cost & usage monitor │ ├── main.py # Proxy server + embedded dashboard │ ├── db.py # SQLite storage layer @@ -363,7 +409,9 @@ Android-Framework/ │ ├── openclaw-session-cleanup.service │ ├── openclaw-session-cleanup.timer │ ├── vllm-tool-proxy.service -│ └── token-spy@.service # Token Spy (templated per-agent) +│ ├── token-spy@.service # Token Spy (templated per-agent) +│ ├── llm-cold-storage.service # Model archival (oneshot) +│ └── llm-cold-storage.timer # Daily trigger for cold storage ├── memory-shepherd/ # Periodic memory reset for agents │ ├── memory-shepherd.sh # Config-driven reset script │ ├── memory-shepherd.conf.example # Example agent config @@ -385,9 +433,23 @@ Android-Framework/ │ ├── SETUP.md # Full local setup guide │ ├── ARCHITECTURE.md # How it all fits together │ 
├── TOKEN-SPY.md # Token Spy setup & API reference +│ ├── TOKEN-MONITOR-PRODUCT-SCOPE.md # Token Spy product roadmap & competitive analysis │ ├── OPERATIONAL-LESSONS.md # Hard-won lessons from 24/7 agent ops │ ├── MULTI-AGENT-PATTERNS.md # Coordination, swarms, and reliability -│ └── GUARDIAN.md # Infrastructure protection & autonomy tiers +│ ├── GUARDIAN.md # Infrastructure protection & autonomy tiers +│ ├── cookbook/ # Step-by-step practical recipes +│ │ ├── 01-voice-agent-setup.md # Whisper + vLLM + Kokoro +│ │ ├── 02-document-qa-setup.md # RAG with Qdrant/ChromaDB +│ │ ├── 03-code-assistant-setup.md # Tool-calling code agent +│ │ ├── 04-privacy-proxy-setup.md # PII-stripping API proxy +│ │ ├── 05-multi-gpu-cluster.md # Multi-node load balancing +│ │ ├── 06-swarm-patterns.md # Sub-agent parallelization +│ │ ├── 08-n8n-local-llm.md # Workflow automation +│ │ └── agent-template-code.md # Agent template with debugging protocol +│ └── research/ # Technical research & benchmarks +│ ├── HARDWARE-GUIDE.md # GPU buying guide with real prices +│ ├── GPU-TTS-BENCHMARK.md # TTS latency benchmarks +│ └── OSS-MODEL-LANDSCAPE-2026-02.md # Open-source model comparison └── LICENSE ``` @@ -461,6 +523,8 @@ See [docs/SETUP.md](docs/SETUP.md) for the full troubleshooting guide. 
Quick hit - **[docs/DESIGN-DECISIONS.md](docs/DESIGN-DECISIONS.md)** — Why we made the choices we did: session limits, ping cycles, deterministic supervision, and more - **[docs/PATTERNS.md](docs/PATTERNS.md)** — Six transferable patterns for autonomous agent systems, applicable to any framework - **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — Deep dive on the vLLM Tool Call Proxy internals +- **[docs/cookbook/](docs/cookbook/)** — Practical step-by-step recipes for voice, RAG, code, privacy, multi-GPU, swarms, and workflow automation +- **[docs/research/](docs/research/)** — Hardware guide, GPU benchmarks, open-source model landscape - **Android-Labs** (private) — Proof of work: 3,464 commits from 3 AI agents in 8 days --- diff --git a/compose/.env.example b/compose/.env.example new file mode 100644 index 000000000..73443c65d --- /dev/null +++ b/compose/.env.example @@ -0,0 +1,37 @@ +# Generate secure keys with: openssl rand -base64 32 + +# Dream Server Environment Configuration +# Copy this file to .env and fill in your actual values +# NEVER commit .env files with real secrets to git + +# ============================================ +# REQUIRED: LiveKit Credentials +# ============================================ +# These MUST be changed from defaults before running in production +# Generate strong secrets: openssl rand -base64 32 +LIVEKIT_API_KEY=change-me-to-a-secure-key +LIVEKIT_API_SECRET=change-me-to-a-secure-secret-min-32-chars-long + +# LiveKit connection URL (used by services to connect) +LIVEKIT_URL=ws://localhost:7880 + +# ============================================ +# Optional: Service Hosts (for Docker networking) +# ============================================ +# These are auto-configured for Docker Compose but can be overridden +# VLLM_HOST=vllm +# WHISPER_HOST=whisper +# KOKORO_HOST=kokoro +# QDRANT_HOST=qdrant +# N8N_HOST=n8n + +# ============================================ +# Optional: Model Configuration +# 
============================================ +# VLLM_MODEL=Qwen/Qwen2.5-7B-Instruct-AWQ +# LLM_MODEL=Qwen/Qwen2.5-7B-Instruct-AWQ + +# ============================================ +# Optional: Dashboard API +# ============================================ +# DASHBOARD_ALLOWED_ORIGINS=http://localhost:3001,http://127.0.0.1:3001 diff --git a/compose/docker-compose.nano.yml b/compose/docker-compose.nano.yml new file mode 100644 index 000000000..6232b759a --- /dev/null +++ b/compose/docker-compose.nano.yml @@ -0,0 +1,63 @@ +# Dream Server — Nano Tier +# 8GB+ RAM, no GPU required — 1-3B models, text-only +# Usage: docker compose -f docker-compose.nano.yml up -d +# +# Note: Voice features disabled (no GPU for real-time STT/TTS) +# Use text chat via API or dashboard + +services: + # ═══════════════════════════════════════════════════════════════ + # LLM — Qwen2.5-1.5B via llama.cpp (CPU) + # ═══════════════════════════════════════════════════════════════ + llama: + image: ghcr.io/ggerganov/llama.cpp:server + container_name: dream-llama + ports: + - "8000:8080" + volumes: + - ${MODELS_DIR:-~/.cache/models}:/models + command: > + --model /models/qwen2.5-1.5b-instruct-q4_k_m.gguf + --ctx-size 8192 + --n-gpu-layers 0 + --threads 4 + --host 0.0.0.0 + --port 8080 + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8080/health"] + interval: 30s + timeout: 10s + retries: 3 + start_period: 60s + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # Dashboard + API (no voice features) + # ═══════════════════════════════════════════════════════════════ + dashboard: + build: + context: ./dashboard + dockerfile: Dockerfile + container_name: dream-dashboard + ports: + - "3001:3001" + environment: + - VITE_API_URL=http://localhost:3002 + - VITE_VOICE_ENABLED=false + depends_on: + - api + restart: unless-stopped + + api: + build: + context: ./api + dockerfile: Dockerfile + container_name: dream-api + ports: + - "3002:3002" + 
environment: + - LLM_URL=http://llama:8080 + - VOICE_ENABLED=false + depends_on: + - llama + restart: unless-stopped diff --git a/compose/docker-compose.pro.yml b/compose/docker-compose.pro.yml new file mode 100644 index 000000000..9c88cf502 --- /dev/null +++ b/compose/docker-compose.pro.yml @@ -0,0 +1,184 @@ +# Dream Server — Pro Tier +# 24GB+ VRAM — 32B models, full voice stack +# Usage: docker compose -f docker-compose.pro.yml up -d + +services: + # ═══════════════════════════════════════════════════════════════ + # LLM — Qwen2.5-Coder-32B + # ═══════════════════════════════════════════════════════════════ + vllm: + image: vllm/vllm-openai:latest + runtime: nvidia + container_name: dream-vllm + environment: + - NVIDIA_VISIBLE_DEVICES=all + - VLLM_ATTENTION_BACKEND=FLASHINFER + volumes: + - ${HF_HOME:-~/.cache/huggingface}:/root/.cache/huggingface + ports: + - "8000:8000" + command: > + --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ + --quantization awq + --max-model-len 32768 + --gpu-memory-utilization 0.90 + --enable-auto-tool-choice + --tool-call-parser hermes + --served-model-name gpt-4o + --trust-remote-code + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: 1 + capabilities: [gpu] + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8000/health"] + interval: 30s + timeout: 10s + retries: 5 + start_period: 300s + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # STT — Whisper Large v3 + # ═══════════════════════════════════════════════════════════════ + whisper: + image: fedirz/faster-whisper-server:latest-cuda + runtime: nvidia + container_name: dream-whisper + environment: + - WHISPER__MODEL=Systran/faster-whisper-large-v3 + - WHISPER__DEVICE=cuda + - NVIDIA_VISIBLE_DEVICES=all + ports: + - "8001:8000" + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: 1 + capabilities: [gpu] + healthcheck: + test: ["CMD", "curl", "-f", 
"http://localhost:8000/health"] + interval: 30s + timeout: 10s + retries: 3 + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # TTS — Kokoro (GPU-accelerated) + # ═══════════════════════════════════════════════════════════════ + kokoro: + image: ghcr.io/remsky/kokoro-fastapi-gpu:latest + runtime: nvidia + container_name: dream-kokoro + environment: + - NVIDIA_VISIBLE_DEVICES=all + ports: + - "8880:8880" + volumes: + - kokoro-cache:/app/cache + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: 1 + capabilities: [gpu] + healthcheck: + test: ["CMD", "curl", "-f", "http://localhost:8880/health"] + interval: 30s + timeout: 10s + retries: 3 + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # LiveKit — WebRTC Server + # ═══════════════════════════════════════════════════════════════ + livekit: + image: livekit/livekit-server:latest + container_name: dream-livekit + ports: + - "7880:7880" # HTTP + - "7881:7881" # WebRTC TCP + - "7882:7882/udp" # WebRTC UDP + command: > + --config /livekit.yaml + volumes: + - ./livekit.yaml:/livekit.yaml:ro + healthcheck: + test: ["CMD", "wget", "--spider", "-q", "http://localhost:7880"] + interval: 10s + timeout: 5s + retries: 3 + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # Voice Agent — Connects LLM + STT + TTS via LiveKit + # ═══════════════════════════════════════════════════════════════ + voice-agent: + build: + context: ./voice-agent + dockerfile: Dockerfile + container_name: dream-voice-agent + environment: + - LIVEKIT_URL=ws://livekit:7880 + - LIVEKIT_API_KEY=${LIVEKIT_API_KEY:?LIVEKIT_API_KEY must be set} + - LIVEKIT_API_SECRET=${LIVEKIT_API_SECRET:?LIVEKIT_API_SECRET must be set} + - LLM_BASE_URL=http://vllm:8000/v1 + - STT_BASE_URL=http://whisper:8000 + - TTS_BASE_URL=http://kokoro:8880 + depends_on: + vllm: + condition: service_healthy + whisper: + 
condition: service_healthy + kokoro: + condition: service_healthy + livekit: + condition: service_healthy + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # Dashboard — Web UI + # ═══════════════════════════════════════════════════════════════ + dashboard: + build: + context: ./dashboard + dockerfile: Dockerfile + container_name: dream-dashboard + ports: + - "3001:3001" + environment: + - VITE_API_URL=http://localhost:3002 + - VITE_LIVEKIT_URL=ws://localhost:7880 + depends_on: + - api + restart: unless-stopped + + # ═══════════════════════════════════════════════════════════════ + # API — Backend for Dashboard + # ═══════════════════════════════════════════════════════════════ + api: + build: + context: ./api + dockerfile: Dockerfile + container_name: dream-api + ports: + - "3002:3002" + environment: + - VLLM_URL=http://vllm:8000 + - WHISPER_URL=http://whisper:8000 + - KOKORO_URL=http://kokoro:8880 + - LIVEKIT_URL=ws://livekit:7880 + - LIVEKIT_API_KEY=${LIVEKIT_API_KEY:?LIVEKIT_API_KEY must be set} + - LIVEKIT_API_SECRET=${LIVEKIT_API_SECRET:?LIVEKIT_API_SECRET must be set} + depends_on: + - vllm + restart: unless-stopped + +volumes: + kokoro-cache: diff --git a/config.yaml b/config.yaml index 87b08a5d3..b662769ca 100644 --- a/config.yaml +++ b/config.yaml @@ -1,5 +1,5 @@ -# Android Framework - Configuration -# https://github.com/Light-Heart-Labs/Android-Framework +# Lighthouse AI - Configuration +# https://github.com/Light-Heart-Labs/Lighthouse-AI # ───────────────────────────────────────────── # Session Cleanup Settings @@ -99,6 +99,27 @@ token_spy: # Agents using local/self-hosted models (comma-separated, get $0 cost badge) local_model_agents: "" +# ───────────────────────────────────────────── +# LLM Cold Storage +# ───────────────────────────────────────────── +llm_cold_storage: + # Archive HuggingFace models not accessed in N days to cold storage. + # Symlinks preserve cache resolution. 
Safe: dry-run by default. + enabled: false + + # HuggingFace cache directory + hf_cache_dir: "~/.cache/huggingface/hub" + + # Where to move archived models + cold_dir: "~/llm-cold-storage" + + # Archive models idle for this many days + max_idle_days: 7 + + # Models to never archive (HuggingFace cache directory names) + # Example: ["models--Qwen--Qwen3-Coder-Next-FP8"] + protected_models: [] + # ───────────────────────────────────────────── # System User # ───────────────────────────────────────────── diff --git a/docs/MULTI-AGENT-PATTERNS.md b/docs/MULTI-AGENT-PATTERNS.md index 94536230b..903e52714 100644 --- a/docs/MULTI-AGENT-PATTERNS.md +++ b/docs/MULTI-AGENT-PATTERNS.md @@ -96,125 +96,20 @@ Don't use redundancy for: ## Sub-Agent Spawning -### Task Templates That Work - -The difference between a 30% and 90% success rate often comes down to how the -task is written. - -**High success (~90%):** - -``` -You are a [ROLE] agent. - -Complete ALL of these steps: - -1. Run: ssh user@192.168.0.100 "[COMMAND_1]" -2. Run: ssh user@192.168.0.100 "[COMMAND_2]" -3. Run: ssh user@192.168.0.100 "[COMMAND_3]" -4. Write ALL findings to: /absolute/path/to/output.md - -Include raw command outputs. Do not summarize or omit. -Do not stop until the file is written. -Reply "Done". Do not output JSON. Do not loop. -``` - -**What makes it work:** -1. Explicit commands (not "check the system" — actual commands to run) -2. Numbered steps (1, 2, 3 — not prose paragraphs) -3. Absolute file paths (not relative, not "save it somewhere") -4. Reinforcement ("do not stop until the file is written") -5. Stop prompt ("Reply Done. Do not output JSON. Do not loop.") -6. Single focus (one role, one objective) - -**Low success (~30-40%):** -- Indirect instructions: "SSH as: user@host" instead of "Run: ssh user@host ..." 
-- Ambiguous scope: "Document all security configuration" -- Multi-server tasks: "Check both server A and server B" -- Open-ended exploration: "Look around and report what you find" -- Complex conditional logic in a single task - -### When to Spawn vs. Do Directly - -**Rule of thumb:** If you can write the task as one clear sentence with no -"and then," it's spawn-able. - -| Spawn | Do Directly | -|---|---| -| Pure research, multiple independent questions | Needs tool execution with complex chains | -| Repetitive validation across artifacts | Time-sensitive, need it now | -| Document generation from clear templates | Complex multi-step workflows | -| Data gathering, parallel searches | Tasks requiring decisions mid-execution | - -### Resource Management - -Each sub-agent consumes GPU memory. On a single GPU: - -| GPU Load | Concurrent Agents | Recommendation | -|---|---|---| -| Light | 1-4 | Fast, reliable | -| Medium | 5-8 | Good throughput, optimal sweet spot | -| Heavy | 9-12 | Some queuing expected | -| Overloaded | 13+ | Timeouts likely | - -**Pre-spawn health check:** -```bash -# Check VRAM before spawning -curl localhost:9199/status | jq '.nodes[].vram_percent' -# If > 90%, defer spawning or use a lighter approach -``` - -**Timeouts are mandatory.** Without `runTimeoutSeconds`, local models can loop -indefinitely. Recommended values: - -| Task Complexity | Timeout | -|---|---| -| Simple (file write, single command) | 60s | -| Multi-step (3-5 actions) | 120s | -| Complex research | 180s | - -### Spawning Patterns - -**Pattern 1: Research Fan-Out** - -Spawn N agents, each with one focused question. Each writes findings to a -specific file. Coordinator aggregates. - -``` -Coordinator - ├── Agent 1: "What are the top 3 embedding models for code search?" - │ → writes to /tmp/research/embeddings.md - ├── Agent 2: "What vector databases support hybrid search?" - │ → writes to /tmp/research/vector-dbs.md - └── Agent 3: "What's the state of the art in code chunking?" 
- → writes to /tmp/research/chunking.md -``` - -**Constraint:** Each agent gets ONE question. Don't overload. - -**Pattern 2: Validation Sweep** - -Define validation criteria. Spawn one agent per artifact. Agents report -pass/fail with specific issues. - -Good for: testing multiple configs, validating documentation accuracy, -checking multiple endpoints. - -**Pattern 3: Document Generation** - -Define a template. Spawn agents with specific content assignments. Works well -for API docs, how-to guides, research summaries. - -Fails for: docs requiring tool execution, cross-file coordination, or content -that depends on other agents' output. - -### Anti-Patterns - -| Anti-Pattern | Why It Fails | -|---|---| -| Tool-heavy sub-agents | Local models output tool calls as plain text JSON | -| Overloaded task scope | Too many objectives = shallow coverage on all of them | -| Cross-agent dependencies | Sub-agents can't read each other's output mid-run | -| Long-running complex chains | Multi-step workflows with decision points derail | +Sub-agent spawning is the most powerful parallelization primitive for local +agents. The key insights: + +- **Task templates matter more than model quality** — the difference between + 30% and 90% success rates is how the task is written (numbered steps, + absolute paths, stop prompts) +- **One question per agent** — fan out N focused tasks, aggregate results +- **Timeouts are mandatory** — without them, local models loop indefinitely +- **Resource-aware spawning** — 5-8 concurrent agents is the sweet spot on a + single GPU; beyond 12, timeouts become likely + +For the full treatment — task templates, spawning patterns, resource management +tables, and anti-patterns — see +[cookbook/06-swarm-patterns.md](cookbook/06-swarm-patterns.md). 
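The fan-out-with-timeout discipline above can be sketched in a few lines. This is an illustrative Python sketch, not OpenClaw's API: `spawn_agent` is a hypothetical placeholder for your framework's real spawn call, and the constants mirror the guidance above (5-8 concurrent agents, mandatory per-task timeout).

```python
# Sketch: resource-aware fan-out with a mandatory timeout per sub-agent.
# spawn_agent() is a hypothetical stand-in for the real spawn call.
import concurrent.futures

MAX_CONCURRENT = 8      # single-GPU sweet spot per the guidance above
TIMEOUT_SECONDS = 120   # multi-step task budget; never spawn without one

def spawn_agent(task: str) -> str:
    """Placeholder for the real sub-agent spawn call."""
    return f"done: {task}"

def fan_out(tasks):
    """One focused question per agent; aggregate results by task."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = {pool.submit(spawn_agent, t): t for t in tasks}
        for fut, task in futures.items():
            try:
                results[task] = fut.result(timeout=TIMEOUT_SECONDS)
            except concurrent.futures.TimeoutError:
                # A looping local model, not a crash — record and move on.
                results[task] = "TIMEOUT"
    return results
```

The timeout lives in the coordinator, so a single looping sub-agent costs one slot for `TIMEOUT_SECONDS` at worst instead of stalling the whole sweep.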
--- diff --git a/docs/OPERATIONAL-LESSONS.md b/docs/OPERATIONAL-LESSONS.md index 72945c99e..514339021 100644 --- a/docs/OPERATIONAL-LESSONS.md +++ b/docs/OPERATIONAL-LESSONS.md @@ -379,3 +379,14 @@ None of these block agent work — they run during idle windows. If an agent needs the GPU, inference requests from the background systems simply queue behind the agent's requests (vLLM's continuous batching handles this transparently). + +--- + +## Further Reading + +- [research/HARDWARE-GUIDE.md](research/HARDWARE-GUIDE.md) — GPU buying guide + with tier rankings and price-performance analysis +- [research/GPU-TTS-BENCHMARK.md](research/GPU-TTS-BENCHMARK.md) — TTS latency + benchmarks (GPU vs CPU, concurrency scaling) +- [research/OSS-MODEL-LANDSCAPE-2026-02.md](research/OSS-MODEL-LANDSCAPE-2026-02.md) — + Open-source model comparison with tool-calling success rates diff --git a/docs/PATTERNS.md b/docs/PATTERNS.md index a7589c985..e95eedb94 100644 --- a/docs/PATTERNS.md +++ b/docs/PATTERNS.md @@ -295,4 +295,4 @@ The patterns compose well. Each addresses a different failure mode, and each is --- -*These patterns will evolve. If you discover improvements, [open an issue](https://github.com/Light-Heart-Labs/Android-Framework/issues).* +*These patterns will evolve. If you discover improvements, [open an issue](https://github.com/Light-Heart-Labs/Lighthouse-AI/issues).* diff --git a/docs/PHILOSOPHY.md b/docs/PHILOSOPHY.md index 2bb866e4f..48629e7e1 100644 --- a/docs/PHILOSOPHY.md +++ b/docs/PHILOSOPHY.md @@ -224,7 +224,14 @@ Read the single-agent path first, then: 5. [MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md) — Coordination, swarms, redundancy, the supervisor pattern -6. [GUARDIAN.md](GUARDIAN.md) — Infrastructure protection and autonomy tiers +6. [cookbook/06-swarm-patterns.md](cookbook/06-swarm-patterns.md) — Hands-on + sub-agent spawning patterns with code examples +7. 
[GUARDIAN.md](GUARDIAN.md) — Infrastructure protection and autonomy tiers + +### "I want to build something specific" + +Browse the [Cookbook](cookbook/README.md) — step-by-step recipes for voice +agents, document Q&A, code assistants, multi-GPU clusters, and more. ### "I want to understand the theory without building anything" @@ -314,11 +321,21 @@ PHILOSOPHY.md (you are here) ├── Infrastructure & Safety │ └── GUARDIAN.md — Watchdogs, autonomy tiers, protection │ + ├── Cookbook (cookbook/) — Step-by-step build recipes + │ ├── 01-voice-agent-setup.md — Whisper + vLLM + Kokoro pipeline + │ ├── 05-multi-gpu-cluster.md — Multi-GPU cluster guide + │ └── 06-swarm-patterns.md — Sub-agent parallelization patterns + │ + ├── Research (research/) — Benchmarks and hardware analysis + │ ├── HARDWARE-GUIDE.md — GPU buying guide + │ └── OSS-MODEL-LANDSCAPE-2026-02.md — Open-source model comparison + │ └── Reference Implementation (OpenClaw + vLLM) ├── README.md — Toolkit overview and quick start ├── ARCHITECTURE.md — How OpenClaw talks to vLLM ├── SETUP.md — Step-by-step local deployment - └── TOKEN-SPY.md — Cost monitoring setup and API + ├── TOKEN-SPY.md — Cost monitoring setup and API + └── TOKEN-MONITOR-PRODUCT-SCOPE.md — Token Spy product roadmap ``` The top three sections are framework-agnostic. The reference implementation diff --git a/docs/SETUP.md b/docs/SETUP.md index 58dd54037..edd092e97 100644 --- a/docs/SETUP.md +++ b/docs/SETUP.md @@ -348,3 +348,12 @@ To connect your agent to Discord, add a `channels` section to `openclaw.json`: Set `requireMention: false` for channels where the agent should respond to every message, or `true` for channels where it only responds when @mentioned. 
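A minimal sketch of what that `channels` section might look like — the channel IDs are placeholders and any field shape beyond `channels`/`requireMention` is an assumption, so check your framework's config reference for the exact schema:

```json
{
  "channels": {
    "123456789012345678": { "requireMention": false },
    "876543210987654321": { "requireMention": true }
  }
}
```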
+ +--- + +## Further Reading + +- [Cookbook recipes](cookbook/README.md) — Step-by-step guides for voice agents, + document Q&A, code assistants, multi-GPU clusters, and more +- [research/HARDWARE-GUIDE.md](research/HARDWARE-GUIDE.md) — GPU buying guide + with tier rankings, used market analysis, and price-performance comparisons diff --git a/docs/TOKEN-MONITOR-PRODUCT-SCOPE.md b/docs/TOKEN-MONITOR-PRODUCT-SCOPE.md new file mode 100644 index 000000000..1a667c0ce --- /dev/null +++ b/docs/TOKEN-MONITOR-PRODUCT-SCOPE.md @@ -0,0 +1,402 @@ +# Token Spy — Product Scope & Roadmap +*(formerly OpenClaw Token Monitor)* + +## Executive Summary + +OpenClaw Token Monitor is a **transparent API proxy** that captures per-request token usage, cost, and session health metrics for LLM-powered agents — with **zero code changes** to downstream applications. It currently runs as a personal tool monitoring two AI agents across two LLM providers (Anthropic, Moonshot/Kimi). + +This document scopes the path from personal tool to commercial product, targeting developers and teams running LLM-powered agents, workflows, and applications who need visibility into what they're spending, where, and why. + +--- + +## Core Value Proposition + +**"See everything your AI spends. Change nothing in your code."** + +Unlike SDK-based observability tools (LangSmith, Langfuse, W&B Weave) that require instrumenting every call site, and unlike competing proxy tools (Helicone, Portkey) that still require a base URL change and auth header, OpenClaw Token Monitor operates as a truly transparent proxy — point your agent's traffic through it and every LLM interaction is automatically captured, analyzed, and visualized. + +### Why This Matters + +- **Zero integration friction** — No SDK, no framework lock-in, no code changes. Works with any language, any LLM client library, any agent framework. +- **Session intelligence** — Not just request logging. 
Understands conversation arcs, detects session boundaries, tracks context window growth, and recommends when to reset. +- **Prompt cost attribution** — Breaks down what's actually eating tokens: system prompt components, workspace files, skill injections, conversation history. No other tool does this at the proxy level. +- **Operational safety** — Auto-resets runaway sessions before they burn through budgets. Acts as both observer and guardrail. + +--- + +## Competitive Landscape + +| Tool | Approach | Integration Effort | Strengths | Weakness vs. Us | +|------|----------|-------------------|-----------|-----------------| +| **Helicone** | Proxy gateway (Rust/CF Workers) | Base URL + API key header change | Mature, open source, 2B+ interactions | Still requires code change; no session intelligence | +| **Portkey** | AI gateway | Base URL change + SDK optional | 200+ providers, guardrails, enterprise governance | Heavy/complex; no prompt-level cost attribution | +| **Langfuse** | SDK instrumentation | SDK integration per call site | Open source, deep tracing, self-hostable | Framework coupling; maintenance burden | +| **LangSmith** | SDK (LangChain native) | LangChain/LangGraph integration | Deep chain tracing, evaluation | Ecosystem lock-in; useless outside LangChain | +| **Datadog LLM** | SDK instrumentation | Python SDK + Datadog agent | Integrates with existing infra monitoring | Enterprise pricing; Python-only; heavy stack | +| **Groundcover** | eBPF kernel-level | Zero (but K8s + eBPF required) | Truly zero instrumentation | K8s-only; no session awareness; infrastructure-focused | +| **Braintrust** | SDK + eval platform | SDK integration | Strong evaluation/scoring | Evaluation-first, not operations-first | + +### Our Differentiated Position + +1. **Transparent proxy** — zero code changes, works in any environment (not just K8s) +2. **Session-aware intelligence** — conversation arc tracking, auto-reset, cache efficiency analysis +3. 
**Prompt cost decomposition** — see exactly which parts of your system prompt are costing money +4. **Operational safety rails** — budget enforcement and runaway session protection built into the proxy layer + +--- + +## What Exists Today + +### Current Architecture +``` +Agent-A ──► Proxy ──► api.anthropic.com +Agent-B ──► Proxy ──► api.moonshot.ai + │ + SQLite DB (usage.db) + │ + Dashboard (served by proxy) +``` + +### Current Capabilities +- Transparent proxy for Anthropic Messages API and OpenAI-compatible Chat Completions API +- SSE streaming passthrough with zero buffering +- Per-turn logging: model, tokens (input/output/cache_read/cache_write), cost, latency, stop reason +- Request analysis: message count by role, tool count, request body size +- System prompt decomposition: workspace files (AGENTS.md, SOUL.md, etc.), skill injections, base prompt +- Conversation history char tracking across turns +- Session boundary detection (history drop = new session) +- Session health scoring with recommendations (healthy → monitor → compact_soon → reset_recommended → cache_unstable) +- Auto-reset safety valve (kills sessions exceeding 200K chars) +- External session manager (cron job, cleans inactive sessions, enforces count limits) +- Dashboard: summary cards, cost-per-turn timeline, history growth chart, token usage bars, cost breakdown doughnut, cumulative cost, recent turns table, session health panels with reset buttons +- Cost estimation with per-model pricing tables (8 Claude variants, 4 Kimi variants) +- Protocol translation (OpenAI `developer` role → `system` for Kimi compatibility) + +### Current Limitations +- Single-user, hardcoded agent names and session directories +- Two providers only (Anthropic, Moonshot), each requiring a separate handler +- SQLite with thread-local connections (single-node only) +- Dashboard is inline HTML in main.py (no component framework, no auth) +- No alerting, no budgets, no API keys for the proxy itself +- No data export, no 
retention policies, no multi-node deployment + +--- + +## Product Roadmap + +### Phase 1: Foundation (Weeks 1–6) +**Goal: Multi-user, multi-provider proxy that anyone can self-host.** + +#### 1.1 Provider Plugin System +Generalize the two existing proxy handlers into a provider adapter interface. + +- **Provider adapter contract**: Each provider implements `parse_request()`, `forward_streaming()`, `forward_sync()`, `extract_usage()`, `estimate_cost()` +- **Built-in adapters**: Anthropic Messages API, OpenAI Chat Completions API (covers OpenAI, Azure OpenAI, Moonshot/Kimi, Groq, Together, Fireworks, DeepSeek, any OpenAI-compatible) +- **Google Vertex/Gemini adapter**: Third priority given market share +- **Configuration-driven**: Provider endpoints, cost tables, and model mappings defined in YAML/TOML config, not code +- **Custom cost tables**: Users override per-model pricing to match their negotiated rates or fine-tuned model costs + +```yaml +providers: + anthropic: + base_url: https://api.anthropic.com + adapter: anthropic_messages + models: + claude-sonnet-4: + input: 3.00 + output: 15.00 + cache_read: 0.30 + cache_write: 3.75 + + openai: + base_url: https://api.openai.com + adapter: openai_chat + models: + gpt-4o: + input: 2.50 + output: 10.00 +``` + +#### 1.2 Multi-Tenancy & Auth +- **Proxy API keys**: Customers generate keys that authenticate requests to the proxy. The proxy maps keys to tenants and attaches metadata (tenant, agent, environment) to every logged request. +- **Tenant isolation**: All queries scoped by tenant. No cross-tenant data leakage. +- **Dashboard auth**: Session-based login for the web dashboard. Each tenant sees only their data. +- **Provider key management**: Customers register their own provider API keys (encrypted at rest). The proxy injects the correct key when forwarding upstream. 
+ +#### 1.3 Database Migration +- **PostgreSQL** as the primary store for transactional data (tenants, API keys, provider configs) +- **TimescaleDB extension** (or ClickHouse) for the usage time-series data — enables fast aggregation queries over large time ranges without manual rollup tables +- **Migration path**: Script to import existing SQLite data +- **Retention policies**: Configurable per-tenant (e.g., raw data for 30 days, hourly rollups for 1 year) + +#### 1.4 Configuration & Deployment +- **YAML/TOML config file** replacing all hardcoded values (agent names, thresholds, upstream URLs, cost tables) +- **Docker Compose** for self-hosted deployment (proxy + postgres + dashboard) +- **Environment variable overrides** for 12-factor compatibility +- **Health check endpoints** with dependency status (upstream providers reachable, DB connected) + +**Phase 1 Deliverable**: A self-hostable Docker Compose stack that any developer can deploy, create an API key, point their agents at, and immediately see usage data in an authenticated dashboard. Supports any OpenAI-compatible or Anthropic-compatible provider out of the box. + +--- + +### Phase 2: Analytics Dashboard (Weeks 7–12) +**Goal: A real frontend that makes the data actionable.** + +#### 2.1 Dashboard Rebuild +- **Next.js + React** frontend (or SvelteKit — lighter weight, good fit for data dashboards) +- **Responsive design** preserving the current dark theme aesthetic +- **Real-time updates** via WebSocket or Server-Sent Events (watch agents work live) +- **Time range picker** with presets (1h, 6h, 24h, 7d, 30d, custom range) +- **Auto-refresh** with configurable interval + +#### 2.2 Core Analytics Views + +**Overview Dashboard** +- Total spend (period), trend vs. 
previous period +- Active agents/workflows count +- Request volume and error rate +- Top spenders (by agent, model, provider) +- Cost forecast based on current burn rate + +**Agent/Workflow Explorer** +- Per-agent drill-down: cost over time, token distribution, session timeline +- Session replay: step through a session's turns, see cost accumulate, identify expensive turns +- Conversation arc visualization: history growth, cache efficiency over session lifetime +- Compare agents side-by-side (cost efficiency, token patterns, model usage) + +**Model Analytics** +- Cost per model over time +- Token efficiency by model (output tokens per dollar) +- Latency distribution by model and provider +- Cache hit rates by model (which models benefit most from prompt caching?) +- Model comparison: "Switching Agent X from Opus to Sonnet would save $Y/day based on last 7 days" + +**Prompt Economics** +- System prompt cost attribution: what percentage of input cost goes to system prompt vs. conversation history vs. tool definitions? +- Prompt component breakdown over time (unique to OpenClaw — no competitor has this) +- "Your AGENTS.md file costs $0.003 per turn across 200 turns/day = $0.60/day. Is it worth it?" 
+- Workspace file size trends — detect prompt bloat early + +**Cost & Budget** +- Cumulative cost by any dimension (agent, model, provider, tag, time) +- Budget configuration per agent/team/tag with alerts +- Projected monthly cost based on rolling averages +- Cost anomaly detection (sudden spend spikes) + +#### 2.3 Tagging & Metadata +- **Request tags**: Arbitrary key-value metadata attached to requests via HTTP headers (e.g., `X-OpenClaw-Tags: env=prod,workflow=customer-support,team=backend`) +- **Agent auto-detection**: Infer agent identity from API key, request patterns, or explicit header +- **Environment segmentation**: dev/staging/prod cost breakdowns +- **Custom dimensions**: Let users define their own grouping dimensions + +**Phase 2 Deliverable**: A polished, real-time analytics dashboard that turns raw telemetry into actionable insights about cost, efficiency, and agent behavior. The prompt economics view is the flagship differentiator. + +--- + +### Phase 3: Intelligence & Automation (Weeks 13–20) +**Goal: The proxy doesn't just observe — it advises and acts.** + +#### 3.1 Alerting & Budgets +- **Alert rules**: Configurable triggers on any metric (cost > $X/hour, cache hit rate < Y%, latency > Zms, error rate > N%) +- **Budget enforcement**: Hard and soft limits per agent, team, or tag. Soft = alert. Hard = reject requests with 429. +- **Notification channels**: Email, Slack webhook, PagerDuty, generic webhook +- **Anomaly alerts**: Automatic detection of unusual spending patterns without manual threshold configuration + +#### 3.2 Smart Recommendations +Evolve the existing session health recommendations into a broader advisor system: + +- **Model routing suggestions**: "Agent X used Opus for 47 turns where average output was <100 tokens. Haiku would handle these at 1/5 the cost." Based on actual usage patterns, not guesses. +- **Cache optimization**: "Your cache hit rate for Agent Y dropped from 95% to 60% after you updated SOUL.md. 
The new version breaks prefix cache alignment. Here's why." +- **Prompt trimming**: "TOOLS.md accounts for 12K chars of every request but tools are only called in 8% of turns. Consider lazy-loading tool definitions." +- **Session lifecycle**: "Agent X sessions average 45 turns before context window pressure causes quality degradation. Consider auto-compaction at turn 35." (Extension of existing auto-reset logic.) +- **Cost allocation insights**: "80% of your spend is conversation history re-transmission. Aggressive summarization or session splitting would reduce costs by ~40%." + +#### 3.3 API & Integrations +- **REST API** for all dashboard data (already partially exists — formalize and version it) +- **OpenTelemetry export**: Push metrics to Datadog, Grafana, New Relic, etc. +- **Prometheus `/metrics` endpoint**: For teams with existing Prometheus/Grafana stacks +- **Webhook on events**: Fire webhooks on session reset, budget exceeded, anomaly detected, etc. +- **CSV/JSON export**: Download usage data for custom analysis + +#### 3.4 Session Management (Productize) +Generalize the existing session-manager.sh and auto-reset system: + +- **Session lifecycle policies**: Per-agent rules for when to compact, reset, or alert +- **Session cost tracking**: Total cost per session, not just per turn +- **Session quality scoring**: Detect degradation patterns (growing latency, cache thrashing, increasing error rate) as a session ages +- **Manual session controls**: Reset, pause, or throttle agents from the dashboard (already partially exists — polish and generalize) + +**Phase 3 Deliverable**: An intelligent proxy that actively helps users reduce costs and improve agent performance, with integrations into existing DevOps tooling. 
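
Extending the YAML style of the Phase 1 provider config, the alert and budget rules from 3.1 might look like this — the field names here are illustrative, not a final schema:

```yaml
alerts:
  - name: prod-cost-spike
    metric: cost_per_hour      # any logged metric: latency, error_rate, cache_hit_rate...
    condition: "> 25.00"
    channels: [slack, email]

budgets:
  - scope: { tag: "team=backend" }
    monthly_limit_usd: 500
    mode: soft                 # soft = alert only
  - scope: { agent: "support-bot" }
    daily_limit_usd: 40
    mode: hard                 # hard = reject requests with 429 once exceeded
```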
+ +--- + +### Phase 4: Enterprise & Scale (Weeks 21–30) +**Goal: Ready for teams and organizations.** + +#### 4.1 Multi-User & RBAC +- **Organizations & teams**: Hierarchical structure (org → team → agent) +- **Role-based access**: Admin (full access), Member (view + configure own agents), Viewer (read-only dashboards) +- **SSO**: SAML and OIDC for enterprise identity providers +- **Audit log**: Who changed what configuration, who triggered a session reset, etc. + +#### 4.2 Scaling +- **Horizontal proxy scaling**: Stateless proxy instances behind a load balancer (state lives in Postgres/Timescale) +- **Connection pooling**: Replace per-request httpx clients with a managed pool +- **Request queuing**: Optional rate limiting and request queuing to protect upstream providers during traffic spikes +- **Multi-region**: Deploy proxy instances close to users and upstream providers to minimize latency overhead + +#### 4.3 Security & Compliance +- **API key encryption**: Vault-backed secret storage for provider API keys +- **TLS everywhere**: mTLS between proxy and upstream providers +- **Request/response redaction**: Option to strip or hash sensitive content before logging (PII protection) +- **SOC 2 Type II** preparation (required for enterprise sales in this space) +- **Data residency**: Per-tenant control over where data is stored + +#### 4.4 Advanced Proxy Features +- **Smart routing**: Route requests to the cheapest/fastest provider based on model, latency, and cost rules +- **Automatic fallback**: If Provider A returns 5xx, retry on Provider B transparently +- **Response caching**: Cache identical requests (configurable TTL) to save money on repeated queries +- **Request transformation**: Translate between API formats (e.g., send OpenAI-format requests to Anthropic) — extending the existing developer→system role rewriting + +**Phase 4 Deliverable**: Enterprise-ready platform with team management, security compliance, and the proxy features that make it a proper AI 
gateway (not just an observer). + +--- + +### Phase 5: Platform & Ecosystem (Weeks 31+) +**Goal: From product to platform.** + +#### 5.1 Managed Cloud Offering +- **Hosted proxy endpoints**: Customers get a dedicated proxy URL (e.g., `https://yourteam.openclaw.dev/v1/messages`) +- **Usage-based pricing**: Free tier → Pro → Enterprise (see Pricing section) +- **Global edge deployment**: Proxy instances on major cloud regions for low-latency forwarding +- **Uptime SLA**: 99.9% for Pro, 99.95% for Enterprise + +#### 5.2 Optional SDK (Deeper Visibility) +For customers who want visibility beyond what a proxy can capture: + +- **Lightweight tracing SDK**: Annotate specific code paths with custom spans (e.g., "RAG retrieval took 200ms and returned 5 chunks") +- **Agent framework integrations**: First-class plugins for LangChain, CrewAI, AutoGen, OpenClaw (your own framework) +- **Hybrid mode**: SDK traces merge with proxy telemetry into a unified timeline + +#### 5.3 Evaluation & Quality +- **Output scoring**: Attach quality scores to responses (manual or automated via LLM-as-judge) +- **Regression detection**: Alert when output quality drops for a given agent/workflow +- **A/B testing**: Route traffic between model variants and compare cost vs. 
quality +- **Prompt playground**: Test prompt changes against historical inputs and see projected cost/quality impact + +#### 5.4 Community & Marketplace +- **Provider adapter marketplace**: Community-contributed adapters for niche providers +- **Dashboard template sharing**: Pre-built dashboard layouts for common use cases (chatbot monitoring, agent fleet management, batch processing analytics) +- **Open source core**: Core proxy + adapters open source; dashboard, intelligence features, and managed cloud as commercial offerings + +--- + +## Pricing Model (Proposed) + +Based on market analysis, the following structure balances adoption friction with revenue: + +| Tier | Price | Includes | +|------|-------|----------| +| **Free** | $0 | 10K requests/month, 1 agent, 7-day retention, community support | +| **Pro** | $49/month | 500K requests/month, unlimited agents, 90-day retention, alerts & budgets, email support | +| **Team** | $199/month | 2M requests/month, RBAC (up to 10 seats), 1-year retention, smart recommendations, Slack/webhook alerts | +| **Enterprise** | Custom | Unlimited requests, SSO/SAML, audit logs, custom retention, SLA, dedicated support, on-prem/BYOC option | +| **Self-Hosted** | Free (open source core) | Unlimited, community support only. Commercial add-ons for enterprise features. 
| + +**Why this works:** +- Free tier removes all adoption friction (competitive with Helicone's 10K free, Braintrust's 1M free) +- $49 Pro undercuts Helicone ($79) and Portkey ($49+) while including features they gate behind higher tiers +- Self-hosted option builds trust and community (Langfuse's model proves this works) +- Enterprise tier captures high-value customers who need compliance and SLAs + +--- + +## Technical Architecture (Target State) + +``` + ┌──────────────────────────────┐ + │ Load Balancer │ + │ (nginx / cloud ALB) │ + └──────────┬───────────────────┘ + │ + ┌────────────────┼────────────────┐ + │ │ │ + ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐ + │ Proxy │ │ Proxy │ │ Proxy │ + │ Instance │ │ Instance │ │ Instance │ + │ (FastAPI) │ │ (FastAPI) │ │ (FastAPI) │ + └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ + │ │ │ + └────────────────┼────────────────┘ + │ + ┌────────────────┼────────────────┐ + │ │ │ + ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐ + │ PostgreSQL │ │ TimescaleDB│ │ Redis │ + │ (config, │ │ (usage │ │ (sessions, │ + │ tenants, │ │ metrics, │ │ rate │ + │ API keys) │ │ time- │ │ limits, │ + │ │ │ series) │ │ cache) │ + └────────────┘ └────────────┘ └────────────┘ + │ + ┌──────▼──────┐ + │ Dashboard │ + │ (Next.js / │ + │ SvelteKit) │ + └─────────────┘ +``` + +### Key Architectural Decisions + +1. **Keep the proxy in Python/FastAPI** — Rewriting in Rust (like Helicone) would reduce latency but massively increase development time. FastAPI with httpx async is fast enough (<10ms overhead) for the initial product. Optimize later if latency becomes a measurable customer concern. + +2. **TimescaleDB over ClickHouse** — TimescaleDB is PostgreSQL-compatible (one fewer technology to operate), handles the insert volume we'll see for the first 1000 customers, and supports continuous aggregates for rollup queries. ClickHouse is better at extreme scale but adds operational complexity. + +3. **Stateless proxy instances** — All state in the database. 
Proxy instances can scale horizontally behind a load balancer. Sticky sessions not required. + +4. **Provider adapters as Python modules** — Not microservices. A provider adapter is a Python class with 4-5 methods. Loaded at startup based on config. This keeps the deployment simple (one binary/container) while allowing extensibility. + +--- + +## Success Metrics + +### Phase 1 (Foundation) +- Self-hosted deployment works in <15 minutes (docker compose up) +- Supports 3+ providers (Anthropic, OpenAI-compatible, Google) +- <15ms proxy overhead at p99 + +### Phase 2 (Dashboard) +- Dashboard loads in <2 seconds +- Users can answer "how much did Agent X cost this week?" in <10 seconds +- Prompt economics view shows cost attribution data no other tool provides + +### Phase 3 (Intelligence) +- Recommendations surface actionable savings (target: median user finds 20%+ cost reduction opportunity within first week) +- Alert→resolution time under 5 minutes for budget breaches +- 3+ integration channels supported (Slack, email, webhook) + +### Phase 4 (Enterprise) +- SOC 2 Type II compliant +- Supports 100+ concurrent agents per tenant without degradation +- <5 second query time on 90-day aggregations + +### Product-Market Fit Indicators +- Free→Pro conversion rate >5% +- Net revenue retention >120% (teams expand usage over time) +- Weekly active dashboard users >60% of paying customers + +--- + +## Open Questions & Risks + +1. **Build vs. contribute**: Helicone is open source. Should we build from scratch or fork/extend Helicone's proxy layer and differentiate on the intelligence/analytics layer? + +2. **Python performance ceiling**: FastAPI/httpx adds ~5-10ms overhead. Helicone's Rust proxy adds ~50-80ms (but does more work at the edge). Is our Python advantage real, or will we need Rust eventually? + +3. **Prompt decomposition portability**: The current system prompt analysis is tightly coupled to OpenClaw's markdown structure (AGENTS.md, SOUL.md, etc.). 
How do we generalize this for arbitrary agent frameworks? Possible approach: let users define their own "prompt component" patterns via regex or markers. + +4. **Market timing**: The LLM observability market is crowding fast. Speed to market matters more than feature completeness. The MVP should ship the moment Phase 1 + core Phase 2 views are ready. + +5. **Self-hosted vs. cloud-first**: Langfuse proved that open-source-first builds community and trust. But cloud-hosted generates revenue faster. Recommendation: open source the core proxy from day one, cloud-host the dashboard and intelligence features. + +6. **Naming**: "OpenClaw Token Monitor" describes what it does today. A product name should convey the broader vision. Candidates: "OpenClaw Observatory", "Clawmetrics", or keep "Token Monitor" for its directness. diff --git a/docs/TOKEN-SPY.md b/docs/TOKEN-SPY.md index 0bc7b238c..56301faef 100644 --- a/docs/TOKEN-SPY.md +++ b/docs/TOKEN-SPY.md @@ -244,3 +244,11 @@ token-spy/providers/ ``` Add new providers by subclassing `LLMProvider` and decorating with `@register_provider("name")`. + +--- + +## Further Reading + +- [TOKEN-MONITOR-PRODUCT-SCOPE.md](TOKEN-MONITOR-PRODUCT-SCOPE.md) — Product + roadmap, architecture decisions, and the vision for Token Spy as a standalone + monitoring product diff --git a/docs/cookbook/01-voice-agent-setup.md b/docs/cookbook/01-voice-agent-setup.md new file mode 100644 index 000000000..bc99eadc2 --- /dev/null +++ b/docs/cookbook/01-voice-agent-setup.md @@ -0,0 +1,179 @@ +# Recipe 1: Local Voice Agent System + +*Local AI Cookbook | Lighthouse AI* + +A practical guide for setting up a local voice agent using Whisper, vLLM, and Kokoro. 
+ +--- + +## Components + +| Component | Purpose | Model | +|-----------|---------|-------| +| **Whisper** | Speech-to-text | faster-whisper (medium/large) | +| **vLLM** | Conversation engine | Qwen2.5-32B-AWQ | +| **Kokoro** | Text-to-speech | Kokoro-82M | + +--- + +## Hardware Requirements + +### Minimum (development/testing) +- **CPU:** Intel Core i5 / AMD Ryzen 5 +- **RAM:** 16 GB +- **GPU:** RTX 3060 12GB +- **Storage:** 50 GB SSD + +### Recommended (production) +- **CPU:** Intel Core i7 / AMD Ryzen 7 +- **RAM:** 32 GB +- **GPU:** RTX 4090 24GB or RTX 6000 48GB +- **Storage:** 100 GB NVMe SSD + +--- + +## Software Dependencies + +```bash +# System packages +sudo apt update +sudo apt install -y python3.11 python3.11-venv nvidia-driver-535 docker.io + +# CUDA Toolkit (for GPU acceleration) +wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run +sudo sh cuda_12.1.1_530.30.02_linux.run + +# Verify CUDA +nvidia-smi +``` + +--- + +## Installation + +### 1. Whisper (Speech-to-Text) + +```bash +# Using faster-whisper for better performance +pip install faster-whisper + +# Or via Docker +docker run -d --gpus all \ + -p 8001:8000 \ + --name whisper \ + fedirz/faster-whisper-server:latest-cuda +``` + +### 2. vLLM (Conversation) + +```bash +# Install vLLM +pip install vllm + +# Start server with Qwen 32B (quantized) +# Note: Use Coder variant for code-heavy tasks, base for general +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen2.5-32B-Instruct-AWQ \ + --quantization awq \ + --dtype float16 \ + --gpu-memory-utilization 0.9 \ + --max-model-len 32768 \ + --enable-auto-tool-choice \ + --tool-call-parser hermes \ + --port 8000 +``` + +> **Multi-node tip:** If you have multiple GPUs/nodes, run different model variants on each (e.g., Coder on node A, general on node B) and use a proxy for round-robin routing. + +### 3. 
Kokoro (Text-to-Speech)
+
+```bash
+# Clone and install
+git clone https://github.com/hexgrad/kokoro
+cd kokoro
+pip install -e .
+
+# Start server
+python server.py --port 8002
+```
+
+---
+
+## Low Latency Configuration
+
+### Key optimizations:
+
+1. **Use streaming responses** — Don't wait for complete generation
+2. **Enable KV cache** — Reduces repeated computation
+3. **Use Flash Attention** — Faster attention mechanism
+4. **Optimize batch size** — Balance throughput vs latency
+
+```python
+# vLLM streaming example
+from openai import OpenAI
+
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
+
+stream = client.chat.completions.create(
+    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
+    messages=[{"role": "user", "content": "Hello!"}],
+    stream=True
+)
+
+for chunk in stream:
+    # Guard: the first and last chunks carry only role/finish metadata, no content
+    if chunk.choices and chunk.choices[0].delta.content:
+        print(chunk.choices[0].delta.content, end="")
+```
+
+---
+
+## Common Pitfalls
+
+| Issue | Cause | Solution |
+|-------|-------|----------|
+| OOM errors | Model too large | Use AWQ/GPTQ quantization |
+| High latency | No GPU | Enable CUDA, check `nvidia-smi` |
+| Audio glitches | Buffer underrun | Increase buffer size, use streaming |
+| Whisper timeouts | Long audio | Chunk audio into segments |
+
+---
+
+## Performance Tuning
+
+1. **VRAM allocation:** Set `--gpu-memory-utilization 0.9` for max usage
+2. **Context length:** Reduce if not needed (saves memory)
+3. **Concurrent requests:** Use `--max-num-seqs` to limit parallel requests
+4. **Docker networking:** Use `--network host` for lowest latency
+
+---
+
+## Example Pipeline
+
+```python
+import whisper
+from openai import OpenAI
+import kokoro
+
+# 1. Transcribe audio
+model = whisper.load_model("medium")
+result = model.transcribe("audio.wav")
+user_text = result["text"]
+
+# 2. 
Generate response +client = OpenAI(base_url="http://localhost:8000/v1", api_key="none") +response = client.chat.completions.create( + model="Qwen/Qwen2.5-32B-Instruct-AWQ", + messages=[{"role": "user", "content": user_text}] +) +assistant_text = response.choices[0].message.content + +# 3. Synthesize speech +audio = kokoro.synthesize(assistant_text) +audio.save("response.wav") +``` + +--- + +**Related:** [research/GPU-TTS-BENCHMARK.md](../research/GPU-TTS-BENCHMARK.md) — +TTS latency benchmarks for GPU vs CPU and concurrency scaling. + +*This recipe is part of the Local AI Cookbook — practical guides for self-hosted AI systems.* diff --git a/docs/cookbook/02-document-qa-setup.md b/docs/cookbook/02-document-qa-setup.md new file mode 100644 index 000000000..2dc3b8be1 --- /dev/null +++ b/docs/cookbook/02-document-qa-setup.md @@ -0,0 +1,190 @@ +# Recipe 2: Local Document Q&A System + +*Local AI Cookbook | Lighthouse AI* + +A practical guide for building a RAG-based document Q&A system with local models. 
+ +--- + +## Components + +| Component | Purpose | Options | +|-----------|---------|---------| +| **Embeddings** | Vector representation | BGE, E5, Sentence Transformers | +| **Vector DB** | Similarity search | Qdrant, ChromaDB, FAISS | +| **LLM** | Answer generation | Qwen, Llama via vLLM | + +--- + +## Hardware Requirements + +### CPU-only (small datasets) +- **RAM:** 16 GB +- **Storage:** 50 GB SSD +- Suitable for: <10K documents, low QPS + +### GPU-accelerated (production) +- **GPU:** RTX 4090 24GB or better +- **RAM:** 32 GB +- **Storage:** 100 GB NVMe SSD +- Suitable for: Large document sets, real-time queries + +--- + +## Choosing an Embedding Model + +| Model | Dimensions | Quality | Speed | +|-------|------------|---------|-------| +| `all-MiniLM-L6-v2` | 384 | Good | Fast | +| `bge-large-en-v1.5` | 1024 | Excellent | Medium | +| `e5-large-v2` | 1024 | Excellent | Medium | +| `nomic-embed-text-v1` | 768 | Very Good | Fast | + +**Recommendation:** Start with `all-MiniLM-L6-v2` for prototyping, upgrade to `bge-large` for production. 
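
Whichever embedding model you choose, retrieval ultimately ranks chunks by cosine similarity between the query vector and each stored chunk vector. A dependency-free sketch of the math:

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    # 1.0 = same direction (very similar), 0.0 = orthogonal (unrelated)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([2.0, 0.0], [2.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

The vector database does exactly this at scale (`Distance.COSINE` in the Qdrant setup below), just over millions of stored vectors with an index instead of a linear scan.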
+
+---
+
+## Vector Database Setup
+
+### Option A: Qdrant (Recommended)
+
+```bash
+# Run via Docker
+docker run -d --name qdrant \
+    -p 6333:6333 \
+    -v $(pwd)/qdrant_storage:/qdrant/storage \
+    qdrant/qdrant
+
+# Verify
+curl http://localhost:6333/healthz
+```
+
+### Option B: ChromaDB (Simpler)
+
+```bash
+pip install chromadb
+```
+
+```python
+import chromadb
+client = chromadb.PersistentClient(path="./chroma_db")
+collection = client.create_collection("documents")
+```
+
+---
+
+## RAG Pipeline Architecture
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│    User     │────>│  Embeddings │────>│  Vector DB  │
+│    Query    │     │    Model    │     │   Search    │
+└─────────────┘     └─────────────┘     └──────┬──────┘
+                                               │
+                    ┌─────────────┐     ┌──────v──────┐
+                    │   Answer    │<────│     LLM     │
+                    │             │     │   (vLLM)    │
+                    └─────────────┘     └─────────────┘
+```
+
+---
+
+## Document Chunking Strategies
+
+### Fixed-size chunks
+```python
+def chunk_text(text, chunk_size=512, overlap=50):
+    chunks = []
+    for i in range(0, len(text), chunk_size - overlap):
+        chunks.append(text[i:i + chunk_size])
+    return chunks
+```
+
+### Semantic chunking (better quality)
+```python
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+splitter = RecursiveCharacterTextSplitter(
+    chunk_size=500,
+    chunk_overlap=50,
+    separators=["\n\n", "\n", ". ", " ", ""]
+)
+chunks = splitter.split_text(document)
+```
+
+**Best practices:**
+- Chunk size: 256-512 tokens for most use cases
+- Overlap: 10-20% of chunk size
+- Preserve paragraph/sentence boundaries when possible
+
+---
+
+## Complete Implementation
+
+```python
+import uuid
+
+from sentence_transformers import SentenceTransformer
+from qdrant_client import QdrantClient
+from qdrant_client.models import Distance, VectorParams, PointStruct
+from openai import OpenAI
+
+# Initialize components
+embedder = SentenceTransformer('all-MiniLM-L6-v2')
+qdrant = QdrantClient("localhost", port=6333)
+llm = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
+
+# Create collection
+qdrant.create_collection(
+    collection_name="docs",
+    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
+)
+
+# Index documents
+def index_document(doc_id: str, text: str):
+    chunks = chunk_text(text)
+    for i, chunk in enumerate(chunks):
+        vector = embedder.encode(chunk).tolist()
+        qdrant.upsert(
+            collection_name="docs",
+            points=[PointStruct(
+                # Qdrant point IDs must be unsigned ints or UUID strings,
+                # so derive a deterministic UUID from the doc/chunk key
+                id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}_{i}")),
+                vector=vector,
+                payload={"text": chunk, "doc_id": doc_id}
+            )]
+        )
+
+# Query
+def query(question: str, top_k: int = 3):
+    query_vector = embedder.encode(question).tolist()
+    results = qdrant.search(
+        collection_name="docs",
+        query_vector=query_vector,
+        limit=top_k
+    )
+
+    context = "\n\n".join([r.payload["text"] for r in results])
+
+    response = llm.chat.completions.create(
+        model="Qwen/Qwen2.5-32B-Instruct-AWQ",
+        messages=[
+            {"role": "system", "content": f"Answer based on this context:\n{context}"},
+            {"role": "user", "content": question}
+        ]
+    )
+
+    return response.choices[0].message.content
+```
+
+---
+
+## Query Optimization Tips
+
+1. **Hybrid search:** Combine vector + keyword search for better recall
+2. **Re-ranking:** Use a cross-encoder to re-rank initial results
+3. **Query expansion:** Generate multiple query variations
+4. **Metadata filtering:** Use doc type, date, etc. 
to narrow search
+
+---
+
+**Related:** [research/OSS-MODEL-LANDSCAPE-2026-02.md](../research/OSS-MODEL-LANDSCAPE-2026-02.md) —
+Open-source model comparison to help choose the right LLM for your Q&A pipeline.
+
+*This recipe is part of the Local AI Cookbook — practical guides for self-hosted AI systems.*
diff --git a/docs/cookbook/03-code-assistant-setup.md b/docs/cookbook/03-code-assistant-setup.md
new file mode 100644
index 000000000..c359c4fdd
--- /dev/null
+++ b/docs/cookbook/03-code-assistant-setup.md
@@ -0,0 +1,254 @@
+# Recipe 3: Local Code Assistant
+
+*Lighthouse AI Cookbook | 2026-02-09*
+
+A practical guide for setting up a local code assistant using Qwen2.5-Coder via vLLM.
+
+---
+
+## Components
+
+| Component | Purpose | Model |
+|-----------|---------|-------|
+| **vLLM** | Inference server | Qwen2.5-Coder-32B-AWQ |
+| **Tool calling** | File ops, shell commands | Hermes parser |
+| **Integration** | IDE, CLI, API | VS Code, Continue, etc. |
+
+---
+
+## Hardware Requirements
+
+| Model Size | GPU | VRAM | Notes |
+|------------|-----|------|-------|
+| 7B | RTX 3060 12GB | ~6GB | Good for simple tasks |
+| 14B | RTX 4070 12GB | ~8GB | Balanced |
+| 32B AWQ | RTX 4090 24GB | ~18GB | Best quality |
+| 32B AWQ | RTX 6000 48GB | ~18GB | Production with headroom |
+
+**Recommendation:** Qwen2.5-Coder-32B-AWQ on RTX 4090 for best quality/cost balance.
+
+---
+
+## vLLM Configuration
+
+### Start the server
+
+```bash
+python -m vllm.entrypoints.openai.api_server \
+    --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
+    --quantization awq \
+    --dtype float16 \
+    --gpu-memory-utilization 0.9 \
+    --max-model-len 32768 \
+    --enable-auto-tool-choice \
+    --tool-call-parser hermes \
+    --port 8000
+```
+
+> **Note:** If running on a remote node, replace `localhost` with the node's IP or hostname in client configurations below.
+
+### Key flags explained
+
+| Flag | Purpose |
+|------|---------|
+| `--quantization awq` | 4-bit quantization, reduces VRAM |
+| `--max-model-len 32768` | Context window size |
+| `--enable-auto-tool-choice` | Enable function calling |
+| `--tool-call-parser hermes` | Parser for tool calls (critical!) |
+
+---
+
+## Tool Calling Setup
+
+### Define tools in OpenAI format
+
+```python
+tools = [
+    {
+        "type": "function",
+        "function": {
+            "name": "read_file",
+            "description": "Read contents of a file",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "path": {"type": "string", "description": "File path"}
+                },
+                "required": ["path"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "write_file",
+            "description": "Write contents to a file",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "path": {"type": "string"},
+                    "content": {"type": "string"}
+                },
+                "required": ["path", "content"]
+            }
+        }
+    },
+    {
+        "type": "function",
+        "function": {
+            "name": "run_command",
+            "description": "Execute a shell command",
+            "parameters": {
+                "type": "object",
+                "properties": {
+                    "command": {"type": "string"}
+                },
+                "required": ["command"]
+            }
+        }
+    }
+]
+```
+
+### Execute tool calls
+
+```python
+from openai import OpenAI
+import subprocess
+import json
+
+# Point to your vLLM server (swap localhost for the node's IP if remote)
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
+
+def execute_tool(tool_call):
+    name = tool_call.function.name
+    args = json.loads(tool_call.function.arguments)
+
+    if name == "read_file":
+        with open(args["path"], "r") as f:
+            return f.read()
+    elif name == "write_file":
+        with open(args["path"], "w") as f:
+            f.write(args["content"])
+        return "File written successfully"
+    elif name == "run_command":
+        result = subprocess.run(args["command"], shell=True, capture_output=True)
+        return result.stdout.decode()
+
+# Use in conversation loop
+response = client.chat.completions.create(
+    
model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", + messages=[{"role": "user", "content": "Read main.py and add error handling"}], + tools=tools +) + +if response.choices[0].message.tool_calls: + for tool_call in response.choices[0].message.tool_calls: + result = execute_tool(tool_call) + # Continue conversation with tool result... +``` + +--- + +## Context Window Management + +### For large codebases + +1. **Selective inclusion:** Only include relevant files +2. **Summarization:** Summarize large files +3. **Chunking:** Process in segments + +```python +def get_codebase_context(paths, max_tokens=16000): + context = [] + total_tokens = 0 + + for path in paths: + with open(path, "r") as f: + content = f.read() + + # Rough token estimate (4 chars per token) + tokens = len(content) // 4 + + if total_tokens + tokens < max_tokens: + context.append(f"# {path}\n```\n{content}\n```") + total_tokens += tokens + else: + # Summarize or truncate + context.append(f"# {path} (truncated)\n```\n{content[:2000]}...\n```") + + return "\n\n".join(context) +``` + +--- + +## Prompt Engineering for Code + +### System prompt template + +``` +You are an expert software engineer. You have access to tools for reading files, writing files, and running commands. + +When asked to modify code: +1. First read the relevant files +2. Understand the existing structure +3. Make minimal, targeted changes +4. Test your changes if possible +5. Explain what you changed and why + +Always write clean, well-documented code that follows best practices. +``` + +### Effective prompts + +| Task | Prompt Style | +|------|--------------| +| Bug fix | "Fix the bug in X where Y happens instead of Z" | +| Feature | "Add a feature that does X. It should work like Y." | +| Refactor | "Refactor X to use Y pattern. Keep behavior identical." 
| +| Review | "Review this code for issues: security, performance, style" | + +--- + +## Performance: Local vs Cloud + +| Metric | Local (32B AWQ) | Cloud (GPT-4) | +|--------|-----------------|---------------| +| Latency (first token) | 200-500ms | 500-2000ms | +| Throughput | ~30 tok/s | ~50 tok/s | +| Privacy | Complete | Data leaves | +| Cost | $0 per query | $0.03-0.06/1K tokens | +| Availability | 100% | Depends on API | + +**Bottom line:** Local wins on privacy and cost; cloud wins on peak quality for complex tasks. + +--- + +## Integration Options + +### VS Code (Continue extension) +```json +// .continue/config.json +{ + "models": [{ + "title": "Local Qwen Coder", + "provider": "openai", + "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ", + "apiBase": "http://localhost:8000/v1" + }] +} +``` + +### CLI wrapper +```bash +#!/bin/bash +# code-assist.sh +curl -s http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d "{\"model\": \"Qwen/Qwen2.5-Coder-32B-Instruct-AWQ\", \"messages\": [{\"role\": \"user\", \"content\": \"$1\"}]}" \ + | jq -r '.choices[0].message.content' +``` + +--- + +*This recipe is part of the Lighthouse AI Cookbook -- practical guides for self-hosted AI systems.* diff --git a/docs/cookbook/04-privacy-proxy-setup.md b/docs/cookbook/04-privacy-proxy-setup.md new file mode 100644 index 000000000..03a4d0582 --- /dev/null +++ b/docs/cookbook/04-privacy-proxy-setup.md @@ -0,0 +1,272 @@ +# Recipe 4: Privacy-Preserving API Proxy + +*Lighthouse AI Cookbook | 2026-02-09* + +A practical guide for building an API proxy that strips sensitive data before sending to cloud AI. + +--- + +## Use Case + +Organizations that need to: +- Use cloud AI APIs (OpenAI, Anthropic, etc.) +- Protect sensitive data (PII, credentials, internal info) +- Maintain compliance (GDPR, HIPAA, etc.) 
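
The whole trick is a reversible substitution: swap sensitive values for placeholders before the request leaves your network, then map them back in the response. A minimal sketch with a single regex detector (the sections below add real entity detection):

```python
import re

def scrub(text, mapping):
    # Replace each email with a stable placeholder, remembering the original
    def sub(match):
        original = match.group()
        return mapping.setdefault(original, f"EMAIL_{len(mapping)}")
    return re.sub(r"[\w.-]+@[\w.-]+\.\w+", sub, text)

def restore(text, mapping):
    # Put the original values back into the model's response
    for original, placeholder in mapping.items():
        text = text.replace(placeholder, original)
    return text

mapping = {}
out = scrub("Contact alice@example.com about the invoice.", mapping)
print(out)                    # Contact EMAIL_0 about the invoice.
print(restore(out, mapping))  # Contact alice@example.com about the invoice.
```

The same mapping must persist across turns of a conversation so the model sees consistent placeholders — which is why the proxy implementation below keeps per-session state.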
+ +--- + +## Architecture + +``` +┌─────────────┐ ┌─────────────┐ ┌─────────────┐ +│ Client │────>│ Proxy │────>│ Cloud API │ +│ Request │ │ (anonymize) │ │ (OpenAI) │ +└─────────────┘ └──────┬──────┘ └──────┬──────┘ + │ │ + ┌─────▼─────┐ ┌─────▼─────┐ + │ Entity │ │ Response │ + │ Mapping │◀───────│ (raw) │ + └───────────┘ └───────────┘ + │ + ┌─────▼─────┐ + │ Deanon- │ + │ ymize │ + └─────┬─────┘ + │ + ┌─────▼─────┐ + │ Client │ + │ Response │ + └───────────┘ +``` + +--- + +## Entity Detection Approaches + +### 1. Regex (simple, fast) + +```python +import re + +PATTERNS = { + "email": r"[\w.-]+@[\w.-]+\.\w+", + "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", + "ssn": r"\b\d{3}-\d{2}-\d{4}\b", + "credit_card": r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", + "ip_address": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b", +} + +def detect_with_regex(text): + entities = [] + for entity_type, pattern in PATTERNS.items(): + for match in re.finditer(pattern, text): + entities.append({ + "text": match.group(), + "type": entity_type, + "start": match.start(), + "end": match.end() + }) + return entities +``` + +### 2. Presidio (comprehensive, production-ready) + +```python +from presidio_analyzer import AnalyzerEngine +from presidio_anonymizer import AnonymizerEngine + +analyzer = AnalyzerEngine() +anonymizer = AnonymizerEngine() + +def detect_with_presidio(text, language="en"): + results = analyzer.analyze(text=text, language=language) + return [ + {"text": text[r.start:r.end], "type": r.entity_type, + "start": r.start, "end": r.end, "score": r.score} + for r in results + ] +``` + +### 3. 
spaCy NER (names, organizations, locations) + +```python +import spacy +nlp = spacy.load("en_core_web_sm") + +def detect_with_spacy(text): + doc = nlp(text) + return [ + {"text": ent.text, "type": ent.label_, + "start": ent.start_char, "end": ent.end_char} + for ent in doc.ents + ] +``` + +--- + +## Anonymization Strategies + +### Redaction (simple) +```python +def redact(text, entities): + for entity in sorted(entities, key=lambda x: x["start"], reverse=True): + text = text[:entity["start"]] + f"[{entity['type']}]" + text[entity["end"]:] + return text +``` + +### Pseudonymization (reversible) +```python +import hashlib + +def pseudonymize(text, entities, mapping=None): + if mapping is None: + mapping = {} + + for entity in sorted(entities, key=lambda x: x["start"], reverse=True): + original = entity["text"] + if original not in mapping: + pseudo = f"ENTITY_{len(mapping):04d}" + mapping[original] = pseudo + text = text[:entity["start"]] + mapping[original] + text[entity["end"]:] + + return text, mapping + +def deanonymize(text, mapping): + reverse_mapping = {v: k for k, v in mapping.items()} + for pseudo, original in reverse_mapping.items(): + text = text.replace(pseudo, original) + return text +``` + +--- + +## Complete Proxy Implementation + +```python +from flask import Flask, request, jsonify +import requests +from presidio_analyzer import AnalyzerEngine +from presidio_anonymizer import AnonymizerEngine + +app = Flask(__name__) +analyzer = AnalyzerEngine() + +# Session storage for multi-turn +sessions = {} + +@app.route("/v1/chat/completions", methods=["POST"]) +def proxy_chat(): + data = request.json + session_id = request.headers.get("X-Session-ID", "default") + + # Get or create session mapping + if session_id not in sessions: + sessions[session_id] = {} + mapping = sessions[session_id] + + # Anonymize messages + anonymized_messages = [] + for msg in data["messages"]: + anon_content, mapping = pseudonymize_with_presidio( + msg["content"], mapping + ) + 
anonymized_messages.append({ + "role": msg["role"], + "content": anon_content + }) + + # Update session + sessions[session_id] = mapping + + # Forward to real API + data["messages"] = anonymized_messages + response = requests.post( + "https://api.openai.com/v1/chat/completions", + headers={"Authorization": f"Bearer {OPENAI_API_KEY}"}, + json=data + ) + + # Deanonymize response + result = response.json() + if "choices" in result: + for choice in result["choices"]: + choice["message"]["content"] = deanonymize( + choice["message"]["content"], mapping + ) + + return jsonify(result) + +def pseudonymize_with_presidio(text, mapping): + results = analyzer.analyze(text=text, language="en") + + # Sort by position (reverse) for safe replacement + for r in sorted(results, key=lambda x: x.start, reverse=True): + original = text[r.start:r.end] + if original not in mapping: + mapping[original] = f"[{r.entity_type}_{len(mapping):04d}]" + text = text[:r.start] + mapping[original] + text[r.end:] + + return text, mapping + +if __name__ == "__main__": + app.run(port=8085) +``` + +--- + +## Performance Considerations + +| Stage | Latency | Optimization | +|-------|---------|--------------| +| Entity detection | 10-50ms | Use regex for simple patterns | +| Anonymization | 1-5ms | In-memory string ops | +| API call | 500-2000ms | This is the bottleneck | +| Deanonymization | 1-5ms | Cache reverse mappings | + +**Total overhead:** ~15-60ms (negligible vs API latency) + +--- + +## Security Best Practices + +1. **Never log original data** -- Only log anonymized versions +2. **Encrypt mapping storage** -- Session mappings are sensitive +3. **Use TLS** -- All communication encrypted +4. **Audit access** -- Log who accesses what, when +5. **Rotate session mappings** -- Don't reuse indefinitely +6. 
**Validate inputs** -- Prevent injection attacks + +--- + +## Custom Entity Types + +Add domain-specific patterns: + +```python +# Domain-specific entities +CUSTOM_PATTERNS = { + "service_order": r"SO-\d{6,8}", + "customer_id": r"CID-\d{4,6}", + "equipment_serial": r"[A-Z]{3}\d{8,12}", +} + +# API keys +API_KEY_PATTERNS = { + "openai_key": r"sk-[a-zA-Z0-9]{48}", + "anthropic_key": r"sk-ant-[a-zA-Z0-9-]{95}", + "github_token": r"gh[ps]_[a-zA-Z0-9]{36}", +} +``` + +--- + +## Production Considerations + +A production implementation should include extended recognizers: +- Additional entity types (API keys, cloud credentials, internal IPs) +- Session-based multi-turn conversation support +- Security audit logging and documentation + +--- + +*This recipe is part of the Lighthouse AI Cookbook -- practical guides for self-hosted AI systems.* diff --git a/docs/cookbook/05-multi-gpu-cluster.md b/docs/cookbook/05-multi-gpu-cluster.md new file mode 100644 index 000000000..8a737d97b --- /dev/null +++ b/docs/cookbook/05-multi-gpu-cluster.md @@ -0,0 +1,370 @@ +# Multi-GPU Cluster Setup Guide + +*Lighthouse AI Cookbook -- based on a dual RTX PRO 6000 Blackwell (96GB each) production setup* + +## Hardware Topology + +### NVLink vs PCIe + +| Interconnect | Bandwidth | Best For | +|--------------|-----------|----------| +| NVLink | 600+ GB/s | Tensor parallelism, large model sharding | +| PCIe 4.0 | ~32 GB/s | Independent services, pipeline parallelism | +| PCIe 5.0 | ~64 GB/s | Mixed workloads | + +**When to use NVLink:** +- Running single models split across GPUs (tensor parallelism) +- High inter-GPU communication needed +- Maximum throughput for parallel processing + +**When PCIe is fine:** +- Running separate services on each GPU (LLM on one, STT on another) +- Independent workloads with minimal data sharing +- Cost-sensitive deployments + +## Load Balancing Strategies + +### Round-Robin +``` +Request 1 → GPU 0 +Request 2 → GPU 1 +Request 3 → GPU 0 +... 
+``` +**Best for:** Evenly balanced, stateless workloads + +### VRAM-Based Routing +``` +if GPU_0.vram_free > GPU_1.vram_free: + route_to(GPU_0) +else: + route_to(GPU_1) +``` +**Best for:** Variable-size requests, preventing OOM + +### Least-Connections +``` +route_to(gpu_with_fewest_active_requests) +``` +**Best for:** Requests with variable processing time + +### Model Sharding +``` +[Model Layer 1-16] → GPU 0 +[Model Layer 17-32] → GPU 1 +``` +**Best for:** Models too large for single GPU + +## vLLM Multi-GPU Configuration + +### Tensor Parallelism (TP) +Splits model layers horizontally across GPUs. + +```bash +# 2-GPU tensor parallel +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen2.5-72B-Instruct \ + --tensor-parallel-size 2 \ + --port 8000 +``` + +**When to use:** +- Model too large for single GPU VRAM +- GPUs connected via NVLink +- Latency-sensitive (single request uses all GPUs) + +### Pipeline Parallelism (PP) +Splits model layers vertically (sequential stages). 
+ +```bash +# 2-GPU pipeline parallel +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen2.5-72B-Instruct \ + --pipeline-parallel-size 2 \ + --port 8000 +``` + +**When to use:** +- High throughput needed (batch processing) +- GPUs on PCIe (lower bandwidth OK) +- Can tolerate slightly higher latency + +### Hybrid (TP + PP) +```bash +# 4 GPUs: 2 TP x 2 PP +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen2.5-72B-Instruct \ + --tensor-parallel-size 2 \ + --pipeline-parallel-size 2 \ + --port 8000 +``` + +## Smart Proxy Architecture + +### Health Check Script + +```python +#!/usr/bin/env python3 +"""GPU cluster health checker.""" + +import requests +import subprocess +import json + +NODES = [ + {"name": "node_a", "ip": "NODE_A_IP", "port": 9100}, + {"name": "node_b", "ip": "NODE_B_IP", "port": 9100}, +] + +def check_vllm_health(node): + """Check if vLLM is responding.""" + try: + r = requests.get(f"http://{node['ip']}:{node['port']}/v1/models", timeout=5) + return r.status_code == 200 + except Exception: + return False + +def check_embeddings_health(ip, port=9103): + """Check if embeddings service is responding (uses different endpoint).""" + try: + r = requests.post( + f"http://{ip}:{port}/v1/embeddings", + json={"input": "test", "model": "default"}, + timeout=5 + ) + return r.status_code == 200 + except Exception: + return False + +def get_gpu_stats(node): + """Get GPU stats via SSH.""" + try: + result = subprocess.run( + ["ssh", f"@{node['ip']}", + "nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader,nounits"], + capture_output=True, text=True, timeout=10 + ) + util, mem_used, mem_total, temp = result.stdout.strip().split(", ") + return { + "gpu_util": int(util), + "vram_used_gb": int(mem_used) / 1024, + "vram_total_gb": int(mem_total) / 1024, + "temp_c": int(temp) + } + except Exception: + return None + +def cluster_status(): + """Get full cluster status.""" + status = {"healthy": 
True, "nodes": []} + for node in NODES: + node_status = { + "name": node["name"], + "vllm_healthy": check_vllm_health(node), + "gpu": get_gpu_stats(node) + } + if not node_status["vllm_healthy"]: + status["healthy"] = False + status["nodes"].append(node_status) + return status + +if __name__ == "__main__": + print(json.dumps(cluster_status(), indent=2)) +``` + +### Failover Logic + +```python +def route_request(request): + """Route to healthiest available GPU.""" + status = cluster_status() + + available = [n for n in status["nodes"] if n["vllm_healthy"]] + + if not available: + raise Exception("No healthy nodes!") + + if len(available) == 1: + return available[0] # Only option + + # Route to GPU with most free VRAM + return min(available, key=lambda n: n["gpu"]["vram_used_gb"]) +``` + +## Service Distribution + +### Recommended Layout (Dual 96GB GPUs) + +``` +GPU 0 (Node A) — "Coder" GPU 1 (Node B) — "Sage" +├── vLLM: Qwen2.5-Coder-32B ├── vLLM: Qwen2.5-32B +├── VRAM: ~35GB ├── VRAM: ~35GB +└── Role: Code, tool calling └── Role: General, research + +Shared Services (either GPU, load balanced): +├── Whisper STT (~2GB) +├── Kokoro TTS (~1GB) +├── Embeddings (~1GB) +└── Headroom for spikes +``` + +### Alternative: Specialized Nodes + +``` +GPU 0 — "Voice Stack" GPU 1 — "LLM Heavy" +├── Whisper Large-v3 (3GB) ├── vLLM: Qwen2.5-72B (TP=1) +├── Kokoro TTS (1GB) ├── VRAM: ~80GB +├── Small LLM for routing └── Role: Complex reasoning +└── VRAM: ~20GB total +``` + +## Monitoring + +### nvidia-smi One-Liner + +```bash +watch -n 1 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv' +``` + +### Prometheus + DCGM Exporter + +```yaml +# docker-compose.monitoring.yml +services: + dcgm-exporter: + image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.7-3.1.4-ubuntu22.04 + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: all + capabilities: [gpu] + ports: + - "9400:9400" + + prometheus: + image: 
prom/prometheus:latest + volumes: + - ./prometheus.yml:/etc/prometheus/prometheus.yml + ports: + - "9090:9090" +``` + +### Alert Thresholds + +| Metric | Warning | Critical | +|--------|---------|----------| +| GPU Utilization | >90% sustained | >95% for 5min | +| VRAM Usage | >85% | >95% | +| Temperature | >75C | >83C | +| Request Latency | >5s p95 | >10s p95 | + +## Example Configs + +### Nginx Load Balancer + +```nginx +upstream vllm_cluster { + least_conn; + server NODE_A_IP:9100 weight=1 max_fails=3 fail_timeout=30s; + server NODE_B_IP:9100 weight=1 max_fails=3 fail_timeout=30s; +} + +upstream whisper_cluster { + least_conn; + server NODE_A_IP:9101; + server NODE_B_IP:9101; +} + +server { + listen 9100; + + location / { + proxy_pass http://vllm_cluster; + proxy_connect_timeout 60s; + proxy_read_timeout 300s; # LLM can be slow + proxy_set_header X-Real-IP $remote_addr; + + # Add routing header for debugging + add_header X-Routed-To $upstream_addr; + } + + location /health { + access_log off; + return 200 "healthy\n"; + } +} +``` + +### HAProxy with Health Checks + +```haproxy +global + log stdout format raw local0 + +defaults + log global + mode http + option httplog + option dontlognull + timeout connect 10s + timeout client 300s + timeout server 300s + +frontend vllm_frontend + bind *:9100 + default_backend vllm_backend + +backend vllm_backend + balance leastconn + option httpchk GET /v1/models + http-check expect status 200 + + server node_a NODE_A_IP:8000 check inter 5s fall 3 rise 2 + server node_b NODE_B_IP:8000 check inter 5s fall 3 rise 2 +``` + +### Status Endpoint (Python/FastAPI) + +```python +from fastapi import FastAPI +import httpx + +app = FastAPI() + +NODES = [ + {"name": "node_a", "url": "http://NODE_A_IP:9100"}, + {"name": "node_b", "url": "http://NODE_B_IP:9100"}, +] + +@app.get("/status") +async def cluster_status(): + status = [] + async with httpx.AsyncClient(timeout=5) as client: + for node in NODES: + try: + r = await 
client.get(f"{node['url']}/v1/models") + status.append({"node": node["name"], "healthy": True}) + except Exception: + status.append({"node": node["name"], "healthy": False}) + return {"nodes": status, "healthy": all(n["healthy"] for n in status)} +``` + +## Quick Start Checklist + +1. [ ] Verify GPU topology: `nvidia-smi topo -m` +2. [ ] Install vLLM on all nodes +3. [ ] Configure load balancer (nginx/haproxy) +4. [ ] Set up health checks +5. [ ] Configure monitoring (prometheus + grafana) +6. [ ] Test failover by killing one node +7. [ ] Load test to find capacity limits +8. [ ] Document your specific configuration + +--- + +**Related:** [research/HARDWARE-GUIDE.md](../research/HARDWARE-GUIDE.md) — +GPU buying guide with tier rankings, used market analysis, and what NOT to buy. + +*This recipe is part of the Lighthouse AI Cookbook -- practical guides for self-hosted AI systems.* diff --git a/docs/cookbook/06-swarm-patterns.md b/docs/cookbook/06-swarm-patterns.md new file mode 100644 index 000000000..fc2fed659 --- /dev/null +++ b/docs/cookbook/06-swarm-patterns.md @@ -0,0 +1,313 @@ +# Sub-Agent Swarm Patterns + +*How to effectively use multiple AI agents in parallel with OpenClaw + local Qwen* + +## When to Swarm + +### Good for Parallelization +- Research across multiple topics +- Document processing (chunk -> process -> merge) +- Testing multiple scenarios +- Data transformation pipelines +- Independent API calls + +### Keep Sequential +- Tasks with strict dependencies +- Stateful conversations +- Tasks requiring intermediate human review +- Single complex reasoning chains + +**Rule of thumb:** If subtasks don't need each other's output, parallelize. + +## Spawn Patterns + +### 1. Fan-Out / Fan-In + +Spawn multiple agents, collect all results. 
+ +```javascript +// Fan-out: Spawn research agents for each topic +const topics = ["M1 local AI", "M2 voice agents", "M3 privacy"]; +const results = []; + +for (const topic of topics) { + sessions_spawn({ + task: `Research ${topic} and summarize findings`, + label: `research-${topic.replace(/\s/g, '-')}` + }); +} + +// Fan-in: Results come back via announcements +// Aggregate in MEMORY.md or a dedicated file +``` + +**Real example:** A mission research sweep can spawn 9 agents in parallel. + +### 2. Pipeline + +Sequential stages, each feeding the next. + +```javascript +// Stage 1: Extract +sessions_spawn({ + task: "Extract all code snippets from the document", + label: "pipeline-extract" +}); + +// Stage 2: Transform (after Stage 1 completes) +sessions_spawn({ + task: "Convert extracted snippets to Python 3.12 syntax", + label: "pipeline-transform" +}); + +// Stage 3: Load (after Stage 2 completes) +sessions_spawn({ + task: "Save transformed code to repository with tests", + label: "pipeline-load" +}); +``` + +**Best for:** ETL workflows, document processing chains. + +### 3. Hierarchical Delegation + +Manager agent spawns worker agents. + +```javascript +// Manager task +sessions_spawn({ + task: `You are a research coordinator. Break this problem into 3-5 subtasks + and spawn sub-agents for each. Aggregate their findings into a report. + + Problem: How can we optimize voice agent latency?`, + label: "research-manager" +}); +``` + +**Best for:** Complex problems that need decomposition. 
+ +## Task Decomposition + +### Chunking Strategy + +```javascript +// Bad: One agent processes everything +sessions_spawn({ task: "Process all 1000 documents" }); + +// Good: Chunk into parallelizable batches +const CHUNK_SIZE = 50; +for (let i = 0; i < documents.length; i += CHUNK_SIZE) { + const chunk = documents.slice(i, i + CHUNK_SIZE); + sessions_spawn({ + task: `Process documents ${i} to ${i + CHUNK_SIZE}`, + label: `chunk-${i}` + }); +} +``` + +### Decomposition Heuristics + +| Task Type | Decomposition | +|-----------|---------------| +| Research | By topic/question | +| Documents | By file or page range | +| Testing | By test case | +| Analysis | By data partition | + +## Result Aggregation + +### File-Based Aggregation + +```javascript +// Each agent writes to a numbered file +sessions_spawn({ + task: `Research topic X. Write findings to research/topic-x.md`, + label: "research-x" +}); + +// Aggregator agent combines them +sessions_spawn({ + task: `Read all files in research/*.md and create a combined summary`, + label: "aggregator" +}); +``` + +### Memory-Based Aggregation + +```javascript +// Agents update shared MEMORY.md +sessions_spawn({ + task: `Research X. Add key findings to MEMORY.md under "## Research Results"`, + label: "research-x" +}); +``` + +### Structured Output + +```javascript +// Request JSON for easier parsing +sessions_spawn({ + task: `Analyze the codebase. Output as JSON: + {"files_analyzed": N, "issues": [...], "recommendations": [...]}`, + label: "code-analysis" +}); +``` + +## Error Handling + +### Timeout Protection + +```javascript +sessions_spawn({ + task: "Complex research task", + label: "risky-task", + runTimeoutSeconds: 300 // Kill if takes > 5 minutes +}); +``` + +### Graceful Degradation + +```javascript +// Spawn with fallback awareness +sessions_spawn({ + task: `Try to complete this analysis. If you encounter errors or + can't complete, output: {"status": "failed", "reason": "..."}. 
+ Partial results are acceptable.`, + label: "best-effort" +}); +``` + +### Retry Pattern + +```javascript +// Main agent retries failed sub-tasks +const MAX_RETRIES = 2; +let attempt = 0; + +while (attempt < MAX_RETRIES) { + const result = await sessions_spawn({ task: "..." }); + if (result.status === "success") break; + attempt++; +} +``` + +## Resource Management + +### Concurrency Limits + +```javascript +// OpenClaw config (openclaw.json) +{ + "ai": { + "subAgent": { + "maxConcurrent": 20, // Max parallel agents + "model": "local-vllm/Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" + } + } +} +``` + +### GPU-Aware Spawning + +```javascript +// Check GPU before spawning heavy tasks +const status = await fetch("http://localhost:9199/status"); +const gpuUtil = status.nodes[0].gpu_utilization; + +if (gpuUtil > 80) { + console.log("GPU busy, queuing task for later"); +} else { + sessions_spawn({ task: "Heavy computation" }); +} +``` + +### Staggered Spawning + +```javascript +// Don't flood the GPU — stagger spawns +for (const task of tasks) { + sessions_spawn({ task }); + await sleep(2000); // 2 second gap between spawns +} +``` + +## Real Examples + +### Research Parallelization + +```javascript +// Example: parallel mission research +const missions = ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"]; + +for (const mission of missions) { + sessions_spawn({ + task: `Research ${mission} from MISSIONS.md. Provide practical findings. 
+ Output analysis as text (no file operations).`, + label: `research-${mission}`, + runTimeoutSeconds: 300 + }); +} + +// Result: 9 research docs in ~5 minutes +``` + +### Document Processing + +```javascript +// Process PDFs in parallel +const pdfs = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]; + +for (const pdf of pdfs) { + sessions_spawn({ + task: `Extract key information from ${pdf}: + - Main topics + - Key dates + - Action items + Save to processed/${pdf.replace('.pdf', '.md')}`, + label: `process-${pdf}` + }); +} +``` + +### Test Suite Parallelization + +```javascript +// Run test scenarios in parallel +const scenarios = [ + "happy path booking", + "cancellation flow", + "reschedule with conflict", + "emergency escalation" +]; + +for (const scenario of scenarios) { + sessions_spawn({ + task: `Generate test cases for: ${scenario} + Include: inputs, expected outputs, edge cases`, + label: `test-${scenario.replace(/\s/g, '-')}` + }); +} +``` + +## Patterns We've Learned + +### What Works +- Pure reasoning tasks complete reliably +- Short, focused prompts +- Explicit output format requests +- File-based result aggregation + +### What Doesn't (Until proxy v2.1) +- Heavy tool use in sub-agents +- Multi-step file operations +- Complex git workflows +- Chained tool calls + +### Optimal Task Size +- **Too small:** Overhead dominates (< 30 seconds of work) +- **Too large:** Risk of timeout or drift (> 10 minutes) +- **Sweet spot:** 1-5 minutes of focused work + +--- + +*Lighthouse AI Cookbook -- battle-tested swarm patterns on local Qwen 32B* diff --git a/docs/cookbook/08-n8n-local-llm.md b/docs/cookbook/08-n8n-local-llm.md new file mode 100644 index 000000000..84dfed851 --- /dev/null +++ b/docs/cookbook/08-n8n-local-llm.md @@ -0,0 +1,529 @@ +# Recipe 06: n8n + Local LLM Integration + +*Automate workflows with AI using your own hardware* + +--- + +## Overview + +n8n is a powerful workflow automation platform. 
Combined with local LLMs, you get: +- **Private AI automation** — data never leaves your network +- **Zero API costs** — no per-call pricing +- **Full control** — customize everything + +**Difficulty:** Intermediate | **Time:** 2-4 hours | **Prerequisites:** Basic Docker, REST APIs + +--- + +## What is n8n? + +n8n is an open-source workflow automation tool — think Zapier but self-hosted. It connects apps, services, and APIs through visual workflows. + +**Why n8n + Local AI?** +- Build email auto-responders without sending data to OpenAI +- Create document processors that run entirely on-premise +- Automate reports with AI analysis using your own GPUs + +--- + +## Architecture + +``` +┌─────────────────────────────────────────────────────────┐ +│ Your Network │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ Trigger │ ───► │ n8n │ ───► │ Action │ │ +│ │ (email, │ │ Workflow │ │ (Slack, │ │ +│ │ webhook,│ │ │ │ email, │ │ +│ │ cron) │ │ │ │ │ file) │ │ +│ └──────────┘ │ │ │ └──────────┘ │ +│ │ ▼ │ │ +│ │ ┌─────┐ │ ┌──────────┐ │ +│ │ │HTTP │──┼────► │ vLLM │ │ +│ │ │Node │ │ │ (Local) │ │ +│ │ └─────┘ │ └──────────┘ │ +│ └──────────┘ │ +│ │ +└─────────────────────────────────────────────────────────┘ +``` + +--- + +## Setup + +### 1. n8n Installation (Docker) + +**docker-compose.yml:** +```yaml +version: '3.8' + +services: + n8n: + image: n8nio/n8n:latest + ports: + - "5678:5678" + environment: + - N8N_BASIC_AUTH_ACTIVE=true + - N8N_BASIC_AUTH_USER=admin + - N8N_BASIC_AUTH_PASSWORD=changeme + - N8N_HOST=localhost + - N8N_PORT=5678 + - N8N_PROTOCOL=http + - WEBHOOK_URL=http://localhost:5678/ + volumes: + - n8n_data:/home/node/.n8n + +volumes: + n8n_data: +``` + +**Start:** +```bash +docker-compose up -d +# Access at http://localhost:5678 +``` + +### 2. 
vLLM Setup + +If you don't have vLLM running, add it to the compose: + +```yaml + vllm: + image: vllm/vllm-openai:latest + ports: + - "8000:8000" + command: > + --model Qwen/Qwen2.5-32B-Instruct-AWQ + --gpu-memory-utilization 0.9 + --max-model-len 32768 + deploy: + resources: + reservations: + devices: + - driver: nvidia + count: 1 + capabilities: [gpu] +``` + +--- + +## Connecting n8n to vLLM + +### HTTP Request Node Configuration + +In n8n, use the **HTTP Request** node to call your local LLM: + +**Settings:** +- **Method:** POST +- **URL:** `http://localhost:8000/v1/chat/completions` +- **Authentication:** None (or Bearer if configured) +- **Body Content Type:** JSON + +**JSON Body:** +```json +{ + "model": "Qwen/Qwen2.5-32B-Instruct-AWQ", + "messages": [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "{{ $json.message }}"} + ], + "temperature": 0.7, + "max_tokens": 1000 +} +``` + +**Extract Response:** +Add a **Set** node after to extract the response: +``` +{{ $json.choices[0].message.content }} +``` + +--- + +## Example Workflows + +### 1. Document Summarization Pipeline + +**Trigger:** File uploaded to folder +**Process:** Extract text → Summarize with LLM → Save summary + +``` +[Watch Folder] → [Read Binary File] → [Extract Text] → [HTTP Request (LLM)] → [Write File] +``` + +**HTTP Request Body:** +```json +{ + "model": "Qwen/Qwen2.5-32B-Instruct-AWQ", + "messages": [ + {"role": "system", "content": "Summarize the following document in 3-5 bullet points."}, + {"role": "user", "content": "{{ $json.text }}"} + ], + "max_tokens": 500 +} +``` + +--- + +### 2. 
Email Auto-Response + +**Trigger:** New email received +**Process:** Classify intent → Generate response → Queue for review + +``` +[Email Trigger] → [HTTP Request (Classify)] → [IF Node] → [HTTP Request (Generate)] → [Send Email] +``` + +**Classification prompt:** +```json +{ + "messages": [ + {"role": "system", "content": "Classify this email as: support, sales, spam, or other. Reply with just the category."}, + {"role": "user", "content": "Subject: {{ $json.subject }}\n\n{{ $json.body }}"} + ] +} +``` + +**Response generation:** +```json +{ + "messages": [ + {"role": "system", "content": "Draft a professional response to this email. Be helpful and concise."}, + {"role": "user", "content": "Email:\n{{ $json.body }}\n\nDraft a response:"} + ] +} +``` + +--- + +### 3. Slack/Discord Bot + +**Trigger:** Slack message in channel +**Process:** Call LLM → Reply in thread + +``` +[Slack Trigger] → [HTTP Request (LLM)] → [Slack (Reply)] +``` + +**Slack app configuration:** +1. Create Slack App at api.slack.com +2. Add "chat:write" and "channels:read" scopes +3. Install to workspace +4. Use OAuth token in n8n Slack credential + +**LLM Prompt:** +```json +{ + "messages": [ + {"role": "system", "content": "You are a helpful team assistant. Answer questions concisely."}, + {"role": "user", "content": "{{ $json.event.text }}"} + ] +} +``` + +--- + +### 4. 
RAG Pipeline with Webhooks + +**Trigger:** Webhook call +**Process:** Embed query → Search vectors → Generate with context + +``` +[Webhook] → [HTTP (Embeddings)] → [HTTP (Qdrant)] → [HTTP (LLM)] → [Respond to Webhook] +``` + +**Embeddings call:** +```json +POST http://:8001/v1/embeddings +{ + "model": "BAAI/bge-large-en-v1.5", + "input": "{{ $json.query }}" +} +``` + +**Qdrant search:** +```json +POST http://:6333/collections/docs/points/search +{ + "vector": {{ $json.data[0].embedding }}, + "limit": 5, + "with_payload": true +} +``` + +**LLM with context:** +```json +{ + "messages": [ + {"role": "system", "content": "Answer the question using only the provided context."}, + {"role": "user", "content": "Context:\n{{ $json.result.map(r => r.payload.text).join('\n\n') }}\n\nQuestion: {{ $node['Webhook'].json.query }}"} + ] +} +``` + +--- + +### 5. Automated Report Generation + +**Trigger:** Cron (daily/weekly) +**Process:** Fetch data → Analyze with LLM → Generate report → Email + +``` +[Schedule Trigger] → [HTTP (API)] → [HTTP (LLM Analysis)] → [Convert to PDF] → [Send Email] +``` + +**Analysis prompt:** +```json +{ + "messages": [ + {"role": "system", "content": "Analyze this data and provide insights in a professional report format with sections: Summary, Key Findings, Recommendations."}, + {"role": "user", "content": "Data:\n{{ JSON.stringify($json.data, null, 2) }}"} + ] +} +``` + +--- + +## Error Handling + +### Retry on Failure + +In HTTP Request node settings: +- **On Error:** Continue (using error output) +- **Retry on Fail:** Yes +- **Max Tries:** 3 +- **Wait Between Tries:** 1000ms + +### Timeout Handling + +LLM calls can be slow. Configure: +- **Timeout:** 120000 (2 minutes) + +### Error Notification + +Add an **IF** node to check for errors: +``` +{{ $json.error !== undefined }} +``` + +Then route to notification (Slack, email). + +--- + +## Credential Management + +### Store API Keys Securely + +In n8n, use **Credentials** for sensitive data: + +1. 
Go to Settings → Credentials +2. Create "Header Auth" credential +3. Name: `Authorization` +4. Value: `Bearer your-api-key` + +Use in HTTP Request: +- **Authentication:** Predefined Credential Type +- **Credential Type:** Header Auth + +### Environment Variables + +For sensitive data, use env vars: + +```yaml +environment: + - VLLM_API_KEY=${VLLM_API_KEY} +``` + +Access in n8n: `{{ $env.VLLM_API_KEY }}` + +--- + +## Performance Considerations + +### 1. Batch Processing + +For bulk operations, use the **Split In Batches** node: +- Process 10 items at a time +- Prevents overwhelming the LLM + +### 2. Caching + +Add a **Redis** node to cache frequent queries: +``` +[Check Cache] → [IF Found] → [Return Cached] + ↓ (not found) + [LLM] → [Store in Cache] → [Return] +``` + +### 3. Queue Long Jobs + +For heavy processing, use a queue: +- **RabbitMQ** or **Redis** for job queue +- Separate worker for LLM calls +- Webhook callback when complete + +### 4. Model Selection + +Route based on complexity: +``` +[IF Simple] → [Fast Model (7B)] + ↓ +[Complex] → [Large Model (32B)] +``` + +--- + +## Scaling Workflows + +### Horizontal Scaling + +Run multiple n8n workers: + +```yaml +services: + n8n: + image: n8nio/n8n:latest + deploy: + replicas: 3 + environment: + - EXECUTIONS_MODE=queue + - QUEUE_BULL_REDIS_HOST=redis +``` + +### Webhook Load Balancing + +Use nginx in front: + +```nginx +upstream n8n { + server n8n1:5678; + server n8n2:5678; + server n8n3:5678; +} +``` + +### Separate LLM Workers + +Dedicate GPUs to different tasks: +- Assign fast models for simple queries +- Assign large models for complex reasoning +- Use a load balancer for round-robin distribution + +--- + +## Common Pitfalls + +| Problem | Cause | Solution | +|---------|-------|----------| +| Timeout errors | LLM too slow | Increase timeout to 120s+ | +| JSON parse fails | LLM returns malformed JSON | Add "respond only with valid JSON" to prompt | +| Rate limiting | Too many concurrent calls | Add delays, use 
batching | +| Memory issues | Large payloads | Stream or chunk large documents | +| Wrong model | Hardcoded model name | Use variables for flexibility | + +--- + +## Templates + +### Basic LLM Call Function +Save as reusable workflow: + +```json +{ + "name": "LLM Call", + "nodes": [ + { + "type": "n8n-nodes-base.httpRequest", + "parameters": { + "method": "POST", + "url": "={{ $workflow.variables.vllm_url }}/v1/chat/completions", + "bodyContentType": "json", + "body": { + "model": "={{ $workflow.variables.model }}", + "messages": "={{ $input.all() }}" + } + } + } + ] +} +``` + +### Error Handler Template +```json +{ + "nodes": [ + { + "type": "n8n-nodes-base.if", + "parameters": { + "conditions": { + "string": [{"value1": "={{ $json.error }}", "operation": "isNotEmpty"}] + } + } + }, + { + "type": "n8n-nodes-base.slack", + "parameters": { + "channel": "#alerts", + "message": "Workflow error: {{ $json.error }}" + } + } + ] +} +``` + +--- + +## Complete Example: Support Ticket Triage + +**Full workflow:** + +1. **Email Trigger** — New support email +2. **Extract Data** — Get subject, body, sender +3. **Classify Priority** — LLM call to determine urgency +4. **Extract Entities** — LLM call to get product, issue type +5. **Create Ticket** — API call to ticketing system +6. **Route** — Assign to appropriate team +7. **Auto-Reply** — Generate and send acknowledgment + +**Workflow JSON snippet:** +```json +{ + "nodes": [ + { + "name": "Email Trigger", + "type": "n8n-nodes-base.emailReadImap" + }, + { + "name": "Classify Priority", + "type": "n8n-nodes-base.httpRequest", + "parameters": { + "url": "http://localhost:8000/v1/chat/completions", + "body": { + "messages": [ + {"role": "system", "content": "Classify this support email priority as: critical, high, medium, low. Reply with just the priority."}, + {"role": "user", "content": "Subject: {{ $json.subject }}\n\n{{ $json.text }}"} + ] + } + } + } + ] +} +``` + +--- + +## Next Steps + +1. 
**Build your first workflow** — Start with document summarization +2. **Add monitoring** — Track success rates and latency +3. **Create templates** — Reusable LLM nodes for common tasks +4. **Explore integrations** — n8n has 400+ integrations to connect + +--- + +## References + +- [n8n Documentation](https://docs.n8n.io) +- [vLLM OpenAI Compatible Server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) +- [n8n Community Workflows](https://n8n.io/workflows) diff --git a/docs/cookbook/README.md b/docs/cookbook/README.md new file mode 100644 index 000000000..8afd74fa0 --- /dev/null +++ b/docs/cookbook/README.md @@ -0,0 +1,48 @@ +# Local AI Cookbook + +Step-by-step practical recipes for self-hosted AI systems. Each recipe is standalone — pick the one that matches what you're building. + +## Recipes + +| # | Recipe | What You'll Build | GPU Required? | +|---|--------|------------------|---------------| +| 01 | [Voice Agent Setup](01-voice-agent-setup.md) | Whisper STT + vLLM + Kokoro TTS pipeline | Yes | +| 02 | [Document Q&A](02-document-qa-setup.md) | RAG system with Qdrant/ChromaDB + local LLM | Optional | +| 03 | [Code Assistant](03-code-assistant-setup.md) | Tool-calling code agent with file ops | Yes | +| 04 | [Privacy Proxy](04-privacy-proxy-setup.md) | PII-stripping proxy for cloud API calls | No | +| 05 | [Multi-GPU Cluster](05-multi-gpu-cluster.md) | Load-balanced multi-node GPU inference | Yes (2+) | +| 06 | [Swarm Patterns](06-swarm-patterns.md) | Sub-agent parallelization and coordination | Yes | +| 08 | [n8n + Local LLM](08-n8n-local-llm.md) | Workflow automation with local models | Yes | +| — | [Agent Template](agent-template-code.md) | Code specialist agent with debugging protocol | Yes | + +## I Want To... 
+ +| Goal | Start With | +|------|-----------| +| Run a voice assistant locally | [Recipe 01](01-voice-agent-setup.md) | +| Search my documents with AI | [Recipe 02](02-document-qa-setup.md) | +| Build a local code copilot | [Recipe 03](03-code-assistant-setup.md) | +| Use cloud AI without leaking data | [Recipe 04](04-privacy-proxy-setup.md) | +| Scale across multiple GPUs | [Recipe 05](05-multi-gpu-cluster.md) | +| Run multiple agents in parallel | [Recipe 06](06-swarm-patterns.md) | +| Automate workflows with AI | [Recipe 08](08-n8n-local-llm.md) | +| Set up a coding agent from scratch | [Agent Template](agent-template-code.md) | + +## Prerequisites + +All recipes assume you have: +- A Linux machine (Ubuntu 22.04+ recommended) +- Python 3.10+ +- Docker installed + +GPU recipes additionally need: +- NVIDIA GPU with CUDA support +- NVIDIA Container Toolkit +- vLLM installed (see [SETUP.md](../SETUP.md) for base installation) + +## Related Docs + +- [SETUP.md](../SETUP.md) — Base vLLM + OpenClaw installation +- [HARDWARE-GUIDE.md](../research/HARDWARE-GUIDE.md) — GPU buying guide with real benchmarks +- [ARCHITECTURE.md](../ARCHITECTURE.md) — How the tool call proxy works +- [PATTERNS.md](../PATTERNS.md) — Transferable patterns for persistent agents diff --git a/docs/cookbook/agent-template-code.md b/docs/cookbook/agent-template-code.md new file mode 100644 index 000000000..a437a1a78 --- /dev/null +++ b/docs/cookbook/agent-template-code.md @@ -0,0 +1,469 @@ +# Agent Template: Code Specialist + +> **Purpose:** Python development, debugging, and code generation with tool-assisted workflows. +> **Use when:** You need to write, refactor, debug, or review Python code with multi-turn assistance. + +--- + +## Agent Overview + +The **Code Specialist** is a coding-focused agent optimized for Python development. It uses file reading, editing, and execution tools to assist with code writing, debugging, refactoring, and review tasks. 
Designed for local Qwen 2.5 32B deployment with efficient tool calling patterns. + +### Why This Agent? + +| Problem | Solution | +|---------|----------| +| Boilerplate code writing | Generate complete, working implementations | +| Debugging mysteries | Systematic analysis with execution feedback | +| Refactoring fear | Incremental changes with validation | +| Code review gaps | Automated first-pass analysis | +| Documentation drift | Sync docs with code changes | + +### Best Suited For + +- **New feature development** — From spec to working code +- **Bug fixing** — Root cause analysis and patch generation +- **Code modernization** — Refactoring legacy code +- **Test generation** — Unit tests, integration tests +- **Documentation** — Docstrings, READMEs, API docs + +--- + +## Configuration + +### Required Configuration + +```yaml +# .openclaw/agents/code-specialist.yaml +name: code-specialist +model: local-qwen-32b # Optimized for local deployment + +# Core tools +tools: + - read # Read source files + - edit # Modify code + - write # Create new files + - exec # Run tests and scripts + +# Optional context +context: + - pyproject.toml # Project configuration + - README.md # Project overview + - .cursorrules # Coding preferences + - tests/ # Test patterns +``` + +### Local Model Optimization + +```yaml +# For Qwen 2.5 32B local deployment +model_config: + max_tokens: 4096 # Stay within context limits + temperature: 0.3 # Deterministic for code + top_p: 0.9 + + # Tool calling optimized + stop_sequences: + - "```" + - "" +``` + +--- + +## System Prompt + +```markdown +You are an expert Python developer and code specialist. Your purpose is to help write, +debug, refactor, and review Python code. You work methodically, using tools to interact +with the codebase and provide working solutions. + +## Core Principles + +1. **Write working code** — Every suggestion should be runnable +2. **Test incrementally** — Run code frequently to catch errors early +3. 
**Explain your reasoning** — Why this approach, what alternatives considered +4. **Respect existing patterns** — Match the codebase's style and conventions +5. **Be explicit about assumptions** — State what you're assuming about the environment + +## Tool Usage Patterns + +### Reading Code +- Read the full file context before suggesting changes +- Use `read` with offset/limit for large files +- Check related files (imports, tests, configs) + +### Editing Code +- Make minimal, focused changes +- Preserve existing formatting and style +- Edit only what's necessary + +### Running Code +- Use `exec` to validate changes +- Run tests after modifications +- Check exit codes and output + +### Response Format + +For code tasks, structure your response as: + +**Analysis:** +- What needs to be done +- Key considerations +- Potential pitfalls + +**Solution:** +```python +# The actual code +``` + +**Verification:** +- How to test this +- Expected behavior +- Edge cases handled + +## Debugging Protocol + +1. **Reproduce** — Get the exact error message and stack trace +2. **Isolate** — Find the minimal code that triggers the issue +3. **Hypothesize** — Form theories about root cause +4. **Test** — Use exec to verify hypotheses +5. **Fix** — Apply minimal fix +6. **Verify** — Confirm the fix works + +## Code Style Guidelines + +- Follow PEP 8 unless project specifies otherwise +- Use type hints for function signatures +- Write docstrings for public APIs +- Prefer explicit over implicit +- Handle errors gracefully + +## When to Ask for Clarification + +- Ambiguous requirements +- Missing context about the codebase +- Security-sensitive operations +- Breaking changes to APIs +``` + +--- + +## Example Conversations + +### Example 1: Writing a New Function + +**User:** Write a function to parse CSV files with validation. + +**Agent:** I'll create a robust CSV parser with validation. Let me first check if there's existing code to follow. 
+ + + +src/utils.py + + + +[Reads existing utilities to match patterns] + +**Agent:** Now I'll write the CSV parser following the project's patterns: + +```python +# src/parsers/csv_parser.py +"""CSV parsing with validation.""" + +import csv +from pathlib import Path +from typing import Iterator, Callable + + +class CSVValidationError(Exception): + """Raised when CSV validation fails.""" + pass + + +def parse_csv_with_validation( + file_path: Path, + validator: Callable[[dict], bool], + required_columns: list[str] | None = None +) -> Iterator[dict]: + """ + Parse CSV file with row-by-row validation. + + Args: + file_path: Path to CSV file + validator: Function that returns True for valid rows + required_columns: Columns that must exist in header + + Yields: + Validated rows as dictionaries + + Raises: + CSVValidationError: If header or row validation fails + FileNotFoundError: If file doesn't exist + """ + if not file_path.exists(): + raise FileNotFoundError(f"CSV file not found: {file_path}") + + with open(file_path, 'r', newline='', encoding='utf-8') as f: + reader = csv.DictReader(f) + + # Validate header + if required_columns: + missing = set(required_columns) - set(reader.fieldnames or []) + if missing: + raise CSVValidationError( + f"Missing required columns: {missing}" + ) + + # Parse and validate rows + for row_num, row in enumerate(reader, start=2): + if not validator(row): + raise CSVValidationError( + f"Row {row_num} failed validation: {row}" + ) + yield row +``` + +**Verification:** + + + +python -c " +from src.parsers.csv_parser import parse_csv_with_validation +import tempfile +import os + +# Create test CSV +with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f: + f.write('name,age,city\n') + f.write('Alice,30,NYC\n') + f.write('Bob,25,LA\n') + temp_path = f.name + +try: + # Test with validator + def validate_age(row): + return int(row['age']) > 18 + + rows = list(parse_csv_with_validation(temp_path, validate_age, ['name', 
'age'])) + print(f'Parsed {len(rows)} valid rows') + for row in rows: + print(f' {row}') +finally: + os.unlink(temp_path) +" + + + +Output: +``` +Parsed 2 valid rows + {'name': 'Alice', 'age': '30', 'city': 'NYC'} + {'name': 'Bob', 'age': '25', 'city': 'LA'} +``` + +The function works correctly with validation. + +--- + +### Example 2: Debugging an Error + +**User:** I'm getting `KeyError: 'user_id'` in my API handler. + +**Agent:** Let's debug this systematically. First, let me see the code and the full error. + + + +src/api/handlers.py + + + +[Reads the handler code] + +**Agent:** I see the issue. The `user_id` is being accessed directly without checking if the key exists: + +```python +# Problematic line +user_id = request.json['user_id'] # KeyError if missing +``` + +**Fix:** Use `.get()` with validation: + +```python +# Fixed version +user_id = request.json.get('user_id') +if not user_id: + return jsonify({'error': 'user_id required'}), 400 +``` + + + +src/api/handlers.py + user_id = request.json['user_id'] + user_id = request.json.get('user_id') + if not user_id: + return jsonify({'error': 'user_id required'}), 400 + + + +**Verification:** + + + +python -m pytest tests/test_api.py::test_handler_missing_user_id -v + + + +Test passes — error handling now works correctly. + +--- + +### Example 3: Refactoring Legacy Code + +**User:** Refactor this deeply nested function into something cleaner. + +**Agent:** Let me read the code first to understand the structure. 
+ + + +src/processing.py + + + +[Reads deeply nested function] + +**Agent:** I'll refactor using early returns and helper functions: + +```python +# Before: Deep nesting +# After: Flat structure with guard clauses + +def process_order(order_id: str) -> dict: + """Process an order with validation.""" + order = fetch_order(order_id) + if not order: + return {'error': 'Order not found'} + + if order['status'] != 'pending': + return {'error': f"Order is {order['status']}"} + + if not validate_inventory(order['items']): + return {'error': 'Insufficient inventory'} + + return execute_order(order) + + +def fetch_order(order_id: str) -> dict | None: + """Fetch order from database.""" + # Implementation... + + +def validate_inventory(items: list) -> bool: + """Check if all items are available.""" + # Implementation... + + +def execute_order(order: dict) -> dict: + """Execute the validated order.""" + # Implementation... +``` + + + +python -m pytest tests/test_processing.py -v + + + +All tests pass — refactoring successful. + +--- + +## Best Practices for Local Models + +### Prompt Engineering + +1. **Be explicit about output format** — Local models need clear structure +2. **Use examples liberally** — Show, don't just tell +3. **Keep context manageable** — Read relevant files, not entire codebase +4. 
**Break complex tasks into steps** — Multi-turn over single massive response + +### Tool Calling Optimization + +```python +# Good: Clear tool sequence +read(file) → edit(file) → exec(test) + +# Avoid: Ambiguous operations +"Fix the bugs" (too vague) +``` + +### Context Management + +- Read files before editing +- Verify changes with exec +- Document assumptions +- Handle errors gracefully + +### Response Length + +- Keep responses under 2000 tokens when possible +- Use continuation for long outputs +- Summarize when appropriate +- Show key parts, reference rest + +--- + +## Integration Examples + +### VS Code Extension + +```json +{ + "name": "Code Specialist", + "command": "openclaw agent run code-specialist", + "keybinding": "ctrl+shift+c" +} +``` + +### Git Hook + +```bash +#!/bin/bash +# .git/hooks/pre-commit +openclaw agent run code-specialist --task "review-staged-changes" +``` + +### CI/CD Pipeline + +```yaml +# .github/workflows/code-review.yml +- name: AI Code Review + run: openclaw agent run code-specialist --pr ${{ github.event.pull_request.number }} +``` + +--- + +## Troubleshooting + +### Common Issues + +| Issue | Solution | +|-------|----------| +| Model generates incorrect code | Add more examples, be more explicit | +| Tool calls fail | Check paths, verify file existence | +| Responses too verbose | Request concise output | +| Context overflow | Read smaller chunks, summarize | + +### Performance Tips + +1. **Warm up the model** — Run a simple query first +2. **Batch similar operations** — Group related edits +3. **Cache file reads** — Don't re-read unchanged files +4. 
**Use explicit stop sequences** — Prevent runaway generation + +--- + +## Version History + +| Version | Date | Changes | +|---------|------|---------| +| 1.0.0 | 2026-02-12 | Initial template | diff --git a/docs/research/GPU-TTS-BENCHMARK.md b/docs/research/GPU-TTS-BENCHMARK.md new file mode 100644 index 000000000..656a1e3e6 --- /dev/null +++ b/docs/research/GPU-TTS-BENCHMARK.md @@ -0,0 +1,104 @@ +# GPU TTS Benchmark Results + +**Date:** 2026-02-10 +**Tested on:** Local infrastructure +**Hardware:** RTX PRO 6000 Blackwell (96GB VRAM) + +> **Note:** Single test run — use as baseline guidance, not statistical proof. + +## Summary + +Upgraded Kokoro TTS from CPU to GPU (v0.2.4-master with PyTorch 2.8 for RTX 50 series support). + +**Result:** 3x single-request speedup, ~50-100% capacity increase for voice pipeline. + +## Test Configuration + +- **Old:** `ghcr.io/remsky/kokoro-fastapi-cpu:latest` (7 months old) +- **New:** `ghcr.io/remsky/kokoro-fastapi-gpu:v0.2.4-master` (CUDA 12.9, PyTorch 2.8) +- **VRAM after upgrade:** 91/98GB (93% - tight but stable) + +## Single Request Latency + +| Component | CPU TTS | GPU TTS | Improvement | +|-----------|---------|---------|-------------| +| TTS only | 228ms | 77ms | **3x faster** | + +## Concurrent TTS Scaling + +| Concurrent | CPU Batch* | GPU Batch | Per-Request (GPU) | +|------------|-----------|-----------|-------------------| +| 5 | ~1200ms | 410ms | 82ms | +| 10 | ~2500ms | 790ms | 79ms | +| 20 | ~5000ms | 1640ms | 82ms | +| 50 | ~12s | 4200ms | 84ms | + +*CPU estimates extrapolated from previous stress test degradation pattern + +**Key Finding:** GPU TTS maintains ~80ms/request regardless of concurrency. CPU TTS degrades linearly. 
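The scaling numbers above come from firing batches of concurrent requests and timing each one. A minimal thread-pool harness for reproducing that kind of measurement is sketched below — the stub `fake_tts_call` stands in for a real HTTP request to your synthesis endpoint (the endpoint reference in the comment is an assumption to replace):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bench_concurrent(call_fn, concurrency: int) -> dict:
    """Fire `concurrency` calls at once; report batch time and mean per-request time."""
    def timed_call(_):
        start = time.perf_counter()
        call_fn()
        return (time.perf_counter() - start) * 1000  # ms

    batch_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(concurrency)))
    batch_ms = (time.perf_counter() - batch_start) * 1000
    return {
        "batch_ms": round(batch_ms),
        "per_request_ms": round(sum(latencies) / len(latencies)),
    }

# Stub standing in for one TTS request; swap in a real POST to your
# synthesis server (e.g. Kokoro on :8880) to benchmark for real.
def fake_tts_call():
    time.sleep(0.08)  # pretend synthesis takes ~80ms

for n in (5, 10, 20):
    print(n, bench_concurrent(fake_tts_call, n))
```

If per-request time stays flat as concurrency rises (as with GPU TTS above), the backend is parallelizing well; per-request time growing with batch size indicates queuing.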
+ +## Full Voice Pipeline (STT→LLM→TTS) + +Test: Simulated voice call with LLM response (~80 tokens) + TTS synthesis + +| Concurrent Calls | Total Batch Time | Per-Call Latency | +|------------------|------------------|------------------| +| 5 | 1117ms | 688-1114ms | +| 10 | 1584ms | 684-1581ms | +| 20 | 2545ms | 712-2542ms | + +### Component Breakdown at 20 Concurrent + +- **LLM:** 555-983ms (scales well, vLLM batching works) +- **TTS:** 154-1669ms (starts queuing after ~10 concurrent) + +## Capacity Estimate + +**Target:** <2s end-to-end latency (acceptable for voice) + +| Configuration | Concurrent Calls | Notes | +|---------------|------------------|-------| +| Single GPU (CPU TTS) | 10-15 | TTS bottleneck | +| Single GPU (GPU TTS) | 15-20 | LLM becomes bottleneck | +| Dual GPU cluster | 30-40 | With load balancing | + +**Improvement:** GPU TTS increases practical capacity by **50-100%** + +## VRAM Impact + +``` +Before (CPU TTS): ~89GB used +After (GPU TTS): ~91GB used (+2GB for Kokoro model) +``` + +Still within 98GB envelope. No memory pressure observed. + +## Deployment Notes + +```bash +# Start GPU Kokoro +docker run -d --gpus all --name kokoro-tts-gpu \ + -p 8880:8880 \ + -e USE_GPU=true \ + --restart unless-stopped \ + ghcr.io/remsky/kokoro-fastapi-gpu:v0.2.4-master +``` + +Requires: +- NVIDIA driver with CUDA 12.9+ support +- `--gpus all` flag +- RTX 50 series: needs v0.2.4+ (PyTorch 2.8 support) + +## Conclusion + +GPU TTS is a clear win: +- 3x faster single request +- Near-linear scaling under load +- Minimal VRAM overhead +- Increases voice call capacity by 50-100% + +Bottleneck shifts from TTS to LLM at high concurrency. For >20 concurrent calls, would need second LLM instance or smaller model. 
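The VRAM figures above were read from `nvidia-smi`; a small parser makes it easy to log headroom over time. A sketch using the standard `--query-gpu` CSV flags (the sample values are illustrative, not measurements from this benchmark):

```python
def gpu_memory_report(smi_csv: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` output (MiB values) into GB summaries."""
    gpus = []
    for line in smi_csv.strip().splitlines():
        used_mib, total_mib = (int(x) for x in line.split(","))
        gpus.append({
            "used_gb": round(used_mib / 1024, 1),
            "total_gb": round(total_mib / 1024, 1),
            "pct": round(100 * used_mib / total_mib),
        })
    return gpus

# Live query (needs an NVIDIA driver present):
# import subprocess
# csv = subprocess.check_output(
#     ["nvidia-smi", "--query-gpu=memory.used,memory.total",
#      "--format=csv,noheader,nounits"], text=True)
# print(gpu_memory_report(csv))

# Against a canned sample line (93184 MiB used of 98304 MiB):
print(gpu_memory_report("93184, 98304"))  # → [{'used_gb': 91.0, 'total_gb': 96.0, 'pct': 95}]
```

Note that `nvidia-smi` reports MiB, so a 96GB card shows a total of 98304 MiB — worth remembering when comparing against nominal card capacity.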
+ +--- + +*Benchmark scripts can be adapted from standard TTS and pipeline stress-test patterns for your environment.* diff --git a/docs/research/HARDWARE-GUIDE.md b/docs/research/HARDWARE-GUIDE.md new file mode 100644 index 000000000..55a7d786b --- /dev/null +++ b/docs/research/HARDWARE-GUIDE.md @@ -0,0 +1,252 @@ +# Dream Server Hardware Guide + +*Last updated: 2026-02-09* + +> **Note:** Prices as of February 2026. + +What to buy for local AI at different budgets. + +--- + +## TL;DR Recommendations + +| Tier | GPU | RAM | What You Get | +|------|-----|-----|--------------| +| Starter ($800-1,200) | RTX 3060 12GB | 32GB | 7B-14B models, basic chat | +| Professional ($2,000-3,000) | RTX 4070 Ti Super 16GB | 64GB | 32B models, voice, 5-8 users | +| Business ($4,000-6,000) | RTX 4090 24GB | 128GB | 70B models, 10-20 users | +| Enterprise ($12,000-18,000) | 2x RTX 4090 | 256GB | 40+ concurrent users | + +--- + +## Tier 1: Starter ($800-1,200) + +**Goal:** Get started with local AI, personal use + +### Recommended Build +- **GPU:** RTX 3060 12GB (used: $200-250) +- **CPU:** Any modern 6+ core (i5-12400, Ryzen 5 5600) +- **RAM:** 32GB DDR4 +- **Storage:** 500GB NVMe SSD +- **PSU:** 550W 80+ Bronze + +### What Runs +- 7B-14B models (Qwen2.5-7B, Llama-3-8B) +- Basic voice (Whisper small/medium) +- Single user, personal projects +- Slow with complex prompts (~30 tok/s) + +### Buy Used +Look for: +- Dell Precision/HP Z workstations with RTX 3060 +- Avoid: GTX cards (no FP16), AMD (CUDA issues) + +--- + +## Tier 2: Professional ($2,000-3,000) + +**Goal:** Serious local AI, small team use + +### Recommended Build +- **GPU:** RTX 4070 Ti Super 16GB ($800) or RTX 4080 16GB ($1000) +- **CPU:** i7-13700 or Ryzen 7 7700X +- **RAM:** 64GB DDR5 +- **Storage:** 1TB NVMe Gen4 +- **PSU:** 750W 80+ Gold + +### What Runs +- 32B AWQ quantized models (Qwen2.5-32B-AWQ) +- Full voice pipeline (Whisper medium + Piper) +- 5-8 concurrent users +- ~50-60 tok/s generation + +### Best Value +RTX 
4070 Ti Super at $800 is the sweet spot for:
+- 16GB VRAM (critical for 32B models)
+- Good efficiency (200W TDP)
+- Newer Ada architecture for future-proofing
+
+---
+
+## Tier 3: Business ($4,000-6,000)
+
+**Goal:** Production workloads, growing business
+
+### Recommended Build
+- **GPU:** RTX 4090 24GB ($1800-2000)
+- **CPU:** i9-14900K or Ryzen 9 7950X
+- **RAM:** 128GB DDR5
+- **Storage:** 2TB NVMe Gen4
+- **PSU:** 1000W 80+ Platinum
+- **Cooling:** AIO or custom loop (4090 runs hot)
+
+### What Runs
+- 70B-class models (Llama-3-70B, Qwen2.5-72B) at aggressive 2-3 bit quantization; 4-bit AWQ weights alone run ~40GB, beyond 24GB
+- Multiple models simultaneously
+- 10-15 concurrent users
+- Full RAG + embeddings + voice
+
+### Alternative: Dual 4070 Ti
+Two RTX 4070 Ti Super (32GB total) can be better than one 4090 for:
+- Running separate specialized models
+- Redundancy
+- But: More complex setup, higher power
+
+---
+
+## Tier 4: Enterprise ($12,000-18,000)
+
+**Goal:** Full production, organization-wide
+
+### Option A: Dual RTX 4090
+- 2x RTX 4090 (48GB VRAM total)
+- Requires: PCIe bifurcation, 1500W+ PSU
+- Good for: Separate model instances
+
+### Option B: RTX 6000 Ada (48GB)
+- Single GPU, 48GB VRAM
+- Runs: 70B at 4-bit AWQ fully in VRAM (70B at FP16 needs ~140GB, far beyond any single card)
+- Pro: Simpler than dual-GPU
+- Con: $6000+
+
+### Option C: Dual RTX PRO 6000 Blackwell (Our Production Setup)
+
+> **Note:** The dual PRO 6000 configuration is our production setup and represents a high-end reference point, not a standard recommendation. Results below reflect this specific hardware.
+ +- 2x 96GB VRAM (192GB total) +- Runs: Multiple 70B models, 40+ users +- Cost: ~$15-20k total build + +### Capacity (Real-World Numbers) +From benchmarks on our production dual PRO 6000 setup: + +| Use Case | Per GPU | Both GPUs | +|----------|---------|-----------| +| Voice agents (<2s) | 10-20 | 20-40 | +| Interactive chat (<5s) | ~50 | ~100 | +| Batch processing | 100+ | 200+ | + +--- + +## Best Value Picks + +Based on price/performance analysis: + +### Hidden Gem: Used RTX 3090 ($700-900) +At used prices, the RTX 3090 offers: +- 24GB VRAM (same as 4090!) +- 936 GB/s bandwidth (better than new 4080 SUPER) +- Runs 32B+ models that 16GB cards can't +- ~75% of 4090 performance at ~50% cost + +**Trade-off:** Higher power (350W), older architecture + +### Memory Bandwidth Insight +Token generation is **memory-bound**, not compute-bound. This is why: +- RTX 3080 Ti (912 GB/s) matches newer cards in inference +- Used high-bandwidth cards punch above their weight + +### Quick Value Table + +| Budget | Best Pick | Why | +|--------|-----------|-----| +| $250 | Used RTX 3060 12GB | Entry, can run 7B-14B | +| $500 | Used RTX 3080 Ti 12GB | Great bandwidth for price | +| $700-900 | **Used RTX 3090** | **Best overall value** | +| $800 | New RTX 4070 Ti SUPER | Best new 16GB card | +| $1,600 | RTX 4090 | Maximum single-GPU | + +--- + +## Key Specs Explained + +### VRAM (Most Important) +VRAM determines what models fit. 
Rough guide:
+
+| VRAM | Max Model (AWQ 4-bit) |
+|------|----------------------|
+| 8GB | 7B |
+| 12GB | 14B |
+| 16GB | 32B |
+| 24GB | 32B (70B only at 2-3 bit) |
+| 48GB | 70B (4-bit) or 2x 32B |
+
+### Memory Bandwidth
+Faster bandwidth = faster inference
+
+| GPU | Bandwidth | Relative Speed |
+|-----|-----------|----------------|
+| RTX 3060 | 360 GB/s | 1.0x |
+| RTX 4070 Ti | 504 GB/s | 1.4x |
+| RTX 4090 | 1008 GB/s | 2.8x |
+| PRO 6000 | 1792 GB/s | 5.0x |
+
+### System RAM
+Rule: 2x your model size minimum
+
+| Model | Min RAM | Recommended |
+|-------|---------|-------------|
+| 7B | 16GB | 32GB |
+| 32B | 32GB | 64GB |
+| 70B | 64GB | 128GB |
+
+---
+
+## What NOT to Buy
+
+- **GTX 16xx/10xx** — No FP16 tensor cores
+- **AMD GPUs** — CUDA issues, ROCm limited
+- **Intel Arc** — Driver problems, limited support
+- **Cloud GPUs (H100/A100)** — Data-center pricing; rent, don't buy
+- **8GB cards** — Too limited for serious use
+
+---
+
+## Where to Buy
+
+### New
+- Newegg, Amazon, Micro Center
+- EVGA B-Stock (refurbished)
+- Manufacturer direct (MSI, ASUS)
+
+### Used
+- eBay (check seller ratings)
+- r/hardwareswap
+- Facebook Marketplace (local pickup)
+- Mining cards: Usually fine, verify fans work
+
+---
+
+## Power Considerations
+
+| GPU | TDP | PSU Needed |
+|-----|-----|------------|
+| RTX 3060 | 170W | 550W |
+| RTX 4070 Ti | 285W | 700W |
+| RTX 4090 | 450W | 1000W |
+| Dual 4090 | 900W | 1500W |
+
+Add 150-200W for CPU + system overhead.
+
+---
+
+## Cooling
+
+- **Single GPU:** Good case airflow is enough
+- **RTX 4090:** AIO or very good air cooling (450W TDP)
+- **Dual GPU:** Custom loop or enterprise chassis
+
+---
+
+## Summary
+
+1. **Starter:** RTX 3060 12GB — personal use, getting started
+2. **Professional:** RTX 4070 Ti Super 16GB — serious work, small teams
+3. **Business:** RTX 4090 24GB — production workloads, 10-20 users
+4. **Enterprise:** Dual 4090 — organization-wide, 40+ users
+
+**VRAM is king.** Buy the most VRAM you can afford.
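The VRAM ladder above is mostly parameter-count arithmetic. A back-of-envelope helper — weights only; KV cache and runtime overhead add several GB on top, so treat the results as lower bounds:

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Approximate VRAM for model weights alone: params x (bits / 8) bytes each."""
    return round(params_billion * bits / 8, 1)

# Weights-only lower bounds:
print(weight_vram_gb(7, 4))    # → 3.5   (7B at 4-bit: easy fit on 8GB)
print(weight_vram_gb(32, 4))   # → 16.0  (32B at 4-bit: why 16GB cards are tight)
print(weight_vram_gb(70, 4))   # → 35.0  (70B at 4-bit wants a 48GB card)
print(weight_vram_gb(70, 16))  # → 140.0 (70B at FP16 is multi-GPU territory)
```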
+
+---
+
+*Based on real-world multi-GPU testing*
diff --git a/docs/research/OSS-MODEL-LANDSCAPE-2026-02.md b/docs/research/OSS-MODEL-LANDSCAPE-2026-02.md
new file mode 100644
index 000000000..370a91778
--- /dev/null
+++ b/docs/research/OSS-MODEL-LANDSCAPE-2026-02.md
@@ -0,0 +1,68 @@
+# OSS LLM Landscape — February 2026
+
+*Research: Open Source > Closed Systems*
+
+## Latest Releases
+
+### Qwen3 Series (Released April 2025)
+- **Flagship:** Qwen3-235B-A22B (MoE architecture: 235B total parameters, ~22B active per token)
+- **Capabilities:**
+  - 119 languages supported
+  - 92.3% accuracy on AIME25 (math benchmark)
+  - 74.1% on LiveCodeBench v6 (real-world coding)
+- **Key variants:**
+  - Qwen3-Coder-30B-A3B-Instruct — coding specialist
+  - Qwen3-30B-A3B-Thinking-2507 — reasoning focus
+  - Qwen3-VL — multimodal (vision+language)
+- **MoE Benefits:** Sparse activation means ~22B active params from 235B total
+
+### New Players to Watch
+
+| Model | Size | Specialty | Notes |
+|-------|------|-----------|-------|
+| **GLM-4.5-Air** | ? | Agent workflows | Top-rated for tool calling |
+| **MiMo-V2-Flash** | ~100B? | Software engineering | Beats DeepSeek-V3.2 |
+| **GPT-OSS 20B** | 20B | Reasoning, tool use | Runs on consumer GPUs |
+| **DeepSeek-V3.2** | 236B | General | Strong baseline |
+
+## Best Models for Agent/Tool Calling (2026)
+
+Based on [SiliconFlow benchmarks](https://www.siliconflow.com/benchmarks):
+
+1. **GLM-4.5-Air** — Best overall for agent workflows
+2. **Qwen3-Coder-30B-A3B-Instruct** — Best for code-heavy agents
+3. **Qwen3-30B-A3B-Thinking** — Best for reasoning chains
+4.
**GPT-OSS 20B** — Best for consumer hardware + +## What We're Running + +Our testing used **Qwen2.5-Coder-32B-Instruct-AWQ**: +- Works on dual RTX 4090 (via vLLM) +- Good for sub-agent tasks +- ~50-60% success rate on autonomous agent tasks +- Loop bug with tool call format (emits JSON in text) + +**Upgrade candidates:** +- Qwen3-Coder-30B-A3B — when AWQ quantization available +- GLM-4.5-Air — if we get hardware for larger model + +## Comparison: Qwen2.5 vs Qwen3 + +| Feature | Qwen2.5 (Current) | Qwen3 | +|---------|-------------------|-------| +| Languages | ~30 | 119 | +| Math (AIME25) | ~70% | 92.3% | +| Code (LiveCodeBench) | ~60% | 74.1% | +| MoE | No (dense) | Yes (sparse) | +| Context | 32K | 128K+ | + +## Recommendations (Fully Local) + +1. **Current setup works** — Qwen2.5-32B is sufficient for sub-agents +2. **Upgrade path:** Wait for Qwen3-30B AWQ quantizations +3. **For tool calling:** Consider GLM-4.5-Air when feasible +4. **Stay with vLLM** — Best throughput for our hardware + +--- + +*Sources: llm-stats.com, Qwen blog, [SiliconFlow](https://www.siliconflow.com), BentoML* diff --git a/docs/research/README.md b/docs/research/README.md new file mode 100644 index 000000000..5f4b82948 --- /dev/null +++ b/docs/research/README.md @@ -0,0 +1,17 @@ +# Research + +Technical research and benchmarks from running persistent LLM agents on local hardware. These are real-world findings, not theoretical estimates. 
+ +## Documents + +| Document | What It Covers | +|----------|---------------| +| [HARDWARE-GUIDE.md](HARDWARE-GUIDE.md) | GPU buying guide — tiers, prices, what NOT to buy, used market analysis | +| [GPU-TTS-BENCHMARK.md](GPU-TTS-BENCHMARK.md) | Text-to-speech latency benchmarks (GPU vs CPU, concurrency scaling) | +| [OSS-MODEL-LANDSCAPE-2026-02.md](OSS-MODEL-LANDSCAPE-2026-02.md) | Open-source model comparison — Qwen, Llama, tool-calling success rates | + +## How These Relate to the Rest + +- **Building a cluster?** Start with [HARDWARE-GUIDE.md](HARDWARE-GUIDE.md), then follow [cookbook/05-multi-gpu-cluster.md](../cookbook/05-multi-gpu-cluster.md) +- **Setting up voice agents?** Check [GPU-TTS-BENCHMARK.md](GPU-TTS-BENCHMARK.md) for latency expectations, then [cookbook/01-voice-agent-setup.md](../cookbook/01-voice-agent-setup.md) +- **Choosing a model?** Read [OSS-MODEL-LANDSCAPE-2026-02.md](OSS-MODEL-LANDSCAPE-2026-02.md), then see [SETUP.md](../SETUP.md) for deployment diff --git a/guardian/README.md b/guardian/README.md index f1650b21e..01cc7fd07 100644 --- a/guardian/README.md +++ b/guardian/README.md @@ -64,8 +64,8 @@ sudo dnf install python3 curl iproute e2fsprogs psmisc ```bash # Clone (or copy the guardian/ directory) -git clone https://github.com/Light-Heart-Labs/Android-Framework.git -cd Android-Framework/guardian +git clone https://github.com/Light-Heart-Labs/Lighthouse-AI.git +cd Lighthouse-AI/guardian # Copy and edit the config cp guardian.conf.example guardian.conf diff --git a/install.ps1 b/install.ps1 index 7482fbbe7..0a222f243 100644 --- a/install.ps1 +++ b/install.ps1 @@ -1,6 +1,6 @@ # ═══════════════════════════════════════════════════════════════ -# Android Framework - Windows Installer -# https://github.com/Light-Heart-Labs/Android-Framework +# Lighthouse AI - Windows Installer +# https://github.com/Light-Heart-Labs/Lighthouse-AI # # Usage: # .\install.ps1 # Interactive install @@ -34,7 +34,7 @@ function Err($msg) { Write-Host "[FAIL] $msg" 
-ForegroundColor Red } # ── Banner ───────────────────────────────────────────────────── Write-Host "" Write-Host "===========================================================" -ForegroundColor Cyan -Write-Host " Android Framework - Windows Installer" -ForegroundColor Cyan +Write-Host " Lighthouse AI - Windows Installer" -ForegroundColor Cyan Write-Host "===========================================================" -ForegroundColor Cyan Write-Host "" @@ -145,7 +145,7 @@ $TokenSpyTaskName = "OpenClawTokenSpy" # ── Uninstall ────────────────────────────────────────────────── if ($Uninstall) { - Info "Uninstalling Android Framework..." + Info "Uninstalling Lighthouse AI..." # Remove scheduled task if (Get-ScheduledTask -TaskName $CleanupTaskName -ErrorAction SilentlyContinue) { @@ -209,7 +209,7 @@ if (-not $ProxyOnly -and -not $TokenSpyOnly) { $CleanupScript = Join-Path $OpenClawDir "session-cleanup.ps1" $cleanupContent = @" -# Android Framework - Session Cleanup (Windows) +# Lighthouse AI - Session Cleanup (Windows) # Auto-generated by install.ps1 `$SessionsDir = "$SessionsDir" @@ -297,7 +297,7 @@ Write-Output "[`$(Get-Date)] Cleanup complete: removed `$removedInactive inactiv $settings = New-ScheduledTaskSettingsSet -AllowStartIfOnBatteries -DontStopIfGoingOnBatteries -StartWhenAvailable $principal = New-ScheduledTaskPrincipal -UserId $env:USERNAME -LogonType S4U -RunLevel Limited - Register-ScheduledTask -TaskName $CleanupTaskName -Action $action -Trigger $trigger -Settings $settings -Principal $principal -Description "Android Framework - Cleanup every ${IntervalMinutes}min" | Out-Null + Register-ScheduledTask -TaskName $CleanupTaskName -Action $action -Trigger $trigger -Settings $settings -Principal $principal -Description "Lighthouse AI - Cleanup every ${IntervalMinutes}min" | Out-Null Ok "Scheduled task created: $CleanupTaskName (every ${IntervalMinutes}min)" } @@ -390,7 +390,7 @@ SESSION_CHAR_LIMIT=$TsSessionCharLimit $tsTrigger = New-ScheduledTaskTrigger 
-AtLogOn
    $tsSettings = New-ScheduledTaskSettingsSet -AllowStartIfOnBatteries -DontStopIfGoingOnBatteries -StartWhenAvailable -ExecutionTimeLimit (New-TimeSpan -Days 365)
-    Register-ScheduledTask -TaskName $TokenSpyTaskName -Action $tsAction -Trigger $tsTrigger -Settings $tsSettings -Description "Android Framework - Token Spy on :$TsPort" | Out-Null
+    Register-ScheduledTask -TaskName $TokenSpyTaskName -Action $tsAction -Trigger $tsTrigger -Settings $tsSettings -Description "Lighthouse AI - Token Spy on :$TsPort" | Out-Null
    Ok "Scheduled task created: $TokenSpyTaskName (starts at logon)"

    # Start it now
diff --git a/install.sh b/install.sh
index 2504eadc4..ac4aef810 100644
--- a/install.sh
+++ b/install.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # ═══════════════════════════════════════════════════════════════
-# Android Framework - Installer
-# https://github.com/Light-Heart-Labs/Android-Framework
+# Lighthouse AI - Installer
+# https://github.com/Light-Heart-Labs/Lighthouse-AI
 #
 # Usage:
 #   ./install.sh                      # Interactive install
@@ -9,6 +9,7 @@
 #   ./install.sh --cleanup-only       # Only install session cleanup
 #   ./install.sh --proxy-only         # Only install tool proxy
 #   ./install.sh --token-spy-only     # Only install Token Spy API monitor
+#   ./install.sh --cold-storage-only  # Only install LLM Cold Storage timer
 #   ./install.sh --uninstall          # Remove everything
 # ═══════════════════════════════════════════════════════════════
 
@@ -19,6 +20,7 @@ CONFIG_FILE="$SCRIPT_DIR/config.yaml"
 CLEANUP_ONLY=false
 PROXY_ONLY=false
 TOKEN_SPY_ONLY=false
+COLD_STORAGE_ONLY=false
 UNINSTALL=false
 
 # ── Colors ─────────────────────────────────────────────────────
@@ -41,6 +43,7 @@ while [[ $# -gt 0 ]]; do
        --cleanup-only) CLEANUP_ONLY=true; shift ;;
        --proxy-only) PROXY_ONLY=true; shift ;;
        --token-spy-only) TOKEN_SPY_ONLY=true; shift ;;
+       --cold-storage-only) COLD_STORAGE_ONLY=true; shift ;;
        --uninstall) UNINSTALL=true; shift ;;
        -h|--help)
            echo "Usage: ./install.sh [options]"
@@ -50,6 +53,7 @@ while [[ $# -gt 0 ]]; do
            echo "  --cleanup-only       Only install session cleanup"
            echo "  --proxy-only         Only install vLLM tool proxy"
            echo "  --token-spy-only     Only install Token Spy API monitor"
+           echo "  --cold-storage-only  Only install LLM Cold Storage timer"
            echo "  --uninstall          Remove all installed components"
            echo "  -h, --help           Show this help"
            exit 0
@@ -61,7 +65,7 @@ done
 
 # ── Banner ─────────────────────────────────────────────────────
 echo ""
 echo -e "${CYAN}═══════════════════════════════════════════════════════════${NC}"
-echo -e "${CYAN}  Android Framework - Installer${NC}"
+echo -e "${CYAN}  Lighthouse AI - Installer${NC}"
 echo -e "${CYAN}═══════════════════════════════════════════════════════════${NC}"
 echo ""
 
@@ -136,6 +140,14 @@ TS_SESSION_CHAR_LIMIT=$(parse_yaml "token_spy.session_char_limit" "200000")
 TS_AGENT_SESSION_DIRS=$(parse_yaml "token_spy.agent_session_dirs" "")
 TS_LOCAL_MODEL_AGENTS=$(parse_yaml "token_spy.local_model_agents" "")
 
+# LLM Cold Storage settings
+CS_ENABLED=$(parse_yaml "llm_cold_storage.enabled" "false")
+CS_HF_CACHE=$(parse_yaml "llm_cold_storage.hf_cache_dir" "~/.cache/huggingface/hub")
+CS_HF_CACHE="${CS_HF_CACHE/#\~/$HOME}"
+CS_COLD_DIR=$(parse_yaml "llm_cold_storage.cold_dir" "~/llm-cold-storage")
+CS_COLD_DIR="${CS_COLD_DIR/#\~/$HOME}"
+CS_MAX_IDLE_DAYS=$(parse_yaml "llm_cold_storage.max_idle_days" "7")
+
 # System user
 SYSTEM_USER=$(parse_yaml "system_user" "")
 if [ -z "$SYSTEM_USER" ]; then
@@ -157,11 +169,14 @@ fi
 if [ "$CLEANUP_ONLY" = false ] && [ "$PROXY_ONLY" = false ]; then
     info "  Token Spy: $([ "$TS_ENABLED" = "true" ] && echo "enabled on :$TS_PORT ($TS_AGENT_NAME)" || echo "disabled")"
 fi
+if [ "$COLD_STORAGE_ONLY" = true ] || ([ "$CLEANUP_ONLY" = false ] && [ "$PROXY_ONLY" = false ] && [ "$TOKEN_SPY_ONLY" = false ]); then
+    info "  Cold Storage: $([ "$CS_ENABLED" = "true" ] && echo "enabled (idle >${CS_MAX_IDLE_DAYS}d → $CS_COLD_DIR)" || echo "disabled")"
+fi
 echo ""
 
 # ── Uninstall ──────────────────────────────────────────────────
 if [ "$UNINSTALL" = true ]; then
-    info "Uninstalling Android Framework..."
+    info "Uninstalling Lighthouse AI..."
 
     if systemctl is-active --quiet openclaw-session-cleanup.timer 2>/dev/null; then
         sudo systemctl stop openclaw-session-cleanup.timer
@@ -186,6 +201,16 @@ if [ "$UNINSTALL" = true ]; then
     done
     sudo rm -f /etc/systemd/system/token-spy@.service
 
+    # LLM Cold Storage
+    if systemctl --user is-active --quiet llm-cold-storage.timer 2>/dev/null; then
+        systemctl --user stop llm-cold-storage.timer
+        systemctl --user disable llm-cold-storage.timer
+        ok "Stopped cold storage timer"
+    fi
+    rm -f "$HOME/.config/systemd/user/llm-cold-storage.service"
+    rm -f "$HOME/.config/systemd/user/llm-cold-storage.timer"
+    systemctl --user daemon-reload 2>/dev/null || true
+
     sudo systemctl daemon-reload
 
     rm -f "$OPENCLAW_DIR/session-cleanup.sh"
@@ -196,20 +221,24 @@ fi
 # ── Preflight checks ──────────────────────────────────────────
 info "Running preflight checks..."
 
-# Check for OpenClaw
-if [ ! -d "$OPENCLAW_DIR" ]; then
-    err "OpenClaw directory not found: $OPENCLAW_DIR"
-    err "Is OpenClaw installed? Edit openclaw_dir in config.yaml"
-    exit 1
+# Check for OpenClaw (not needed for cold-storage-only)
+if [ "$COLD_STORAGE_ONLY" = false ]; then
+    if [ ! -d "$OPENCLAW_DIR" ]; then
+        err "OpenClaw directory not found: $OPENCLAW_DIR"
+        err "Is OpenClaw installed? Edit openclaw_dir in config.yaml"
+        exit 1
+    fi
+    ok "OpenClaw directory found: $OPENCLAW_DIR"
 fi
-ok "OpenClaw directory found: $OPENCLAW_DIR"
 
-# Check for python3
-if ! command -v python3 &>/dev/null; then
-    err "python3 not found. Install Python 3 first."
-    exit 1
+# Check for python3 (not needed for cold-storage-only)
+if [ "$COLD_STORAGE_ONLY" = false ]; then
+    if ! command -v python3 &>/dev/null; then
+        err "python3 not found. Install Python 3 first."
+        exit 1
+    fi
+    ok "Python 3 found: $(python3 --version 2>&1)"
 fi
-ok "Python 3 found: $(python3 --version 2>&1)"
 
 # Check for systemd
 if ! command -v systemctl &>/dev/null; then
@@ -405,6 +434,46 @@ TSENV
     fi
 fi
 
+# ── Install LLM Cold Storage ────────────────────────────────
+if ([ "$COLD_STORAGE_ONLY" = true ] || ([ "$CLEANUP_ONLY" = false ] && [ "$PROXY_ONLY" = false ] && [ "$TOKEN_SPY_ONLY" = false ])) && [ "$CS_ENABLED" = "true" ]; then
+    info "Installing LLM Cold Storage..."
+
+    if [ ! -f "$SCRIPT_DIR/scripts/llm-cold-storage.sh" ]; then
+        err "scripts/llm-cold-storage.sh not found"
+        exit 1
+    fi
+
+    chmod +x "$SCRIPT_DIR/scripts/llm-cold-storage.sh"
+    ok "Cold storage script: $SCRIPT_DIR/scripts/llm-cold-storage.sh"
+
+    # Install systemd user timer
+    if [ "$HAS_SYSTEMD" = true ]; then
+        mkdir -p "$HOME/.config/systemd/user"
+
+        # Service — patch in config values
+        cp "$SCRIPT_DIR/systemd/llm-cold-storage.service" "$HOME/.config/systemd/user/"
+        sed -i "s|%h/Lighthouse-AI/scripts|$SCRIPT_DIR/scripts|g" "$HOME/.config/systemd/user/llm-cold-storage.service"
+        sed -i "s|%h/.cache/huggingface/hub|$CS_HF_CACHE|g" "$HOME/.config/systemd/user/llm-cold-storage.service"
+        sed -i "s|%h/llm-cold-storage|$CS_COLD_DIR|g" "$HOME/.config/systemd/user/llm-cold-storage.service"
+        # Remove User=%i (not needed for user services)
+        sed -i '/^User=%i/d' "$HOME/.config/systemd/user/llm-cold-storage.service"
+
+        # Timer
+        cp "$SCRIPT_DIR/systemd/llm-cold-storage.timer" "$HOME/.config/systemd/user/"
+
+        systemctl --user daemon-reload
+        systemctl --user enable llm-cold-storage.timer
+        systemctl --user start llm-cold-storage.timer
+
+        ok "Cold storage timer enabled (daily at 2am)"
+        info "  Dry-run first: $SCRIPT_DIR/scripts/llm-cold-storage.sh"
+        info "  Execute:       $SCRIPT_DIR/scripts/llm-cold-storage.sh --execute"
+    else
+        info "No systemd. Run manually:"
+        info "  HF_CACHE=$CS_HF_CACHE COLD_DIR=$CS_COLD_DIR $SCRIPT_DIR/scripts/llm-cold-storage.sh --execute"
+    fi
+fi
+
 # ── OpenClaw Config Reminder ──────────────────────────────────
 echo ""
 echo -e "${CYAN}═══════════════════════════════════════════════════════════${NC}"
@@ -452,5 +521,9 @@ if [ "$HAS_SYSTEMD" = true ]; then
        echo "  journalctl -u token-spy@${TS_AGENT_NAME} -f   # Watch Token Spy logs"
        echo "  curl http://localhost:${TS_PORT}/health       # Test Token Spy health"
    fi
+   if [ "$CS_ENABLED" = "true" ] && ([ "$COLD_STORAGE_ONLY" = true ] || ([ "$CLEANUP_ONLY" = false ] && [ "$PROXY_ONLY" = false ] && [ "$TOKEN_SPY_ONLY" = false ])); then
+       echo "  systemctl --user status llm-cold-storage.timer       # Check cold storage timer"
+       echo "  systemctl --user list-timers llm-cold-storage.timer  # Next run time"
+   fi
 fi
 echo ""
diff --git a/memory-shepherd/README.md b/memory-shepherd/README.md
index d7340219e..c64cb6f9d 100644
--- a/memory-shepherd/README.md
+++ b/memory-shepherd/README.md
@@ -52,8 +52,8 @@ Each reset cycle:
 
 ```bash
 # Clone the repo
-git clone https://github.com/Light-Heart-Labs/Android-Framework.git
-cd Android-Framework/memory-shepherd
+git clone https://github.com/Light-Heart-Labs/Lighthouse-AI.git
+cd Lighthouse-AI/memory-shepherd
 
 # Create your config from the example
 cp memory-shepherd.conf.example memory-shepherd.conf
diff --git a/scripts/llm-cold-storage.sh b/scripts/llm-cold-storage.sh
new file mode 100755
index 000000000..65954f881
--- /dev/null
+++ b/scripts/llm-cold-storage.sh
@@ -0,0 +1,246 @@
+#!/usr/bin/env bash
+#
+# llm-cold-storage.sh — Archive idle HuggingFace models to cold storage
+#
+# Part of Lighthouse AI tooling.
+#
+# Models not accessed in 7+ days are moved to cold storage on a backup drive.
+# A symlink replaces the original so HuggingFace cache resolution still works.
+# Models can be restored manually or are auto-detected if a process loads them.
+#
+# Usage:
+#   ./llm-cold-storage.sh                    # Archive idle models (dry-run)
+#   ./llm-cold-storage.sh --execute          # Archive idle models (for real)
+#   ./llm-cold-storage.sh --restore <model>  # Restore a specific model
+#   ./llm-cold-storage.sh --restore-all      # Restore all archived models
+#   ./llm-cold-storage.sh --status           # Show archive status
+#
+set -uo pipefail
+
+HF_CACHE="${HF_CACHE:-$HOME/.cache/huggingface/hub}"
+COLD_DIR="${COLD_DIR:-$HOME/llm-cold-storage}"
+LOG_FILE="${LOG_FILE:-$HOME/.local/log/llm-cold-storage.log}"
+MAX_IDLE_DAYS=7
+
+# Ensure the log and cold storage directories exist (mv fails if COLD_DIR is missing)
+mkdir -p "$(dirname "$LOG_FILE")" "$COLD_DIR"
+
+# Models to never archive (currently serving or critical)
+# Example:
+#   PROTECTED_MODELS=(
+#       "models--Qwen--Qwen3-Coder-Next-FP8"
+#   )
+PROTECTED_MODELS=(
+)
+
+log() {
+    local msg="[$(date '+%Y-%m-%d %H:%M:%S')] $*"
+    echo "$msg" | tee -a "$LOG_FILE"
+}
+
+is_protected() {
+    local name="$1"
+    for p in "${PROTECTED_MODELS[@]:-}"; do  # :- keeps set -u happy when the array is empty
+        [[ "$name" == "$p" ]] && return 0
+    done
+    return 1
+}
+
+is_model_in_use() {
+    local name="$1"
+    # Extract model identifier: models--Org--Name -> Org/Name
+    local model_id
+    model_id="$(echo "$name" | sed 's/^models--//; s/--/\//g')"
+
+    # Check if any running process references this model
+    if pgrep -af "$model_id" > /dev/null 2>&1; then
+        return 0
+    fi
+    return 1
+}
+
+get_last_access_days() {
+    local dir="$1"
+    # Check most recent access time across all blobs in the model
+    local newest_atime
+    newest_atime="$(find "$dir" -type f -printf '%A@\n' 2>/dev/null | sort -rn | head -1)"
+    if [[ -z "$newest_atime" ]]; then
+        echo "9999"
+        return
+    fi
+    local now
+    now="$(date +%s)"
+    local age_secs
+    age_secs=$(( now - ${newest_atime%.*} ))  # integer shell arithmetic; no bc dependency
+    echo "$(( age_secs / 86400 ))"
+}
+
+do_archive() {
+    local dry_run="${1:-true}"
+    local archived=0
+    local skipped=0
+
+    log "========== LLM cold storage scan started (dry_run=$dry_run) =========="
+
+    for model_dir in "$HF_CACHE"/models--*/; do
+        [[ -d "$model_dir" ]] || continue
+        # Skip if already a symlink (already archived)
+        [[ -L "${model_dir%/}" ]] && continue
+
+        local name
+        name="$(basename "$model_dir")"
+
+        # Skip protected models
+        if is_protected "$name"; then
+            log "SKIP (protected): $name"
+            ((skipped++))
+            continue
+        fi
+
+        # Skip if actively in use by a process
+        if is_model_in_use "$name"; then
+            log "SKIP (in use): $name"
+            ((skipped++))
+            continue
+        fi
+
+        local idle_days
+        idle_days="$(get_last_access_days "$model_dir")"
+        local size
+        size="$(du -sh "$model_dir" 2>/dev/null | cut -f1)"
+
+        if (( idle_days >= MAX_IDLE_DAYS )); then
+            if [[ "$dry_run" == "true" ]]; then
+                log "WOULD ARCHIVE: $name ($size, idle ${idle_days}d)"
+            else
+                log "ARCHIVING: $name ($size, idle ${idle_days}d)"
+                # Move to cold storage
+                mv "$model_dir" "$COLD_DIR/$name"
+                # Create symlink so HF cache still resolves
+                ln -s "$COLD_DIR/$name" "${model_dir%/}"
+                log "ARCHIVED: $name -> $COLD_DIR/$name"
+            fi
+            ((archived++))
+        else
+            log "SKIP (recent, ${idle_days}d): $name ($size)"
+            ((skipped++))
+        fi
+    done
+
+    log "========== Scan complete: $archived archived, $skipped skipped =========="
+}
+
+do_restore() {
+    local name="$1"
+
+    # Normalize: accept "Qwen/Qwen2.5-7B" or "models--Qwen--Qwen2.5-7B"
+    if [[ "$name" != models--* ]]; then
+        name="models--$(echo "$name" | sed 's/\//--/g')"
+    fi
+
+    local cold_path="$COLD_DIR/$name"
+    local cache_path="$HF_CACHE/$name"
+
+    if [[ ! -d "$cold_path" ]]; then
+        echo "ERROR: Model not found in cold storage: $cold_path"
+        exit 1
+    fi
+
+    # Remove symlink if it exists
+    if [[ -L "$cache_path" ]]; then
+        rm "$cache_path"
+    fi
+
+    log "RESTORING: $name to $cache_path"
+    mv "$cold_path" "$cache_path"
+    log "RESTORED: $name"
+    echo "Restored: $name"
+}
+
+do_restore_all() {
+    log "========== Restoring all archived models =========="
+    for cold_model in "$COLD_DIR"/models--*/; do
+        [[ -d "$cold_model" ]] || continue
+        local name
+        name="$(basename "$cold_model")"
+        local cache_path="$HF_CACHE/$name"
+
+        if [[ -L "$cache_path" ]]; then
+            rm "$cache_path"
+        fi
+
+        log "RESTORING: $name"
+        mv "$cold_model" "$cache_path"
+        log "RESTORED: $name"
+    done
+    log "========== All models restored =========="
+}
+
+show_status() {
+    echo "=== LLM Cold Storage Status ==="
+    echo ""
+
+    echo "Active models (on NVMe):"
+    for model_dir in "$HF_CACHE"/models--*/; do
+        [[ -d "$model_dir" ]] || continue
+        local name
+        name="$(basename "$model_dir")"
+        if [[ -L "${model_dir%/}" ]]; then
+            local size
+            size="$(du -sh "$model_dir" 2>/dev/null | cut -f1)"
+            echo "  [SYMLINK -> cold] $name ($size)"
+        else
+            local size idle_days status=""
+            size="$(du -sh "$model_dir" 2>/dev/null | cut -f1)"
+            idle_days="$(get_last_access_days "$model_dir")"
+            is_protected "$name" && status=" [protected]"
+            is_model_in_use "$name" && status=" [in use]"
+            echo "  [HOT] $name ($size, idle ${idle_days}d)${status}"
+        fi
+    done
+
+    echo ""
+    echo "Archived models (on backup SSD):"
+    local has_archived=false
+    for cold_model in "$COLD_DIR"/models--*/; do
+        [[ -d "$cold_model" ]] || continue
+        has_archived=true
+        local name size
+        name="$(basename "$cold_model")"
+        size="$(du -sh "$cold_model" 2>/dev/null | cut -f1)"
+        echo "  [COLD] $name ($size)"
+    done
+    $has_archived || echo "  (none)"
+
+    echo ""
+    echo "NVMe cache total:   $(du -sh "$HF_CACHE" 2>/dev/null | cut -f1)"
+    echo "Cold storage total: $(du -sh "$COLD_DIR" 2>/dev/null | cut -f1)"
+}
+
+case "${1:-}" in
+    --execute)
+        do_archive false
+        ;;
+    --restore)
+        [[ -n "${2:-}" ]] || { echo "Usage: $0 --restore <model>"; exit 1; }
+        do_restore "$2"
+        ;;
+    --restore-all)
+        do_restore_all
+        ;;
+    --status)
+        show_status
+        ;;
+    --help|-h)
+        echo "Usage: $0 [--execute|--restore <model>|--restore-all|--status|--help]"
+        echo ""
+        echo "  (no args)          Dry-run: show what would be archived"
+        echo "  --execute          Archive idle models (>$MAX_IDLE_DAYS days)"
+        echo "  --restore <model>  Restore model from cold storage"
+        echo "  --restore-all      Restore all archived models"
+        echo "  --status           Show current hot/cold status"
+        ;;
+    *)
+        do_archive true
+        ;;
+esac
diff --git a/scripts/session-cleanup.sh b/scripts/session-cleanup.sh
index 36f4e7e0a..9abb47b09 100644
--- a/scripts/session-cleanup.sh
+++ b/scripts/session-cleanup.sh
@@ -1,7 +1,7 @@
 #!/bin/bash
 # ═══════════════════════════════════════════════════════════════
-# Android Framework - Session Cleanup Script
-# https://github.com/Light-Heart-Labs/Android-Framework
+# Lighthouse AI - Session Cleanup Script
+# https://github.com/Light-Heart-Labs/Lighthouse-AI
 #
 # Prevents context overflow crashes by automatically managing
 # session file lifecycle. When a session file exceeds the size
diff --git a/scripts/vllm-tool-proxy.py b/scripts/vllm-tool-proxy.py
index 68096d914..7a5123a17 100644
--- a/scripts/vllm-tool-proxy.py
+++ b/scripts/vllm-tool-proxy.py
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Android Framework — vLLM Tool Call Proxy (v4)
+Lighthouse AI — vLLM Tool Call Proxy (v4)
 
 Bridges OpenClaw with local vLLM instances by handling three
 incompatibilities:
@@ -479,7 +479,7 @@ def health():
 @app.route('/')
 def root():
     return {
-        'service': 'Android Framework — vLLM Tool Call Proxy',
+        'service': 'Lighthouse AI — vLLM Tool Call Proxy',
         'version': 'v4',
         'vllm_url': VLLM_URL,
         'features': [
@@ -499,7 +499,7 @@ def root():
 # ═══════════════════════════════════════════════════════════════
 
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='Android Framework — vLLM Tool Call Proxy')
+    parser = argparse.ArgumentParser(description='Lighthouse AI — vLLM Tool Call Proxy')
     parser.add_argument('--port', type=int, default=int(os.environ.get('PROXY_PORT', '8003')),
                         help='Port to listen on (default: 8003, env: PROXY_PORT)')
     parser.add_argument('--vllm-url', type=str, default=VLLM_URL,
@@ -508,6 +508,6 @@ def root():
                         help='Host to bind to (default: 0.0.0.0)')
     args = parser.parse_args()
     VLLM_URL = args.vllm_url
-    logger.info(f'Starting Android Framework vLLM Tool Call Proxy v4')
+    logger.info(f'Starting Lighthouse AI vLLM Tool Call Proxy v4')
     logger.info(f'Listening on {args.host}:{args.port} -> {VLLM_URL}')
     app.run(host=args.host, port=args.port, threaded=True)
diff --git a/systemd/llm-cold-storage.service b/systemd/llm-cold-storage.service
new file mode 100644
index 000000000..c7fc07ba4
--- /dev/null
+++ b/systemd/llm-cold-storage.service
@@ -0,0 +1,13 @@
+[Unit]
+Description=LLM Cold Storage — Archive idle HuggingFace models
+Documentation=https://github.com/Light-Heart-Labs/Lighthouse-AI
+
+[Service]
+Type=oneshot
+ExecStart=%h/Lighthouse-AI/scripts/llm-cold-storage.sh --execute
+Environment=HF_CACHE=%h/.cache/huggingface/hub
+Environment=COLD_DIR=%h/llm-cold-storage
+Environment=LOG_FILE=%h/.local/log/llm-cold-storage.log
+
+# Run as the user who owns the HF cache
+User=%i
diff --git a/systemd/llm-cold-storage.timer b/systemd/llm-cold-storage.timer
new file mode 100644
index 000000000..93e0877f3
--- /dev/null
+++ b/systemd/llm-cold-storage.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=Run LLM Cold Storage daily at 2am
+Documentation=https://github.com/Light-Heart-Labs/Lighthouse-AI
+
+[Timer]
+OnCalendar=*-*-* 02:00:00
+RandomizedDelaySec=900
+Persistent=true
+
+[Install]
+WantedBy=timers.target
diff --git a/systemd/openclaw-session-cleanup.service b/systemd/openclaw-session-cleanup.service
index 796166ce0..db823bbc9 100644
--- a/systemd/openclaw-session-cleanup.service
+++ b/systemd/openclaw-session-cleanup.service
@@ -1,5 +1,5 @@
 [Unit]
-Description=Android Framework - Session Cleanup
+Description=Lighthouse AI - Session Cleanup
 After=network.target docker.service
 
 [Service]
diff --git a/systemd/openclaw-session-cleanup.timer b/systemd/openclaw-session-cleanup.timer
index e553f22eb..1515b0724 100644
--- a/systemd/openclaw-session-cleanup.timer
+++ b/systemd/openclaw-session-cleanup.timer
@@ -1,5 +1,5 @@
 [Unit]
-Description=Android Framework - Session Cleanup every __INTERVAL__ minutes
+Description=Lighthouse AI - Session Cleanup every __INTERVAL__ minutes
 
 [Timer]
 OnBootSec=__BOOT_DELAY__min
diff --git a/systemd/token-spy@.service b/systemd/token-spy@.service
index 42c7efc4a..eaa98f813 100644
--- a/systemd/token-spy@.service
+++ b/systemd/token-spy@.service
@@ -1,5 +1,5 @@
 [Unit]
-Description=Android Framework - Token Spy (%i)
+Description=Lighthouse AI - Token Spy (%i)
 After=network.target
 
 [Service]
diff --git a/systemd/vllm-tool-proxy.service b/systemd/vllm-tool-proxy.service
index 50bbe6a40..f0b80d454 100644
--- a/systemd/vllm-tool-proxy.service
+++ b/systemd/vllm-tool-proxy.service
@@ -1,5 +1,5 @@
 [Unit]
-Description=Android Framework - vLLM Tool Call Proxy
+Description=Lighthouse AI - vLLM Tool Call Proxy
 After=network.target docker.service
 Wants=docker.service