diff --git a/README.md b/README.md index bf13b7b..d4faddc 100644 --- a/README.md +++ b/README.md @@ -65,6 +65,7 @@ Embed AMD-optimized AI into end-user applications. | Skill | What it does | | --- | --- | | `local-ai-app-integration` | Add private, on-device AI to apps that use OpenAI, Anthropic, or Ollama APIs by bundling Embeddable Lemonade as a subprocess. | +| `local-ai-use` | Apply a Lemonade-first strategy so agents default to local image generation, text-to-speech, and speech-to-text to reduce token/cost usage before any cloud fallback. | ### Cross-stack porting @@ -226,7 +227,7 @@ See [AUTHORING.md](AUTHORING.md) for the full authoring guide, including when a ## Status -This repository is in its early days. The first in-repo skill, `skills/local-ai-app-integration/`, is available now and seeds the **Application integration** focus area. The Hardware-native, Cross-stack porting, and Profiling and delivery focus areas are being built out incrementally alongside manifests and CI. Expect rapid iteration. File an issue if there is a workflow you want covered, or open a PR with a skill you have been wanting to share. +This repository is in its early days. In-repo skills include `skills/local-ai-app-integration/` and `skills/local-ai-use/`, seeding the **Application integration** focus area. The Hardware-native, Cross-stack porting, and Profiling and delivery focus areas are being built out incrementally alongside manifests and CI. Expect rapid iteration. File an issue if there is a workflow you want covered, or open a PR with a skill you have been wanting to share. ## License diff --git a/skills/local-ai-use/SKILL.md b/skills/local-ai-use/SKILL.md new file mode 100644 index 0000000..62c6ae8 --- /dev/null +++ b/skills/local-ai-use/SKILL.md @@ -0,0 +1,256 @@ +--- +name: local-ai-use +description: >- + Routes image generation, text-to-speech, and speech-to-text through a local + Lemonade Server so the agent stops paying for cloud calls on those modalities. 
+ Use when the user wants to save tokens, save cost, or save money using local + AI; default to local, offline, on-device, or private image generation, + transcription, or text-to-speech in this workspace; stop using DALL-E, + Whisper-as-a-service, ElevenLabs, or other paid multimodal APIs; route the + agent's image, TTS, or STT tool calls to a local model; or mentions Lemonade + Server, OmniRouter, SD-Turbo, kokoro, Whisper, Ryzen AI, NPU/iGPU/dGPU + inference, or "use local for images but cloud for chat". Run once per + workspace; the rule it installs handles every later request. +--- + +# Local AI Use (route image, TTS, STT through Lemonade) + +This is a **meta-skill**. You run it once. After that, every later request that +needs image generation, text-to-speech, or speech-to-text uses the local +[Lemonade Server](https://lemonade-server.ai) instead of a cloud API. The +agent's own LLM keeps handling text; only the expensive multimodal calls move +on-device. + +The skill does two things: + +1. **Verifies that local Lemonade is reachable and has the right models.** +2. **Drops a `Local AI Use` block into the workspace `AGENTS.md`** so the agent + reads the routing rule on every later turn, in Cursor, Claude Code, Codex, + Gemini CLI, and any other agent that respects `AGENTS.md`. + +## When to use this skill + +Use this skill when **all** of the following are true: + +- The user has, or is willing to install, the system-wide Lemonade Server. +- The user accepts the default Lemonade endpoint `http://localhost:13305`. +- The user wants the change to be **persistent** across future turns and + agent restarts (the rule is written to disk). + +If the user is instead **embedding** Lemonade as a private subprocess inside +an app installer, do not use this skill; use `local-ai-app-integration` +instead. + +## Prerequisites + +- **OS:** Windows 11 x64, Ubuntu/Debian x64, or macOS (beta). +- **Lemonade Server CLI on `PATH`:** verify with `lemonade --version`. 
If + missing, install from + before continuing. Do not silently install on the user's machine; that is a + system-wide change and must be the user's call. +- **Disk:** ~8 GB free for the three default models (SD-Turbo + Whisper-Tiny + + kokoro-v1). +- **Network:** required for the first `lemonade pull` of each model. After + that, every modality runs offline. + +## The opinionated path + +Run this checklist top to bottom. Track progress against it; do not move on +until each step verifies. + +``` +[ ] 1. Confirm Lemonade Server is installed and reachable +[ ] 2. Pull the three default modality models +[ ] 3. Install the routing rule into the workspace AGENTS.md +[ ] 4. Smoke-test image, TTS, and STT against the local endpoint +``` + +The single command that does steps 1, 2, and 3 in one shot is: + +```bash +python scripts/setup_local_ai.py +``` + +(Run from this skill's folder.) The script is idempotent: re-running it on a +fully configured workspace is a no-op apart from a healthcheck. Read the +sections below for what to do when each step fails. + +--- + +## Step 1: confirm Lemonade Server is reachable + +Run: + +```bash +lemonade status --json +``` + +Two acceptable outcomes: + +| `lemonade status` says | Action | +|---|---| +| `Server is running on port 13305` | Continue to Step 2. | +| `Server is not running` | Start it. On Windows, launch the **Lemonade** Start Menu shortcut. On Linux, run `sudo systemctl start lemonade-server`. Re-check `lemonade status`. | + +If `lemonade` is not on `PATH` at all, the server is not installed. Stop and +point the user at . Do not +attempt a silent install. + +The rest of this skill assumes the endpoint is `http://localhost:13305/api/v1` +and no API key is required (the system-wide server defaults to no auth on +loopback). If the user has set `LEMONADE_API_KEY`, the routing rule template +in `templates/local-ai-rule.md` shows where to add the `Authorization` header. 
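
The reachability check and the optional `Authorization` header described in Step 1 can be sketched in Python as follows. This is a minimal illustration, not part of the skill's script: the function names `auth_headers` and `server_is_healthy` are hypothetical, and it assumes the `/api/v1/health` readiness path and the Bearer scheme described elsewhere in this skill.

```python
import os
import urllib.error
import urllib.request


def auth_headers(env=None):
    """Return an Authorization header only when LEMONADE_API_KEY is set (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("LEMONADE_API_KEY", "").strip()
    return {"Authorization": f"Bearer {key}"} if key else {}


def server_is_healthy(base_url="http://localhost:13305/api/v1", timeout_s=3.0, env=None):
    """Return True if GET <base_url>/health answers 200, False on any HTTP or network error."""
    req = urllib.request.Request(f"{base_url}/health", headers=auth_headers(env))
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx answers such as 401.
        return False
```

Keeping the header logic in one helper means the no-auth loopback default and the keyed setup share the same request code.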
+ +## Step 2: pull the three default modality models + +Pull these three. They are the **Lite Collection** defaults from Lemonade +OmniRouter, sized to keep token-and-cost savings real on commodity hardware: + +| Modality | Model | Size | Why this default | +|---|---|---|---| +| Image generation | `SD-Turbo` | ~5 GB | Single-step generation, runs on CPU and AMD iGPU/dGPU | +| Text-to-speech | `kokoro-v1` | ~0.3 GB | Only TTS model Lemonade currently supports; CPU-only, low latency | +| Speech-to-text | `Whisper-Tiny` | ~0.1 GB | Smallest Whisper; fast on CPU. Upgrade to `Whisper-Large-v3-Turbo` if accuracy matters more than latency. | + +```bash +lemonade pull SD-Turbo +lemonade pull kokoro-v1 +lemonade pull Whisper-Tiny +``` + +Each `pull` is idempotent. To verify what is already downloaded: + +```bash +lemonade list --downloaded +``` + +For coverage of larger / higher-quality alternatives (`SDXL-Turbo`, +`Flux-2-Klein-4B`, `Whisper-Large-v3-Turbo`), see the +[model picker in reference.md](reference.md#model-picker). + +## Step 3: install the routing rule into AGENTS.md + +The rule is a Markdown block stored in [`templates/local-ai-rule.md`](templates/local-ai-rule.md). +Append it to the workspace's `AGENTS.md` (create the file if missing). Both +Cursor and Claude Code load `AGENTS.md` automatically on every turn, so the +agent will see the rule on its next message without any further setup. + +`scripts/setup_local_ai.py` does this for you, surrounded by stable markers +so re-running the script replaces the block in place rather than appending +a second copy. The markers look like: + +``` + +...rule... + +``` + +If you write the file by hand, keep those exact markers. The script relies +on them for idempotent updates. 
+ +If the user's agent only respects a different convention, mirror the same +block to: + +- `CLAUDE.md` (Claude Code, project-scoped) or `~/.claude/CLAUDE.md` (global) +- `.cursor/rules/local-ai-use.mdc` (Cursor user/project rules) +- `GEMINI.md` (Gemini CLI) + +The rule's content is identical; only the file location changes. + +## Step 4: smoke-test the three modalities + +Verify each modality against the live server before declaring success. These +mirror the inline patterns in the installed rule, so a green pass here means +the rule will work. + +**Image generation** (writes `out.png`): + +```bash +curl -sX POST http://localhost:13305/api/v1/images/generations \ + -H "Content-Type: application/json" \ + -d '{"model":"SD-Turbo","prompt":"a single red apple on a white table","size":"512x512","steps":4,"response_format":"b64_json"}' \ + | python -c "import sys,json,base64; open('out.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))" +``` + +**Text-to-speech** (writes `out.mp3`): + +```bash +curl -sX POST http://localhost:13305/api/v1/audio/speech \ + -H "Content-Type: application/json" \ + -d '{"model":"kokoro-v1","input":"Local AI is now active.","response_format":"mp3"}' \ + -o out.mp3 +``` + +**Speech-to-text** (round-trips `out.mp3` → text via a wav re-encode): + +```bash +ffmpeg -y -i out.mp3 -ar 16000 -ac 1 out.wav +curl -sX POST http://localhost:13305/api/v1/audio/transcriptions \ + -F "file=@out.wav" -F "model=Whisper-Tiny" +``` + +If any of the three returns a non-2xx status, fix it now. The rule we just +installed sends future requests to these same endpoints, so a broken endpoint +becomes a broken user experience. + +--- + +## What changes after this skill runs + +From the next turn onward, the agent reads the rule in `AGENTS.md` on every +message. The rule explicitly tells the agent: + +- **For image generation:** call `POST /api/v1/images/generations` on the + local server. 
Do **not** call any cloud image API and do **not** use the + built-in `GenerateImage` tool (that path bills tokens to the cloud + provider). +- **For text-to-speech:** call `POST /api/v1/audio/speech`. Do **not** call + cloud TTS providers (OpenAI TTS, ElevenLabs, etc.). +- **For speech-to-text:** call `POST /api/v1/audio/transcriptions`. Do + **not** call cloud transcription providers. +- **Fallback:** only fall back to a cloud API after one local attempt has + failed *and* the user has been told the local call failed. Never + silently fall back; the whole point of this skill is to keep cost + predictable. + +The agent's own text reasoning continues to use whatever LLM Cursor / Claude +Code / Codex is configured with. This skill does not redirect chat tokens; +it only redirects the multimodal calls that would otherwise leave the +machine. + +## Troubleshooting cheatsheet + +| Symptom | Cause | Recovery | +|---|---|---| +| `lemonade: command not found` | Server CLI not installed | Install from ; restart shell. | +| `Server is not running` | Service stopped after install | Windows: launch the **Lemonade** Start Menu shortcut. Linux: `sudo systemctl start lemonade-server`. | +| `POST /v1/images/generations` returns 404 model not found | Image model not downloaded | `lemonade pull SD-Turbo` and retry. | +| Image generation is slow on CPU (~4–5 min) | sd-cpp on CPU backend | Install the GPU backend on supported AMD hardware: `lemonade backends install sd-cpp:rocm`. | +| `POST /v1/audio/transcriptions` returns 400 unsupported format | Input is not 16 kHz mono WAV | Re-encode with `ffmpeg -i in.* -ar 16000 -ac 1 out.wav`. | +| `POST /v1/audio/speech` returns 404 | TTS model not downloaded | `lemonade pull kokoro-v1`. | +| 401 Unauthorized on every request | User has set `LEMONADE_API_KEY` | Add `Authorization: Bearer $LEMONADE_API_KEY` to every request and to the rule block. 
| + +## Verification checklist + +Mark this skill complete only when **all** of the following are true: + +- [ ] `lemonade status --json` reports the server running on port 13305. +- [ ] `lemonade list --downloaded` shows `SD-Turbo`, `kokoro-v1`, and + `Whisper-Tiny`. +- [ ] The workspace `AGENTS.md` contains the + `amd-skills:local-ai-use` block. +- [ ] All three smoke tests in Step 4 succeed. +- [ ] On a follow-up turn, asking the agent to "generate an image of X" + causes it to POST to `http://localhost:13305/api/v1/images/generations` + rather than calling a cloud tool. + +If any box is unchecked, the user is still paying cloud cost for at least +one modality. + +--- + +## Reference + +For the full model picker, alternate-quality options, the complete endpoint +reference, the API-key flow, and the OmniRouter tool definitions you can +hand to an agent's tool-calling loop, see [reference.md](reference.md). diff --git a/skills/local-ai-use/reference.md b/skills/local-ai-use/reference.md new file mode 100644 index 0000000..69e6775 --- /dev/null +++ b/skills/local-ai-use/reference.md @@ -0,0 +1,241 @@ +# Local AI Use: Reference + +Detailed reference for the `local-ai-use` skill. Read this only when the +default path in `SKILL.md` doesn't cover a decision. + +## Contents + +- [Model picker](#model-picker) +- [Endpoint reference](#endpoint-reference) +- [API key handling](#api-key-handling) +- [Hardware-accelerated backends](#hardware-accelerated-backends) +- [OmniRouter tool definitions](#omnirouter-tool-definitions) +- [Re-pointing the rule at a remote host](#re-pointing-the-rule-at-a-remote-host) +- [Removing the rule](#removing-the-rule) + +--- + +## Model picker + +The default trio (`SD-Turbo`, `kokoro-v1`, `Whisper-Tiny`) is sized for +"keeps cost savings real on a typical laptop". Override only if the user +asks for higher quality or has explicit hardware to spare. 
+ +### Image generation (`recipe: sd-cpp`) + +| Model | Approx size | When to use | Trade-off | +|---|---|---|---| +| `SD-Turbo` | ~5 GB | **Default.** General-purpose, few-step generation (this skill uses 4 steps). | Lower fidelity than SDXL. | +| `SDXL-Turbo` | ~6.9 GB | When the user notices quality issues with SD-Turbo. | Larger model, slower on CPU. | +| `SD-1.5` | ~4 GB | When the user asks for "Stable Diffusion 1.5" by name. | Needs more steps (~20). | +| `Flux-2-Klein-4B` | ~4 GB | Image **editing** (`/v1/images/edits`). | Editing-capable, slower than SD-Turbo for plain generation. | + +To upgrade: `lemonade pull `, then change `"model"` in the rule +block in `AGENTS.md` to the new model id. + +### Text-to-speech (`recipe: kokoro`) + +| Model | Approx size | When to use | +|---|---|---| +| `kokoro-v1` | ~0.3 GB | **Default and only supported model today.** CPU-only, low latency. | + +Voices: `shimmer` (default), plus all OpenAI-named voices (`alloy`, `ash`, +`ballad`, `coral`, `echo`, `fable`, `nova`, `onyx`, `sage`, `verse`) and the +kokoro-native voices (`af_sky`, `am_echo`, `bf_emma`, `bm_george`, ...). +Pass `voice` in the request body to override. + +### Speech-to-text (`recipe: whispercpp`) + +| Model | Approx size | When to use | +|---|---|---| +| `Whisper-Tiny` | ~0.1 GB | **Default.** English-only fast path; great for short clips and meeting notes. | +| `Whisper-Base` | ~0.3 GB | Slightly better accuracy, still tiny. | +| `Whisper-Small` | ~1.0 GB | Multilingual, modest CPU cost. | +| `Whisper-Large-v3-Turbo` | ~1.6 GB | Best balance of quality and latency; recommended when accuracy matters. | + +Whisper requires 16 kHz mono PCM WAV input. Convert anything else first: + +```bash +ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav +``` + +For the full, current model list, run `lemonade list` after starting the server, or +browse . 
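
Because Whisper only accepts 16 kHz mono PCM WAV, it can help to check a file before posting it to the transcription endpoint. The sketch below uses Python's standard `wave` module; the helper name `is_whisper_ready` is hypothetical, and it checks only sample rate and channel count, which are the two properties the requirement above names.

```python
import wave


def is_whisper_ready(path):
    """Return True if `path` is a readable PCM WAV file at 16 kHz with one channel."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == 16000 and wav.getnchannels() == 1
    except (wave.Error, EOFError, OSError):
        # Not a WAV file, truncated, unsupported encoding, or missing on disk.
        return False
```

A file that fails this check can be re-encoded with the `ffmpeg` command shown above and checked again.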
+ +--- + +## Endpoint reference + +All multimodal endpoints accept the standard OpenAI request shape and +return the standard OpenAI response shape, so any OpenAI-compatible client +works (`openai-python`, `openai-node`, `openai-dotnet`, `go-openai`, ...). + +| Method | Path | Purpose | Backend | +|---|---|---|---| +| `POST` | `/api/v1/images/generations` | text → image (b64) | `sd-cpp` | +| `POST` | `/api/v1/images/edits` | image + prompt → image | `sd-cpp` | +| `POST` | `/api/v1/images/variations` | image → varied image | `sd-cpp` | +| `POST` | `/api/v1/images/upscale` | image → upscaled image | ESRGAN | +| `POST` | `/api/v1/audio/speech` | text → audio file | `kokoro` | +| `POST` | `/api/v1/audio/transcriptions` | wav → text | `whispercpp` | +| `WS` | `/realtime` | streaming microphone → text | `whispercpp` | +| `GET` | `/api/v1/models` | list models (add `?show_all=true` for catalog) | n/a | +| `GET` | `/api/v1/health` | readiness probe | n/a | + +Notable per-endpoint quirks: + +- **`/v1/images/generations`**: only `n=1` and `response_format=b64_json` + are supported today. `size` defaults to `512x512`. The model's + `image_defaults` (steps / cfg_scale / width / height) returned by + `/v1/models/{id}` are the right values to use as your defaults. +- **`/v1/images/edits`**: `multipart/form-data` (not JSON). `mask` is + optional; without one the entire image is the editable region. +- **`/v1/audio/transcriptions`**: only `wav` input and `json` response are + supported today. Non-WAV input must be re-encoded with `ffmpeg`. +- **`/v1/audio/speech`**: `mp3`, `wav`, `opus`, and `pcm` outputs supported. + Streaming requires `stream_format: "audio"`, which only emits `pcm`. + +For the full parameter list of any endpoint, see `lemonade/docs/api/openai.md` +upstream. + +--- + +## API key handling + +The system-wide Lemonade Server defaults to **no auth** on `localhost`. 
If +the user has set `LEMONADE_API_KEY` (rare for the system-wide flow, +standard for the embeddable flow), every HTTP request must carry: + +``` +Authorization: Bearer ${LEMONADE_API_KEY} +``` + +Update both: + +1. The shell environment that the agent runs commands in + (`export LEMONADE_API_KEY=...`). +2. The rule block in `AGENTS.md`. Add a sentence near the top of the rule + that says "send `Authorization: Bearer $LEMONADE_API_KEY` on every + request". The rule already mentions this, but it is worth making + prominent. + +--- + +## Hardware-accelerated backends + +The default install ships the broad-compatibility backends. To get GPU / NPU +acceleration on supported AMD hardware: + +| Modality | Recipe | Faster backend | Install command | +|---|---|---|---| +| Image gen | `sd-cpp` | `rocm` (Strix Halo, RDNA3/4) | `lemonade backends install sd-cpp:rocm` | +| LLM (separate skill) | `llamacpp` | `rocm` or `vulkan` | `lemonade backends install llamacpp:rocm` | +| ASR | `whispercpp` | `npu` (XDNA2) | `lemonade backends install whispercpp:npu` | +| TTS | `kokoro` | `cpu` only | n/a | + +After installing a backend, pin it with `lemonade config +set`: + +```bash +lemonade config set sdcpp_backend rocm +lemonade config set whispercpp_backend npu +``` + +These settings are persisted in `config.json` and apply on the next model load. + +--- + +## OmniRouter tool definitions + +If the agent uses an OpenAI-style tool-calling loop (Continue, OpenHands, +custom code) instead of plain HTTP, register the same endpoints as named +tools so the LLM can pick them on its own.
Lemonade publishes canonical +tool schemas under `OmniRouter`; the minimum useful set for this skill is: + +```json +[ + { + "type": "function", + "function": { + "name": "generate_image", + "description": "Generate an image from a text prompt using local Lemonade Server.", + "parameters": { + "type": "object", + "properties": { + "prompt": {"type": "string"}, + "size": {"type": "string", "default": "512x512"}, + "steps": {"type": "integer", "default": 4} + }, + "required": ["prompt"] + } + } + }, + { + "type": "function", + "function": { + "name": "text_to_speech", + "description": "Speak the given text aloud using local Lemonade Server.", + "parameters": { + "type": "object", + "properties": { + "input": {"type": "string"}, + "voice": {"type": "string", "default": "shimmer"} + }, + "required": ["input"] + } + } + }, + { + "type": "function", + "function": { + "name": "transcribe_audio", + "description": "Transcribe a WAV audio file using local Lemonade Server.", + "parameters": { + "type": "object", + "properties": { + "file_path": {"type": "string"}, + "language": {"type": "string"} + }, + "required": ["file_path"] + } + } + } +] +``` + +Map each `tool_call` to the corresponding endpoint: + +- `generate_image` → `POST /api/v1/images/generations` +- `text_to_speech` → `POST /api/v1/audio/speech` +- `transcribe_audio` → `POST /api/v1/audio/transcriptions` + +For the full canonical schema (including `edit_image` and `analyze_image`), +read `examples/lemonade_tools.py` in the upstream lemonade-sdk repo. + +--- + +## Re-pointing the rule at a remote host + +Lemonade can run on another machine (a workstation with a Ryzen AI NPU, +say) while the agent runs on the laptop. To point this skill at it: + +1. Set `LEMONADE_HOST` and `LEMONADE_PORT` (or pass `--host` / `--port` to + `setup_local_ai.py`). +2. Re-run `python scripts/setup_local_ai.py` so the rule block is rewritten + with the new endpoint baked in. +3. 
Make sure the remote server is bound to a non-loopback interface + (`lemonade config set host 0.0.0.0`) and that firewall rules allow + inbound 13305. Setting `host` to `0.0.0.0` exposes the server; pair it + with `LEMONADE_API_KEY` so it isn't open to the LAN. + +--- + +## Removing the rule + +To stop routing locally (e.g., the user wants cloud back), open the +workspace `AGENTS.md` and delete everything between +`` and +``. The agent picks up the change on +its next turn. + +The downloaded models stay on disk; remove them with `lemonade delete +` if you want the space back. diff --git a/skills/local-ai-use/scripts/setup_local_ai.py b/skills/local-ai-use/scripts/setup_local_ai.py new file mode 100644 index 0000000..552a3f6 --- /dev/null +++ b/skills/local-ai-use/scripts/setup_local_ai.py @@ -0,0 +1,266 @@ +#!/usr/bin/env -S uv run --quiet +# /// script +# requires-python = ">=3.10" +# dependencies = [] +# /// +"""One-shot setup for the `local-ai-use` skill. + +Performs the three setup steps from SKILL.md: + + 1. Confirms the system-wide Lemonade Server is installed and reachable on + http://localhost:13305 (override with --host / --port or LEMONADE_HOST / + LEMONADE_PORT). + 2. Pulls the three default modality models if they are missing + (image: SD-Turbo, TTS: kokoro-v1, STT: Whisper-Tiny). + 3. Writes the routing rule from `templates/local-ai-rule.md` into + /AGENTS.md, between stable BEGIN/END markers so re-runs + replace the block in place rather than appending. + +The script is idempotent: a second run on a fully configured workspace only +re-runs the healthcheck. It exits non-zero on any unrecoverable failure. + +Constants are documented inline; nothing is magical. +""" + +from __future__ import annotations + +import argparse +import json +import os +import shutil +import subprocess +import sys +import urllib.error +import urllib.request +from pathlib import Path + +# Defaults match the system-wide Lemonade Server install. 
Both the CLI +# (LEMONADE_HOST / LEMONADE_PORT) and the OpenAI-compatible HTTP endpoints +# bind to these by default. +DEFAULT_HOST = "127.0.0.1" +DEFAULT_PORT = 13305 + +# The Lite Collection from Lemonade OmniRouter. Picked because each fits in +# under ~5 GB and runs on commodity CPU hardware, so the savings vs. cloud +# calls are real on a typical developer laptop. See SKILL.md for upgrade +# paths. +DEFAULT_MODELS = ("SD-Turbo", "kokoro-v1", "Whisper-Tiny") + +# Stable markers around the rule block in AGENTS.md. The script rewrites the +# region between these markers in place; do not change the marker strings or +# every existing AGENTS.md will get a duplicate block on the next run. +BEGIN_MARKER = "" +END_MARKER = "" + +SKILL_DIR = Path(__file__).resolve().parent.parent +RULE_TEMPLATE = SKILL_DIR / "templates" / "local-ai-rule.md" + +INSTALL_URL = "https://lemonade-server.ai/install_options.html" + + +def _print(msg: str) -> None: + """Single-line, prefix-tagged status print so the agent's output stays parseable.""" + print(f"[local-ai-use] {msg}", flush=True) + + +def _http_get(url: str, timeout_s: float) -> tuple[int, bytes]: + req = urllib.request.Request(url) + with urllib.request.urlopen(req, timeout=timeout_s) as r: # noqa: S310 + return r.status, r.read() + + +def check_cli_installed() -> bool: + """Return True if the `lemonade` CLI is on PATH.""" + return shutil.which("lemonade") is not None + + +def check_server_reachable(host: str, port: int) -> bool: + """Return True if /api/v1/health responds 200 within 3 seconds.""" + url = f"http://{host}:{port}/api/v1/health" + try: + status, _ = _http_get(url, timeout_s=3.0) + return status == 200 + except (urllib.error.URLError, OSError): + return False + + +def list_downloaded_models() -> set[str]: + """Return the set of locally downloaded model IDs. + + Uses `lemonade list --downloaded` (CLI) and falls back to + GET /api/v1/models when the CLI lacks the flag. 
Returning an empty set is + treated as "could not determine" by the caller, which still attempts the + pulls; `lemonade pull` is itself idempotent. + """ + try: + out = subprocess.run( + ["lemonade", "list", "--downloaded", "--json"], + check=True, capture_output=True, text=True, timeout=10, + ).stdout + data = json.loads(out) + return {m.get("id", "") for m in data if isinstance(m, dict)} + except (subprocess.SubprocessError, json.JSONDecodeError, FileNotFoundError): + pass + + try: + # Honor LEMONADE_HOST / LEMONADE_PORT rather than hardcoding the + # default endpoint (command-line --host / --port are not visible here). + host = os.environ.get("LEMONADE_HOST", DEFAULT_HOST) + port = os.environ.get("LEMONADE_PORT", DEFAULT_PORT) + status, body = _http_get(f"http://{host}:{port}/api/v1/models", timeout_s=5) + if status == 200: + data = json.loads(body) + return { + m.get("id", "") for m in data.get("data", []) + if isinstance(m, dict) and m.get("downloaded") + } + except (urllib.error.URLError, OSError, json.JSONDecodeError): + pass + + return set() + + +def pull_model(model: str) -> bool: + """Run `lemonade pull `. Returns True on success.""" + _print(f"pulling {model}...") + try: + subprocess.run( + ["lemonade", "pull", model], + check=True, + # Stream output so the user sees the download progress instead of + # staring at a frozen prompt; SD-Turbo is several GB. + stdout=None, stderr=None, + # SD-Turbo is the largest pull at ~5 GB. 30 minutes is generous + # for a slow connection; below that we'd false-positive on real + # downloads. + timeout=30 * 60, + ) + return True + except subprocess.CalledProcessError as exc: + _print(f"pull failed for {model} (exit {exc.returncode})") + return False + except subprocess.TimeoutExpired: + _print(f"pull timed out for {model} after 30 minutes") + return False + + +def render_rule_block() -> str: + """Read the rule template; pass through unchanged. + + The template already includes BEGIN/END markers and matches the constants + at the top of this file. We re-validate that here so a future template + edit cannot silently drift away from the markers the writer relies on.
+ """ + if not RULE_TEMPLATE.exists(): + raise FileNotFoundError( + f"Rule template missing: {RULE_TEMPLATE}. " + "Did the skill folder get partially copied?" + ) + text = RULE_TEMPLATE.read_text(encoding="utf-8") + if BEGIN_MARKER not in text or END_MARKER not in text: + raise ValueError( + "Rule template is missing the BEGIN/END markers; refuse to write " + "AGENTS.md because re-runs would append duplicate blocks." + ) + return text.strip() + "\n" + + +def upsert_agents_md(workspace: Path) -> Path: + """Write or replace the rule block inside /AGENTS.md.""" + target = workspace / "AGENTS.md" + block = render_rule_block() + + if not target.exists(): + target.write_text( + "# Agent instructions\n\n" + "Project-scoped rules picked up automatically by Cursor, Claude Code,\n" + "Codex, Gemini CLI, and other AGENTS.md-aware coding agents.\n\n" + f"{block}", + encoding="utf-8", + ) + _print(f"created {target}") + return target + + existing = target.read_text(encoding="utf-8") + if BEGIN_MARKER in existing and END_MARKER in existing: + before, _, rest = existing.partition(BEGIN_MARKER) + _, _, after = rest.partition(END_MARKER) + # Strip trailing newline noise around the spliced region so we don't + # accumulate blank lines on every re-run. + new = before.rstrip() + "\n\n" + block + after.lstrip() + if new == existing: + _print(f"AGENTS.md rule already up to date at {target}") + return target + target.write_text(new, encoding="utf-8") + _print(f"updated rule block in {target}") + return target + + # No existing block: append with a separating blank line. 
+ if not existing.endswith("\n"): + existing += "\n" + target.write_text(existing + "\n" + block, encoding="utf-8") + _print(f"appended rule block to {target}") + return target + + +def main(argv: list[str] | None = None) -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--workspace", + type=Path, + default=Path.cwd(), + help="Workspace root where AGENTS.md should be written (default: cwd).", + ) + parser.add_argument( + "--host", + default=os.environ.get("LEMONADE_HOST", DEFAULT_HOST), + help="Lemonade Server host (default: 127.0.0.1 / $LEMONADE_HOST).", + ) + parser.add_argument( + "--port", + type=int, + default=int(os.environ.get("LEMONADE_PORT", str(DEFAULT_PORT))), + help="Lemonade Server port (default: 13305 / $LEMONADE_PORT).", + ) + parser.add_argument( + "--skip-pull", + action="store_true", + help="Do not pull missing models; just verify and write AGENTS.md.", + ) + args = parser.parse_args(argv) + + if not check_cli_installed(): + _print("FAIL: `lemonade` is not on PATH.") + _print(f"Install Lemonade Server first: {INSTALL_URL}") + return 2 + + if not check_server_reachable(args.host, args.port): + _print( + f"FAIL: Lemonade Server is not responding at " + f"http://{args.host}:{args.port}/api/v1/health." + ) + _print( + "Start it: on Windows launch the Lemonade Start Menu shortcut; " + "on Linux run `sudo systemctl start lemonade-server`." + ) + return 3 + + _print(f"server reachable at http://{args.host}:{args.port}") + + if not args.skip_pull: + downloaded = list_downloaded_models() + for model in DEFAULT_MODELS: + if model in downloaded: + _print(f"already downloaded: {model}") + continue + if not pull_model(model): + # Surface the failure but keep going so the user at least gets + # the rule installed for the modalities that did succeed. + _print( + f"continuing without {model}; the rule will reference it " + "but calls will 404 until you pull it." 
+ ) + + upsert_agents_md(args.workspace.resolve()) + _print("done. Future image, TTS, and STT requests now route to local Lemonade.") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/skills/local-ai-use/templates/local-ai-rule.md b/skills/local-ai-use/templates/local-ai-rule.md new file mode 100644 index 0000000..75c75e6 --- /dev/null +++ b/skills/local-ai-use/templates/local-ai-rule.md @@ -0,0 +1,89 @@ + +## Local AI Use (managed by amd-skills/local-ai-use) + +This workspace runs a local [Lemonade Server](https://lemonade-server.ai) at +`http://localhost:13305/api/v1` to handle image generation, text-to-speech, +and speech-to-text. Cloud APIs cost real money and tokens for these +modalities, so use the local endpoints below for **every** such request +unless the user explicitly asks for a cloud provider. + +### Routing rules + +| If the user asks you to... | Do this | Do not do this | +|---|---|---| +| Generate, draw, or create an image | `POST http://localhost:13305/api/v1/images/generations` | Use the built-in `GenerateImage` tool. Call DALL-E, Midjourney, Stable Diffusion via cloud, or any other paid image API. | +| Edit or modify an existing image | `POST http://localhost:13305/api/v1/images/edits` (multipart) | Cloud image-edit APIs. | +| Speak text aloud or generate audio narration | `POST http://localhost:13305/api/v1/audio/speech` | Call OpenAI TTS, ElevenLabs, PlayHT, or any other cloud TTS. | +| Transcribe an audio file | `POST http://localhost:13305/api/v1/audio/transcriptions` | Call OpenAI Whisper-as-a-service, AssemblyAI, Deepgram, or any other cloud STT. | + +Plain text chat, code generation, and reasoning continue to use the agent's +configured LLM. This rule only redirects the multimodal calls. + +### Defaults to use + +| Endpoint | Model | Notes | +|---|---|---| +| `/v1/images/generations` | `SD-Turbo` | 4 steps, `cfg_scale: 1.0`, `512x512`, `response_format: "b64_json"`. 
| +| `/v1/audio/speech` | `kokoro-v1` | Default voice `shimmer`; `response_format: "mp3"`. | +| `/v1/audio/transcriptions` | `Whisper-Tiny` | Input must be 16 kHz mono WAV. Re-encode with `ffmpeg -i in.* -ar 16000 -ac 1 out.wav`. | + +If `LEMONADE_API_KEY` is set in the environment, send +`Authorization: Bearer $LEMONADE_API_KEY` on every request. Otherwise the +loopback server accepts unauthenticated calls. + +### Ready-to-use call patterns + +**Image generation** (saves to `out.png`): + +```bash +curl -sX POST http://localhost:13305/api/v1/images/generations \ + -H "Content-Type: application/json" \ + -d '{"model":"SD-Turbo","prompt":"PROMPT_HERE","size":"512x512","steps":4,"response_format":"b64_json"}' \ + | python -c "import sys,json,base64; open('out.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))" +``` + +Equivalent Python via the OpenAI SDK: + +```python +from openai import OpenAI +import base64 +client = OpenAI(base_url="http://localhost:13305/api/v1", api_key="lemonade") +r = client.images.generate(model="SD-Turbo", prompt="PROMPT_HERE", size="512x512") +open("out.png", "wb").write(base64.b64decode(r.data[0].b64_json)) +``` + +**Text-to-speech** (saves to `out.mp3`): + +```bash +curl -sX POST http://localhost:13305/api/v1/audio/speech \ + -H "Content-Type: application/json" \ + -d '{"model":"kokoro-v1","input":"TEXT_HERE","voice":"shimmer","response_format":"mp3"}' \ + -o out.mp3 +``` + +**Speech-to-text** (returns JSON `{"text": "..."}`): + +```bash +ffmpeg -y -i INPUT_AUDIO -ar 16000 -ac 1 _stt.wav +curl -sX POST http://localhost:13305/api/v1/audio/transcriptions \ + -F "file=@_stt.wav" -F "model=Whisper-Tiny" +``` + +### Failure handling + +1. Try the local endpoint exactly once. +2. If the server is unreachable, run `lemonade status` and surface the + result to the user before doing anything else. +3. If the model is missing, run `lemonade pull ` and retry once. +4. 
Only after that, ask the user before falling back to a cloud provider. + Never silently fall back; the whole point of this rule is predictable + cost. + +### Re-pointing to a different host + +If the user runs Lemonade on a different host or port, replace the +`http://localhost:13305` prefix everywhere above with their endpoint, and +update `LEMONADE_HOST` / `LEMONADE_PORT` in the shell environment so the +`lemonade` CLI matches. + +
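
For client code that calls the endpoints, the re-pointing step above can be kept in one place by deriving the base URL from the environment. The sketch below is only an illustration: `lemonade_base_url` is a hypothetical helper, and it assumes the `LEMONADE_HOST` / `LEMONADE_PORT` variables and the `localhost:13305` default named in this rule.

```python
import os


def lemonade_base_url(env=None):
    """Build the API prefix from LEMONADE_HOST / LEMONADE_PORT, defaulting to localhost:13305."""
    env = os.environ if env is None else env
    host = env.get("LEMONADE_HOST", "localhost")
    port = int(env.get("LEMONADE_PORT", "13305"))  # int() rejects a malformed port early
    return f"http://{host}:{port}/api/v1"


# Example: the image endpoint for whatever host the environment points at.
images_url = lemonade_base_url() + "/images/generations"
```

The rule text itself is static Markdown, so the prefix written in the rule block still has to be replaced by hand or by re-running the setup script.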