Skip to content

Latest commit

 

History

History
256 lines (194 loc) · 10.4 KB

File metadata and controls

256 lines (194 loc) · 10.4 KB
name local-ai-use
description Routes image generation, text-to-speech, and speech-to-text through a local Lemonade Server so the agent stops paying for cloud calls on those modalities. Use when the user wants to save tokens, save cost, or save money using local AI; default to local, offline, on-device, or private image generation, transcription, or text-to-speech in this workspace; stop using DALL-E, Whisper-as-a-service, ElevenLabs, or other paid multimodal APIs; route the agent's image, TTS, or STT tool calls to a local model; or mentions Lemonade Server, OmniRouter, SD-Turbo, kokoro, Whisper, Ryzen AI, NPU/iGPU/dGPU inference, or "use local for images but cloud for chat". Run once per workspace; the rule it installs handles every later request.

Local AI Use (route image, TTS, STT through Lemonade)

This is a meta-skill. You run it once. After that, every later request that needs image generation, text-to-speech, or speech-to-text uses the local Lemonade Server instead of a cloud API. The agent's own LLM keeps handling text; only the expensive multimodal calls move on-device.

The skill does two things:

  1. Verifies that local Lemonade is reachable and has the right models.
  2. Drops a Local AI Use block into the workspace AGENTS.md so the agent reads the routing rule on every later turn, in Cursor, Claude Code, Codex, Gemini CLI, and any other agent that respects AGENTS.md.

When to use this skill

Use this skill when all of the following are true:

  • The user has, or is willing to install, the system-wide Lemonade Server.
  • The user accepts the default Lemonade endpoint http://localhost:13305.
  • The user wants the change to be persistent across future turns and agent restarts (the rule is written to disk).

If the user is instead embedding Lemonade as a private subprocess inside an app installer, do not use this skill; use local-ai-app-integration instead.

Prerequisites

  • OS: Windows 11 x64, Ubuntu/Debian x64, or macOS (beta).
  • Lemonade Server CLI on PATH: verify with lemonade --version. If missing, install from https://lemonade-server.ai/install_options.html before continuing. Do not silently install on the user's machine; that is a system-wide change and must be the user's call.
  • Disk: ~8 GB free for the three default models (SD-Turbo + Whisper-Tiny
    • kokoro-v1).
  • Network: required for the first lemonade pull of each model. After that, every modality runs offline.

The opinionated path

Run this checklist top to bottom. Track progress against it; do not move on until each step verifies.

[ ] 1. Confirm Lemonade Server is installed and reachable
[ ] 2. Pull the three default modality models
[ ] 3. Install the routing rule into the workspace AGENTS.md
[ ] 4. Smoke-test image, TTS, and STT against the local endpoint

The single command that does steps 1, 2, and 3 in one shot is:

python scripts/setup_local_ai.py

(Run from this skill's folder.) The script is idempotent: re-running it on a fully configured workspace is a no-op apart from a healthcheck. Read the sections below for what to do when each step fails.


Step 1: confirm Lemonade Server is reachable

Run:

lemonade status --json

Two acceptable outcomes:

lemonade status says Action
Server is running on port 13305 Continue to Step 2.
Server is not running Start it. On Windows, launch the Lemonade Start Menu shortcut. On Linux, run sudo systemctl start lemonade-server. Re-check lemonade status.

If lemonade is not on PATH at all, the server is not installed. Stop and point the user at https://lemonade-server.ai/install_options.html. Do not attempt a silent install.

The rest of this skill assumes the endpoint is http://localhost:13305/api/v1 and no API key is required (the system-wide server defaults to no auth on loopback). If the user has set LEMONADE_API_KEY, the routing rule template in templates/local-ai-rule.md shows where to add the Authorization header.

Step 2: pull the three default modality models

Pull these three. They are the Lite Collection defaults from Lemonade OmniRouter, sized to keep token-and-cost savings real on commodity hardware:

Modality Model Size Why this default
Image generation SD-Turbo ~5 GB Single-step generation, runs on CPU and AMD iGPU/dGPU
Text-to-speech kokoro-v1 ~0.3 GB Only TTS model Lemonade currently supports; CPU-only, low latency
Speech-to-text Whisper-Tiny ~0.1 GB Smallest Whisper; fast on CPU. Upgrade to Whisper-Large-v3-Turbo if accuracy matters more than latency.
lemonade pull SD-Turbo
lemonade pull kokoro-v1
lemonade pull Whisper-Tiny

Each pull is idempotent. To verify what is already downloaded:

lemonade list --downloaded

For coverage of larger / higher-quality alternatives (SDXL-Turbo, Flux-2-Klein-4B, Whisper-Large-v3-Turbo), see the model picker in reference.md.

Step 3: install the routing rule into AGENTS.md

The rule is a Markdown block stored in templates/local-ai-rule.md. Append it to the workspace's AGENTS.md (create the file if missing). Both Cursor and Claude Code load AGENTS.md automatically on every turn, so the agent will see the rule on its next message without any further setup.

scripts/setup_local_ai.py does this for you, surrounded by stable markers so re-running the script replaces the block in place rather than appending a second copy. The markers look like:

<!-- BEGIN amd-skills:local-ai-use -->
...rule...
<!-- END amd-skills:local-ai-use -->

If you write the file by hand, keep those exact markers. The script relies on them for idempotent updates.

If the user's agent only respects a different convention, mirror the same block to:

  • CLAUDE.md (Claude Code, project-scoped) or ~/.claude/CLAUDE.md (global)
  • .cursor/rules/local-ai-use.mdc (Cursor user/project rules)
  • GEMINI.md (Gemini CLI)

The rule's content is identical; only the file location changes.

Step 4: smoke-test the three modalities

Verify each modality against the live server before declaring success. These mirror the inline patterns in the installed rule, so a green pass here means the rule will work.

Image generation (writes out.png):

curl -sX POST http://localhost:13305/api/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model":"SD-Turbo","prompt":"a single red apple on a white table","size":"512x512","steps":4,"response_format":"b64_json"}' \
  | python -c "import sys,json,base64; open('out.png','wb').write(base64.b64decode(json.load(sys.stdin)['data'][0]['b64_json']))"

Text-to-speech (writes out.mp3):

curl -sX POST http://localhost:13305/api/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"kokoro-v1","input":"Local AI is now active.","response_format":"mp3"}' \
  -o out.mp3

Speech-to-text (round-trips out.mp3 → text via a wav re-encode):

ffmpeg -y -i out.mp3 -ar 16000 -ac 1 out.wav
curl -sX POST http://localhost:13305/api/v1/audio/transcriptions \
  -F "file=@out.wav" -F "model=Whisper-Tiny"

If any of the three returns a non-2xx status, fix it now. The rule we just installed sends future requests to these same endpoints, so a broken endpoint becomes a broken user experience.


What changes after this skill runs

From the next turn onward, the agent reads the rule in AGENTS.md on every message. The rule explicitly tells the agent:

  • For image generation: call POST /api/v1/images/generations on the local server. Do not call any cloud image API and do not use the built-in GenerateImage tool (that path bills tokens to the cloud provider).
  • For text-to-speech: call POST /api/v1/audio/speech. Do not call cloud TTS providers (OpenAI TTS, ElevenLabs, etc.).
  • For speech-to-text: call POST /api/v1/audio/transcriptions. Do not call cloud transcription providers.
  • Fallback: only fall back to a cloud API after one local attempt has failed and the user has been told the local call failed. Never silently fall back; the whole point of this skill is to keep cost predictable.

The agent's own text reasoning continues to use whatever LLM Cursor / Claude Code / Codex is configured with. This skill does not redirect chat tokens; it only redirects the multimodal calls that would otherwise leave the machine.

Troubleshooting cheatsheet

Symptom Cause Recovery
lemonade: command not found Server CLI not installed Install from https://lemonade-server.ai/install_options.html; restart shell.
Server is not running Service stopped after install Windows: launch the Lemonade Start Menu shortcut. Linux: sudo systemctl start lemonade-server.
POST /v1/images/generations returns 404 model not found Image model not downloaded lemonade pull SD-Turbo and retry.
Image generation is slow on CPU (~4–5 min) sd-cpp on CPU backend Install the GPU backend on supported AMD hardware: lemonade backends install sd-cpp:rocm.
POST /v1/audio/transcriptions returns 400 unsupported format Input is not 16 kHz mono WAV Re-encode with ffmpeg -i in.* -ar 16000 -ac 1 out.wav.
POST /v1/audio/speech returns 404 TTS model not downloaded lemonade pull kokoro-v1.
401 Unauthorized on every request User has set LEMONADE_API_KEY Add Authorization: Bearer $LEMONADE_API_KEY to every request and to the rule block.

Verification checklist

Mark this skill complete only when all of the following are true:

  • lemonade status --json reports the server running on port 13305.
  • lemonade list --downloaded shows SD-Turbo, kokoro-v1, and Whisper-Tiny.
  • The workspace AGENTS.md contains the amd-skills:local-ai-use block.
  • All three smoke tests in Step 4 succeed.
  • On a follow-up turn, asking the agent to "generate an image of X" causes it to POST to http://localhost:13305/api/v1/images/generations rather than calling a cloud tool.

If any box is unchecked, the user is still paying cloud cost for at least one modality.


Reference

For the full model picker, alternate-quality options, the complete endpoint reference, the API-key flow, and the OmniRouter tool definitions you can hand to an agent's tool-calling loop, see reference.md.