Automatically optimize your agent skills using a multi-agent system built with Google ADK (Agent Development Kit) and Gemini 3. Upload a skill, let the agents generate test scenarios and evaluation criteria, then watch as three specialized ADK agents collaborate to improve your skill through iterative optimization.
This app implements an automated skill improvement loop inspired by Karpathy's autoresearch methodology, powered by a team of ADK agents:
- Upload: Drop in your skill folder (following the agentskills.io spec)
- Configure: AI generates test scenarios and evaluation criteria. Edit, add, or regenerate as needed
- Optimize: Three ADK agents collaborate — one executes, one diagnoses failures, one applies fixes
- Results: Download your improved skill with a detailed changelog
| Agent | Role | What It Does |
|---|---|---|
| Executor | Skill Runner | Faithfully executes the skill against test scenarios, producing outputs exactly as the skill instructs |
| Analyst | Failure Diagnostician | Examines failed evaluations, identifies root causes, and recommends a specific mutation strategy |
| Mutator | Prompt Editor | Makes exactly ONE targeted change to the skill prompt based on the analyst's diagnosis |
- The Executor agent runs the skill against all test scenarios
- Outputs are scored using binary yes/no evaluation criteria
- The Analyst agent diagnoses failure patterns and picks a strategy (add_example, add_constraint, restructure, or add_edge_case)
- The Mutator agent applies ONE surgical fix to the skill prompt
- The modified skill is re-tested
- Changes are kept if the score improves, reverted if not
- Repeats until the target pass rate is reached or max rounds hit
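The loop above can be sketched in plain Python. Note that `run_skill`, `score_outputs`, `diagnose`, and `mutate` are hypothetical stand-ins for the Executor, the evaluator, the Analyst, and the Mutator, not the real ADK interfaces:

```python
# Sketch of the keep-if-improved loop. All four callables are hypothetical
# stand-ins for the agents described above, not the real ADK interfaces.
def optimize(skill, scenarios, criteria, run_skill, score_outputs,
             diagnose, mutate, target=1.0, max_rounds=20):
    best = score_outputs(run_skill(skill, scenarios), criteria)
    for _ in range(max_rounds):
        if best >= target:
            break
        strategy = diagnose(skill, scenarios)   # e.g. "add_constraint"
        candidate = mutate(skill, strategy)     # exactly ONE targeted change
        score = score_outputs(run_skill(candidate, scenarios), criteria)
        if score > best:
            skill, best = candidate, score      # keep the improvement
        # otherwise the candidate is discarded (reverted)
    return skill, best
```

The greedy accept/revert rule means a round can never make the score worse, at the cost of possibly missing multi-step improvements.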
self-improving-agent-skills/
├── backend/ # FastAPI server + ADK optimization engine
│ ├── app.py # REST API endpoints + SSE streaming
│ ├── adk_optimizer.py # Multi-agent optimizer (Executor, Analyst, Mutator)
│ ├── requirements.txt
│ └── optimizer.py # Legacy single-model optimizer (unused)
├── frontend/ # Next.js + React + Tailwind
│ ├── src/
│ │ ├── app/ # Main page + layout
│ │ └── components/ # Upload, Config, Running, Results steps
│ ├── package.json
│ └── *.config.ts
├── example_skills/ # Sample skills to test
│ ├── code-reviewer/
│ └── content-writer/
└── README.md
- Backend: Python 3.10+, FastAPI, Google ADK, google-genai SDK
- Frontend: Next.js 15, React 19, Tailwind CSS v4, Recharts
- AI: Google ADK multi-agent system with Gemini 3 Flash for execution, analysis, and mutation
- Real-time: Server-Sent Events (SSE) for live optimization progress
cd backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up environment (optional, can also pass via header)
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
# Run server
python app.py
# Server runs on http://localhost:8891
cd frontend
# Install dependencies
npm install
# Run development server
npm run dev
# App runs on http://localhost:3000
- Get a Gemini API key from Google AI Studio
- Open http://localhost:3000
- Upload a skill folder as a .zip file (or try an example)
- Enter your Gemini API key
- Review and edit the generated test scenarios and evaluation criteria
- Click "Start Optimization" and watch the agents collaborate to improve your skill
- Download your improved skill when complete
Skills follow the agentskills.io specification:
my-skill/
├── SKILL.md # Required: YAML frontmatter + instructions
├── scripts/ # Optional: executable code
├── references/ # Optional: additional docs
└── assets/ # Optional: templates, resources
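As a sketch, a minimal check that an uploaded folder matches this layout could look like the following. The validation rules here are an assumption for illustration, not part of the spec:

```python
# Hypothetical validator: checks that a skill folder contains SKILL.md and
# that the file opens with a YAML frontmatter block. Illustrative only.
import pathlib

def has_frontmatter(text: str) -> bool:
    """True if the text starts with a '---' ... '---' YAML frontmatter block."""
    return text.startswith("---\n") and "\n---" in text[4:]

def validate_skill(folder) -> bool:
    skill_md = pathlib.Path(folder) / "SKILL.md"
    return skill_md.is_file() and has_frontmatter(skill_md.read_text())
```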
Example SKILL.md:
---
name: my-skill
description: What this skill does and when to use it
license: MIT
metadata:
author: your-name
version: "1.0"
---
# My Skill
Your skill instructions here...
Two example skills are included:
- code-reviewer: Reviews code for security, performance, and best practices
- content-writer: Writes marketing copy following style guidelines
Create a zip file from an example:
cd example_skills
zip -r code-reviewer.zip code-reviewer/
Then upload the zip in the app.
Gemini analyzes your skill and generates:
- 3-4 diverse test scenarios
- 4-6 binary evaluation criteria (yes/no questions)
The Executor agent runs the skill against all scenarios. Each output is scored against all evaluation criteria. This establishes the starting score.
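With binary criteria, the score reduces to a pass rate over every (output, criterion) pair. In this sketch, `check` stands in for the LLM's yes/no judgment and is an assumption, not the real API:

```python
# Pass-rate scoring sketch: `check(output, criterion)` stands in for the
# LLM's binary yes/no judgment (an assumption, not the app's real interface).
def pass_rate(outputs, criteria, check) -> float:
    verdicts = [check(out, crit) for out in outputs for crit in criteria]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```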
For each round, the three agents collaborate:
- Executor runs the skill against all test scenarios
- Outputs are scored against evaluation criteria
- Analyst examines failures, identifies root cause, and selects a mutation strategy
- Mutator applies ONE specific change to improve the skill
- Executor re-runs the modified skill
- Score is compared — keep the change if improved, revert if not
- Repeat until target pass rate or max rounds reached
- Improved SKILL.md with all successful changes applied
- Detailed changelog of what changed and why
- Performance comparison (baseline vs final)
- POST /api/upload - Upload skill zip file
- POST /api/upload-files - Upload multiple files (folder upload)
- POST /api/analyze - Generate scenarios and evals (requires Gemini API key)
- POST /api/regenerate - Regenerate scenarios and evals
- POST /api/update-config - Save user's selected/edited config
- POST /api/start/{session_id} - Start optimization
- GET /api/stream/{session_id} - SSE stream of optimization progress
- POST /api/stop/{session_id} - Stop optimization
- GET /api/download/{session_id} - Download improved skill
- GET /api/examples - List available example skills
- POST /api/examples/{name}/load - Load an example skill
- GET /api/status/{session_id} - Poll-based status endpoint
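A minimal client sketch against these endpoints, assuming the paths above; the API-key header name is a guess, and the SSE parsing follows the standard `data: ...` line format:

```python
# Hypothetical client for the endpoints above: start an optimization, then
# follow its SSE progress stream. The API-key header name is an assumption.
import json
import urllib.request

BASE = "http://localhost:8891"

def sse_events(lines):
    """Yield JSON payloads from an iterable of raw SSE byte lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

def watch_optimization(session_id: str, api_key: str):
    req = urllib.request.Request(
        f"{BASE}/api/start/{session_id}",
        method="POST",
        headers={"X-Gemini-Api-Key": api_key},  # header name is a guess
    )
    urllib.request.urlopen(req)
    with urllib.request.urlopen(f"{BASE}/api/stream/{session_id}") as stream:
        for event in sse_events(stream):
            print(event)
```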
Set GEMINI_API_KEY in .env or pass via request header. Server runs on port 8891.
API key is stored in component state (not persisted) and sent with each request.
In RunningStep.tsx, adjust max_rounds:
body: JSON.stringify({
max_rounds: 20, // Default: 20
}),
In adk_optimizer.py, adjust the model:
def __init__(self, api_key: str, model: str = "gemini-3-flash-preview"):
cd backend
python -c "from adk_optimizer import SkillOptimizer; print('OK')"
cd frontend
npm run build
Both servers support hot reload. Edit code and see changes immediately.
This tool applies Andrej Karpathy's autoresearch methodology (using LLMs to iteratively improve their own prompts) to agent skills. The key insight: rather than manually tweaking prompts, define success criteria and let the AI optimize itself — now powered by a team of specialized ADK agents.
Original concept: https://github.com/karpathy/autoresearch