RoboPhD evolves AI agents to improve task performance without human intervention or author-supplied domain knowledge. It implements a closed-loop evolution cycle in which an Evolution agent designs new versions of task agents based on performance feedback, using Elo-based competition to select the best agents across iterations.
RoboPhD was tested on four benchmarks with diverse task types — abstract reasoning, cloud scheduling, SQL generation, and financial document QA. All runs use a fixed budget of 1,500 evaluations. Scores show test-set performance; numbers in parentheses are agent lines of code.
| Benchmark | Seed | RoboPhD | GEPA | Autoresearch |
|---|---|---|---|---|
| ARC-AGI (%) | 27.8 (22) | 65.8 (1,013) | 58.5 (366) | 54.2 (304) |
| Can't Be Late | -96.5 (31) | -90.7 (148) | -89.3 (142) | -87.6 (87) |
| Text2SQL (BIRD) (%) | 52.2 (96) | 64.5 (602) | 60.4 (498) | 60.7 (265) |
| DocFinQA (%) | 17.7 (29) | 50.4 (825) | 40.0 (207) | 48.2 (198) |
Can't Be Late scores are negative costs (higher = better).
Using a single default configuration, RoboPhD outperforms both GEPA and Autoresearch on three of four benchmarks, losing only on Can't Be Late — the simplest task, where the winning solution required just 87 lines of code. On the three complex benchmarks, RoboPhD's multi-iteration Elo competition produces substantially larger agents (602–1,013 lines) that combine strategies discovered across many evolutionary cycles.
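Head-to-head selection of this kind typically uses the standard Elo update rule. The snippet below is a generic illustration only, not RoboPhD's exact implementation; the k-factor of 32 is an assumption:

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update for one head-to-head comparison.
    score_a is 1.0 if agent A wins, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two agents start at 1000; agent A wins one comparison.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)  # → (1016.0, 984.0)
```

Because the expected score depends on the rating gap, upsets move ratings more than expected wins, which is what lets a small sample of head-to-head problems produce a stable ranking.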
RoboPhD uses AI throughout:
- Task Execution: Solver agents execute domain tasks (SQL generation, puzzle solving, scheduling)
- Evolution: Claude Code agents evolve increasingly better task agents
- Infrastructure: The authors used Claude Code to build the RoboPhD system
```
┌─────────────────────────────────────────────────────────────┐
│                       ITERATION CYCLE                       │
│                                                             │
│   ┌──────────────────┐           ┌─────────────────────┐    │
│   │  EVOLUTION AI    │  Creates  │  AGENT ARTIFACTS    │    │
│   │  (Claude Code    │──────────▶│ (per file_mapping)  │    │
│   │   CLI session)   │           └──────────┬──────────┘    │
│   └──────────────────┘                      │               │
│            ▲                                ▼               │
│            │                    ┌─────────────────────┐    │
│   Performance                   │  EVALUATOR FN       │    │
│   data from                     │  Black-box scoring  │    │
│   prior iterations              │  (candidate,example)│    │
│            │                    │  → (score, diag)    │    │
│            │                    └──────────┬──────────┘    │
│            │                               │               │
│            │                               ▼               │
│   ┌────────┴─────────┐          ┌────────────────────┐     │
│   │  AGENT RANKINGS  │◀─────────│  ELO COMPETITION   │     │
│   │  Top agents      │          │  Head-to-head on   │     │
│   │  inform next     │          │  sampled problems  │     │
│   │  evolution round │          └────────────────────┘     │
│   └──────────────────┘                                     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
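The cycle above can be sketched as a small loop. This is a minimal sketch with hypothetical stand-ins: evolve represents the Claude Code evolution session and evaluate the black-box evaluator; the real system ranks agents via Elo head-to-head competition rather than a single scalar score:

```python
def evolution_loop(seed, evaluate, evolve, num_iterations=5, keep_top=4):
    """Closed-loop evolution sketch: each round the evolution step sees
    the current rankings, proposes a new agent, and the population is
    re-ranked so top agents inform the next round."""
    population = [seed]
    rankings = [seed]
    for _ in range(num_iterations):
        candidate = evolve(population, rankings)   # Evolution AI creates artifacts
        population.append(candidate)
        rankings = sorted(population, key=evaluate, reverse=True)  # score + rank
        population = rankings[:keep_top]           # top agents carry forward
    return rankings[0]
```

With toy stand-ins (integer agents, evolve proposing a slight improvement, evaluate as identity), the loop returns the best agent discovered across all iterations.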
| Domain | Benchmark | What Evolves | Solver Model |
|---|---|---|---|
| ARC-AGI | ARC-AGI (HuggingFace) | agent.py — Python solver with solve() | Gemini 3.1 Flash Lite (via OpenRouter) |
| Can't Be Late | AWS spot traces (NSDI'24) | agent.py — scheduling strategy class | Pure algorithmic (no LLM) |
| Text2SQL | BIRD | agent.py + analyze_db.py — SQL generation with llm() + test_sql() | Claude Haiku 4.5 |
| DocFinQA | DocFinQA (ACL 2024) | agent.py — retrieval + QA pipeline | GPT-4.1-mini + text-embedding-3-small |
Each domain has a self-contained example under examples/ with evaluator, seed agent, and documentation. More examples coming soon.
```bash
# 1. Clone and install
git clone https://github.com/andborth/RoboPhD.git
cd RoboPhD
pip install -r requirements.txt
pip install -r requirements-gepa.txt  # adds dspy, datasets

# 2. Install Claude Code CLI (required for evolution)
# See: https://docs.anthropic.com/en/docs/claude-code

# 3. Set API keys
export ANTHROPIC_API_KEY_FOR_ROBOPHD="your_key"  # for evolution (Claude Code)
export OPENAI_API_KEY="sk-..."                   # for DocFinQA (gpt-4.1-mini + embeddings)
export OPENROUTER_API_KEY="sk-or-..."            # for ARC-AGI (Gemini via OpenRouter)
# Recommended: link your Google API key at https://openrouter.ai/settings/integrations
# to get your own Gemini rate limits (otherwise you share limits with all OpenRouter users)

# 4. DocFinQA — financial document QA (easiest to start with)
python examples/docfinqa/main.py --num-iterations 2

# 5. ARC-AGI-1 — abstract reasoning (requires OpenRouter key above)
python examples/arc_agi_1/main.py --num-iterations 2

# 6. Can't Be Late — cloud scheduling (no solver API key needed)
bash examples/cant_be_late/download_traces.sh
python examples/cant_be_late/main.py --num-iterations 2

# 7. Text2SQL — SQL generation from natural language (BIRD benchmark)
bash benchmark_resources/download_bird.sh
python examples/text2sql/main.py --num-iterations 2
```

Use optimize_anything() to evolve any text artifact with your own evaluator:
```python
from RoboPhD import optimize_anything, RoboPhDConfig

def evaluator(candidate, example, *, problem_dir=None):
    prompt = candidate["system_prompt"]
    # Call your LLM, run your code, score the result...
    score = 1.0 if correct else 0.0
    return score, {
        "score": score,
        "predicted_answer": predicted,
        # String values are written as files for the evolution AI to read
        "question.md": example["question"],
        "response.md": response_text,
    }

result = optimize_anything(
    evaluator=evaluator,
    dataset=[{"id": "1", "question": "...", "answer": "..."}],
    seed_candidate={"system_prompt": "Your initial prompt here"},
    objective="Maximize accuracy on my task",
    config=RoboPhDConfig(num_iterations=5, evaluation_budget=200),
)
print(result.best_candidate["system_prompt"])
print(f"Best Elo: {result.best_score}")
```

Resume & extend — result.experiment_dir points to the checkpoint directory, so you can always resume:
```python
# Resume from where it left off
result = optimize_anything(
    evaluator=evaluator, dataset=my_dataset, objective="Maximize accuracy",
    config=RoboPhDConfig(experiment_dir=result.experiment_dir),
)

# Extend by 5 more iterations
result = optimize_anything(
    evaluator=evaluator, dataset=my_dataset, objective="Maximize accuracy",
    config=RoboPhDConfig(experiment_dir=result.experiment_dir, extend_iterations=5),
)
```

Note: seed_candidate is only needed for the initial run — on resume, the file mapping is recovered from the checkpoint. evaluator and dataset are always required (they can't be serialized).
Evaluating candidates — use eval_candidate() to evaluate any candidate on a dataset:

```python
from RoboPhD import eval_candidate, RoboPhDEvalConfig

eval_result = eval_candidate(
    evaluator=evaluator,
    dataset=test_dataset,
    candidate=result.best_candidate,
    config=RoboPhDEvalConfig(test_repeats=3, max_workers=8),
)
print(f"Accuracy: {eval_result.mean_score:.1%} ({eval_result.num_examples} examples)")
```

See RoboPhD/api.py for the full API reference.
Built-in strategies in RoboPhD/evolution_strategies/:
- use_your_judgment — Open-ended: study agents, data, and failure patterns (default)
- data_focus — Data-first: explore problem-level outputs before studying agent code
- refinement — Iteratively improve a single base agent
- cross_pollination — Combine patterns from multiple successful agents
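If you want to pin one of these strategies for a run, something like the following should work; note that evolution_strategy is a hypothetical parameter name used for illustration — check RoboPhD/api.py for the actual field:

```python
from RoboPhD import RoboPhDConfig

# Hypothetical sketch: the strategy-selection field name is an assumption,
# not confirmed against the actual RoboPhDConfig definition.
config = RoboPhDConfig(
    num_iterations=5,
    evaluation_budget=200,
    evolution_strategy="data_focus",  # assumed field name; see RoboPhD/api.py
)
```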
```bash
# Full run with test-set evaluation
python examples/arc_agi_1/main.py --eval-test-set

# Use paper configuration (stronger model, higher cost budget)
python examples/arc_agi_1/main.py --paper-config

# Custom engine config
python examples/arc_agi_1/main.py --engine-config '{"include_evolution_rankings": false}'

# Resume a run
python examples/arc_agi_1/main.py --resume ../robophd_runs/robophd/optimize_anything_20260401_120000

# Extend by 5 more iterations
python examples/arc_agi_1/main.py --resume <dir> --extend 5
```

Multi-engine support: GEPA and Autoresearch engine selection via config is coming soon. In the meantime, these engines are available via scripts/run_gepa.py and scripts/run_autoresearch.py.
- Python 3.10+
- Claude Code CLI (required for evolution)
- pip install -r requirements-gepa.txt (for ARC-AGI dataset loading)
- For ARC-AGI: OPENROUTER_API_KEY environment variable
- For Text2SQL: ANTHROPIC_API_KEY_FOR_ROBOPHD + ~50GB for BIRD dataset
- For DocFinQA: OpenAI API key (for gpt-4.1-mini and embeddings)
- For Can't Be Late: trace data via bash scripts/download_cant_be_late_traces.sh
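To check which of the keys above are set before launching a run, a small bash helper (not part of RoboPhD) can flag missing variables; it relies on bash's ${!var} indirect expansion:

```bash
check_keys() {
  # Print every variable from the argument list that is unset or empty.
  local var missing=0
  for var in "$@"; do
    if [ -z "${!var}" ]; then
      echo "missing: $var"
      missing=1
    fi
  done
  return $missing
}

check_keys ANTHROPIC_API_KEY_FOR_ROBOPHD OPENAI_API_KEY OPENROUTER_API_KEY \
  || echo "set the missing keys before running the corresponding examples"
```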
RoboPhD builds on several excellent open-source projects and benchmarks:
- GEPA (Agrawal et al., 2025) — reflective text evolution with Pareto selection
- Autoresearch (Karpathy, 2026) — single-session greedy experimentation
- ARC Prize / ARC-AGI (Chollet, 2019) — abstract reasoning benchmark
- BIRD (Li et al., 2024) — Text-to-SQL benchmark
- DocFinQA (Reddy et al., 2024) — long-context financial QA benchmark
- Can't Be Late (Wu et al., 2024) — cloud spot instance scheduling
If you use RoboPhD in your research, please cite:
```bibtex
@article{borthwick2026robophd,
  title={RoboPhD: Evolving Diverse Complex Agents Under Tight Evaluation Budgets},
  author={Borthwick, Andrew and Ash, Stephen and Galczak, Anthony},
  journal={arXiv preprint arXiv:2604.04347},
  year={2026}
}
```