# autoresearchctl

Autonomous optimization loops for any codebase.
[Install](#install) · [Quickstart](#quickstart) · [Commands](#commands) · [How It Works](#how-it-works) · [Writing Criteria](#writing-criteria)
A CLI tool that generalizes Karpathy's autoresearch pattern beyond ML training. Define binary pass/fail criteria for any file, and let the optimization loop improve it automatically.
Works on documentation, system prompts, configs, accessibility — anything with a file you can mutate and a metric you can measure.
## Install

```bash
pip install git+https://github.com/dgallitelli/autoresearchctl.git
```

For LLM judge evals (requires boto3 for Bedrock):

```bash
pip install "autoresearchctl[llm] @ git+https://github.com/dgallitelli/autoresearchctl.git"
```

To contribute:

```bash
git clone https://github.com/dgallitelli/autoresearchctl.git
cd autoresearchctl
pip install -e ".[dev]"
```

## Quickstart

```bash
# 1. Initialize in your project
autoresearchctl init
# 2. Edit .autoresearch/config.yaml with your target and criteria
# 3. Measure current state
autoresearchctl eval
# 4. Run the optimization loop
autoresearchctl run --cycles 10
# 5. See what happened
autoresearchctl log
# 6. Compare before vs after
autoresearchctl diff 0 best
```

```yaml
# .autoresearch/config.yaml
target: "docs/*.md"
criteria:
  - name: title_length
    type: command
    check: 'grep -q "^title:" {file} && [ $(grep "^title:" {file} | head -1 | sed "s/^title: *//" | tr -d "\"" | wc -c) -lt 60 ]'
  - name: meta_description
    type: command
    check: 'grep -q "^description:" {file} && len=$(grep "^description:" {file} | head -1 | sed "s/^description: *//" | tr -d "\"" | wc -c) && [ $len -ge 120 ] && [ $len -le 160 ]'
  - name: single_h1
    type: command
    check: '[ $(grep -c "^# " {file}) -eq 1 ]'
  - name: internal_links
    type: command
    check: 'grep -qE "\[.*\]\([^)]*\.md\)" {file}'
max_cycles: 20
plateau_threshold: 3
mutator: rule-based
```

```text
$ autoresearchctl eval
File                 title_length  meta_description  single_h1  internal_links
------------------------------------------------------------------------------------------
api-reference.md     FAIL          FAIL              FAIL       FAIL
configuration.md     FAIL          FAIL              FAIL       FAIL
getting-started.md   PASS          PASS              PASS       PASS
------------------------------------------------------------------------------------------
Score: 4/12 (33%)
```

```text
$ autoresearchctl run --cycles 5
============================================================
autoresearchctl run
============================================================
BASELINE: 4/12 (33%)
--- Cycle 1 (operator: add_constraint) ---
IMPROVED: 6/12 (50%)
+ api-reference.md: added meta description (150 chars)
+ configuration.md: added meta description (145 chars)
--- Cycle 3 (operator: add_counterexample) ---
IMPROVED: 8/12 (67%)
+ api-reference.md: demoted 2 extra H1(s) to H2
+ configuration.md: demoted 1 extra H1(s) to H2
============================================================
Final score: 8/12
============================================================
```

## Commands

| Command | What it does |
|---|---|
| `autoresearchctl init` | Scaffold `.autoresearch/` directory and template config |
| `autoresearchctl eval` | Measure current state against all criteria (no mutation) |
| `autoresearchctl run` | Execute the optimization loop |
| `autoresearchctl log` | Show score trends and operator analysis |
| `autoresearchctl diff <r1> <r2>` | Compare artifacts between two runs |
| `autoresearchctl rollback <run>` | Restore target files to a previous run's state |
```bash
autoresearchctl run --cycles 10        # Override max cycles
autoresearchctl run -c path/to/config  # Custom config path
autoresearchctl eval -d /path/to/proj  # Different project directory
```
## How It Works

The optimization loop follows the autoresearch pattern (a runnable sketch follows the list):
1. Evaluate current state (score = count of passing criteria)
2. Select next mutation operator from rotation
3. Apply mutation to target files
4. Re-evaluate
5. If score improved: keep changes (snapshot)
   If score regressed: revert to pre-mutation state
6. Repeat until plateau, budget, or cycle limit
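
Reduced to a minimal runnable sketch (a single string instead of files, predicate functions instead of YAML criteria; illustrative, not the tool's actual source):

```python
# Minimal sketch of the loop's control flow. Real runs operate on files and
# shell/LLM evals; here text is a string, criteria are predicates, and each
# operator is a function from text to text.
def optimize(text, criteria, operators, max_cycles=20, plateau_threshold=3):
    def score(t):
        return sum(1 for check in criteria if check(t))  # count of passing criteria

    best_text, best_score, plateau = text, score(text), 0
    history = [best_text]                                # snapshots enable diff/rollback
    for cycle in range(max_cycles):
        operator = operators[cycle % len(operators)]     # fixed rotation, one per cycle
        candidate = operator(best_text)                  # always mutate from the best
        if score(candidate) > best_score:
            best_text, best_score, plateau = candidate, score(candidate), 0
            history.append(candidate)                    # improved: keep (snapshot)
        else:
            plateau += 1                                 # regressed: discard candidate
        if plateau >= plateau_threshold:
            break                                        # stop on plateau
    return best_text, best_score, history
```

Note that a failed candidate is simply discarded; the next cycle mutates from the best snapshot again, never from the failed attempt.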
Operators rotate in fixed order — one per cycle for clean attribution:
| Operator | Strategy |
|---|---|
| `add_constraint` | Add a specific rule or requirement |
| `add_negative_example` | Show what NOT to do |
| `add_counterexample` | Before/after pair (wrong → right) |
| `tighten_language` | Replace vague with precise |
| `remove_bloat` | Cut redundant content |
| `restructure` | Reorder, reformat, change flow |
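
For intuition, the "demoted extra H1(s) to H2" change from the run output above can be written as a small rule-based mutation. A sketch, not the tool's built-in operator:

```python
# Sketch of a rule-based mutation: keep the first H1, demote later H1s to H2
# so the single_h1 criterion passes. Illustrative only.
import re

def demote_extra_h1s(markdown: str) -> str:
    seen_h1 = False
    out = []
    for line in markdown.splitlines():
        if re.match(r"^# ", line):       # an H1 heading
            if seen_h1:
                line = "#" + line        # "# Title" -> "## Title"
            seen_h1 = True
        out.append(line)
    return "\n".join(out)
```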
## Writing Criteria

**Command evals** — Shell commands with exit code 0 = pass. Deterministic, free, fast.
```yaml
- name: single_h1
  type: command
  check: '[ $(grep -c "^# " {file}) -eq 1 ]'
```

**LLM judge evals** — For criteria requiring understanding (readability, tone). The judge sees only the output + criterion, ensuring evaluation isolation.
```yaml
- name: alt_text_quality
  type: llm_judge
  check: "Do ALL images have descriptive, specific alt text?"
```

Prefer command evals. Use LLM judges only when no shell command can express the criterion.
Good criteria are:
- Binary — pass or fail, no scales
- Independent — each criterion tests one thing
- Deterministic — same input → same result (especially command evals)
- Actionable — a failure suggests what to fix
The `{file}` placeholder in command evals is replaced with the absolute path to each target file.
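
Mechanically, a command eval is just that substitution plus an exit-code check. A sketch (hypothetical helper, not the tool's source):

```python
# Sketch: substitute {file} with the absolute path, run through the shell,
# treat exit code 0 as pass. Illustrative only.
import subprocess
from pathlib import Path

def run_command_eval(check: str, target: Path) -> bool:
    cmd = check.replace("{file}", str(target.resolve()))
    return subprocess.run(cmd, shell=True).returncode == 0

# e.g. run_command_eval('[ $(grep -c "^# " {file}) -eq 1 ]', Path("docs/api-reference.md"))
```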
All state lives in `.autoresearch/`:

```text
.autoresearch/
  config.yaml       # Your optimization config
  state.json        # Current loop state (best score, cycle count)
  results.jsonl     # Full history of every cycle
  history/
    run-0000/       # Baseline snapshot
    run-0001/       # Snapshot after cycle 1
    ...
```
Snapshots enable `diff` and `rollback`. The loop is resumable — run it again and it picks up where it left off.
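
Because the history is plain JSONL, it is easy to inspect outside the tool. A sketch assuming hypothetical "cycle" and "score" fields on each record (the actual schema may differ):

```python
# Sketch: print the score trend from results.jsonl. Field names are
# assumptions, not a documented schema.
import json
from pathlib import Path

for line in Path(".autoresearch/results.jsonl").read_text().splitlines():
    record = json.loads(line)
    print(record.get("cycle"), record.get("score"))
```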
These are baked into the tool's design:
- Mutate from the best, not the latest — Failed attempts aren't starting points
- One mutation per cycle — Clean attribution of what helped
- Eval in isolation — Judges see only output + criterion, no optimization context
- Files are truth, not memory — State re-read from disk every cycle
- Binary only — YES/NO. Scales drift. Binary gates hold
- Fixed validation set + coverage-first sampling — Apples-to-apples comparison
- Command evals over LLM evals — Deterministic, free, reproducible
MIT