# autoresearchctl

Autonomous optimization loops for any codebase.
[Install](#install) · [Quickstart](#quickstart) · [Commands](#commands) · [How It Works](#how-it-works) · [Writing Criteria](#writing-criteria)
A CLI tool that generalizes Karpathy's autoresearch pattern beyond ML training. Define binary pass/fail criteria for any file, and let the optimization loop improve it automatically.
Works on documentation, system prompts, configs, accessibility — anything with a file you can mutate and a metric you can measure.
## Install

```bash
pip install git+https://github.com/dgallitelli/autoresearchctl.git
```

For LLM judge evals (requires boto3 for Bedrock):

```bash
pip install "autoresearchctl[llm] @ git+https://github.com/dgallitelli/autoresearchctl.git"
```

To contribute:

```bash
git clone https://github.com/dgallitelli/autoresearchctl.git
cd autoresearchctl
pip install -e ".[dev]"
```

## Quickstart

```bash
# 1. Initialize in your project
autoresearchctl init
# 2. Edit .autoresearch/config.yaml with your target and criteria
# 3. Measure current state
autoresearchctl eval
# 4. Run the optimization loop
autoresearchctl run --cycles 10
# 5. See what happened
autoresearchctl log
# 6. Compare before vs after
autoresearchctl diff 0 best
```

```yaml
# .autoresearch/config.yaml
target: "docs/*.md"
criteria:
  - name: title_length
    type: command
    check: 'grep -q "^title:" {file} && [ $(grep "^title:" {file} | head -1 | sed "s/^title: *//" | tr -d "\"" | wc -c) -lt 60 ]'
  - name: meta_description
    type: command
    check: 'grep -q "^description:" {file} && len=$(grep "^description:" {file} | head -1 | sed "s/^description: *//" | tr -d "\"" | wc -c) && [ $len -ge 120 ] && [ $len -le 160 ]'
  - name: single_h1
    type: command
    check: '[ $(grep -c "^# " {file}) -eq 1 ]'
  - name: internal_links
    type: command
    check: 'grep -qE "\[.*\]\([^)]*\.md\)" {file}'
max_cycles: 20
plateau_threshold: 3
mutator: rule-based
```

```text
$ autoresearchctl eval
File                 title_length  meta_description  single_h1  internal_links
------------------------------------------------------------------------------------------
api-reference.md     FAIL          FAIL              FAIL       FAIL
configuration.md     FAIL          FAIL              FAIL       FAIL
getting-started.md   PASS          PASS              PASS       PASS
------------------------------------------------------------------------------------------
Score: 4/12 (33%)
```

```text
$ autoresearchctl run --cycles 5
============================================================
autoresearchctl run
============================================================
BASELINE: 4/12 (33%)
--- Cycle 1 (operator: add_constraint) ---
IMPROVED: 6/12 (50%)
+ api-reference.md: added meta description (150 chars)
+ configuration.md: added meta description (145 chars)
--- Cycle 3 (operator: add_counterexample) ---
IMPROVED: 8/12 (67%)
+ api-reference.md: demoted 2 extra H1(s) to H2
+ configuration.md: demoted 1 extra H1(s) to H2
============================================================
Final score: 8/12
============================================================
```

## Commands

| Command | What it does |
|---|---|
| `autoresearchctl init` | Scaffold `.autoresearch/` directory and template config |
| `autoresearchctl eval` | Measure current state against all criteria (no mutation) |
| `autoresearchctl run` | Execute the optimization loop |
| `autoresearchctl log` | Show score trends and operator analysis |
| `autoresearchctl diff <r1> <r2>` | Compare artifacts between two runs |
| `autoresearchctl rollback <run>` | Restore target files to a previous run's state |
```bash
autoresearchctl run --cycles 10        # Override max cycles
autoresearchctl run -c path/to/config  # Custom config path
autoresearchctl eval -d /path/to/proj  # Different project directory
```
## How It Works

The optimization loop follows the autoresearch pattern (a runnable sketch follows the list):
1. Evaluate current state (score = count of passing criteria)
2. Select next mutation operator from rotation
3. Apply mutation to target files
4. Re-evaluate
5. If score improved: keep changes (snapshot)
   If score regressed: revert to pre-mutation state
6. Repeat until plateau, budget, or cycle limit
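
Reduced to a minimal runnable sketch (a single string instead of files, predicate functions instead of YAML criteria; illustrative, not the tool's actual source):

```python
# Minimal sketch of the loop's control flow. Real runs operate on files and
# shell/LLM evals; here text is a string, criteria are predicates, and each
# operator is a function from text to text.
def optimize(text, criteria, operators, max_cycles=20, plateau_threshold=3):
    def score(t):
        return sum(1 for check in criteria if check(t))  # count of passing criteria

    best_text, best_score, plateau = text, score(text), 0
    history = [best_text]                                # snapshots enable diff/rollback
    for cycle in range(max_cycles):
        operator = operators[cycle % len(operators)]     # fixed rotation, one per cycle
        candidate = operator(best_text)                  # always mutate from the best
        if score(candidate) > best_score:
            best_text, best_score, plateau = candidate, score(candidate), 0
            history.append(candidate)                    # improved: keep (snapshot)
        else:
            plateau += 1                                 # regressed: discard candidate
        if plateau >= plateau_threshold:
            break                                        # stop on plateau
    return best_text, best_score, history
```

Note that a failed candidate is simply discarded; the next cycle mutates from the best snapshot again, never from the failed attempt.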
Operators rotate in fixed order — one per cycle for clean attribution:
| Operator | Strategy |
|---|---|
| `add_constraint` | Add a specific rule or requirement |
| `add_negative_example` | Show what NOT to do |
| `add_counterexample` | Before/after pair (wrong → right) |
| `tighten_language` | Replace vague with precise |
| `remove_bloat` | Cut redundant content |
| `restructure` | Reorder, reformat, change flow |
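
For intuition, the "demoted extra H1(s) to H2" change from the run output above can be written as a small rule-based mutation. A sketch, not the tool's built-in operator:

```python
# Sketch of a rule-based mutation: keep the first H1, demote later H1s to H2
# so the single_h1 criterion passes. Illustrative only.
import re

def demote_extra_h1s(markdown: str) -> str:
    seen_h1 = False
    out = []
    for line in markdown.splitlines():
        if re.match(r"^# ", line):       # an H1 heading
            if seen_h1:
                line = "#" + line        # "# Title" -> "## Title"
            seen_h1 = True
        out.append(line)
    return "\n".join(out)
```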
## Writing Criteria

**Command evals** — Shell commands with exit code 0 = pass. Deterministic, free, fast.
```yaml
- name: single_h1
  type: command
  check: '[ $(grep -c "^# " {file}) -eq 1 ]'
```

**LLM judge evals** — For criteria requiring understanding (readability, tone). The judge sees only the output + criterion, ensuring evaluation isolation.
```yaml
- name: alt_text_quality
  type: llm_judge
  check: "Do ALL images have descriptive, specific alt text?"
```

Prefer command evals. Use LLM judges only when no shell command can express the criterion.
Good criteria are:
- Binary — pass or fail, no scales
- Independent — each criterion tests one thing
- Deterministic — same input → same result (especially command evals)
- Actionable — a failure suggests what to fix
The `{file}` placeholder in command evals is replaced with the absolute path to each target file.
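
Mechanically, a command eval is just that substitution plus an exit-code check. A sketch (hypothetical helper, not the tool's source):

```python
# Sketch: substitute {file} with the absolute path, run through the shell,
# treat exit code 0 as pass. Illustrative only.
import subprocess
from pathlib import Path

def run_command_eval(check: str, target: Path) -> bool:
    cmd = check.replace("{file}", str(target.resolve()))
    return subprocess.run(cmd, shell=True).returncode == 0

# e.g. run_command_eval('[ $(grep -c "^# " {file}) -eq 1 ]', Path("docs/api-reference.md"))
```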
All state lives in `.autoresearch/`:

```text
.autoresearch/
  config.yaml       # Your optimization config
  state.json        # Current loop state (best score, cycle count)
  results.jsonl     # Full history of every cycle
  history/
    run-0000/       # Baseline snapshot
    run-0001/       # Snapshot after cycle 1
    ...
```
Snapshots enable `diff` and `rollback`. The loop is resumable — run it again and it picks up where it left off.
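
Because the history is plain JSONL, it is easy to inspect outside the tool. A sketch assuming hypothetical "cycle" and "score" fields on each record (the actual schema may differ):

```python
# Sketch: print the score trend from results.jsonl. Field names are
# assumptions, not a documented schema.
import json
from pathlib import Path

for line in Path(".autoresearch/results.jsonl").read_text().splitlines():
    record = json.loads(line)
    print(record.get("cycle"), record.get("score"))
```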
These are baked into the tool's design:
- Mutate from the best, not the latest — Failed attempts aren't starting points
- One mutation per cycle — Clean attribution of what helped
- Eval in isolation — Judges see only output + criterion, no optimization context
- Files are truth, not memory — State re-read from disk every cycle
- Binary only — YES/NO. Scales drift. Binary gates hold
- Fixed validation set + coverage-first sampling — Apples-to-apples comparison
- Command evals over LLM evals — Deterministic, free, reproducible
MIT