
autoresearchctl

Autonomous optimization loops for any codebase.

Install · Quickstart · Commands · How It Works · Writing Criteria


A CLI tool that generalizes Karpathy's autoresearch pattern beyond ML training. Define binary pass/fail criteria for any file, and let the optimization loop improve it automatically.

Works on documentation, system prompts, configs, accessibility — anything with a file you can mutate and a metric you can measure.

Install

pip install git+https://github.com/dgallitelli/autoresearchctl.git

For LLM judge evals (requires boto3 for Bedrock):

pip install "autoresearchctl[llm] @ git+https://github.com/dgallitelli/autoresearchctl.git"

To contribute:

git clone https://github.com/dgallitelli/autoresearchctl.git
cd autoresearchctl
pip install -e ".[dev]"

Quickstart

# 1. Initialize in your project
autoresearchctl init

# 2. Edit .autoresearch/config.yaml with your target and criteria

# 3. Measure current state
autoresearchctl eval

# 4. Run the optimization loop
autoresearchctl run --cycles 10

# 5. See what happened
autoresearchctl log

# 6. Compare before vs after
autoresearchctl diff 0 best

Example: SEO-optimize documentation

# .autoresearch/config.yaml
target: "docs/*.md"

criteria:
  - name: title_length
    type: command
    check: 'grep -q "^title:" {file} && [ $(grep "^title:" {file} | head -1 | sed "s/^title: *//" | tr -d "\"" | wc -c) -lt 60 ]'

  - name: meta_description
    type: command
    check: 'grep -q "^description:" {file} && len=$(grep "^description:" {file} | head -1 | sed "s/^description: *//" | tr -d "\"" | wc -c) && [ $len -ge 120 ] && [ $len -le 160 ]'

  - name: single_h1
    type: command
    check: '[ $(grep -c "^# " {file}) -eq 1 ]'

  - name: internal_links
    type: command
    check: 'grep -qE "\[.*\]\([^)]*\.md\)" {file}'

max_cycles: 20
plateau_threshold: 3
mutator: rule-based

$ autoresearchctl eval
File                          title_length  meta_description     single_h1  internal_links
------------------------------------------------------------------------------------------
api-reference.md                      FAIL          FAIL          FAIL          FAIL
configuration.md                      FAIL          FAIL          FAIL          FAIL
getting-started.md                    PASS          PASS          PASS          PASS
------------------------------------------------------------------------------------------
Score: 4/12 (33%)

$ autoresearchctl run --cycles 5
============================================================
autoresearchctl run
============================================================
BASELINE: 4/12 (33%)

--- Cycle 1 (operator: add_constraint) ---
IMPROVED: 6/12 (50%)
  + api-reference.md: added meta description (150 chars)
  + configuration.md: added meta description (145 chars)

--- Cycle 3 (operator: add_counterexample) ---
IMPROVED: 8/12 (67%)
  + api-reference.md: demoted 2 extra H1(s) to H2
  + configuration.md: demoted 1 extra H1(s) to H2

============================================================
Final score: 8/12
============================================================

Commands

Command                          What it does
autoresearchctl init             Scaffold .autoresearch/ directory and template config
autoresearchctl eval             Measure current state against all criteria (no mutation)
autoresearchctl run              Execute the optimization loop
autoresearchctl log              Show score trends and operator analysis
autoresearchctl diff <r1> <r2>   Compare artifacts between two runs
autoresearchctl rollback <run>   Restore target files to a previous run's state

Common flags

autoresearchctl run --cycles 10        # Override max cycles
autoresearchctl run -c path/to/config  # Custom config path
autoresearchctl eval -d /path/to/proj  # Different project directory

How It Works

The optimization loop follows the autoresearch pattern (sketched in code below):

1. Evaluate current state (score = count of passing criteria)
2. Select next mutation operator from rotation
3. Apply mutation to target files
4. Re-evaluate
5. If score improved: keep changes (snapshot)
   If score regressed: revert to pre-mutation state
6. Repeat until plateau, budget, or cycle limit
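
In code, the control flow looks roughly like this. This is a minimal Python sketch: optimize, evaluate, mutate, snapshot, and restore are illustrative names, not the tool's actual internals.

# Sketch of the accept/revert loop. Callers supply the four callbacks;
# none of these names are the tool's real API.
OPERATORS = [
    "add_constraint", "add_negative_example", "add_counterexample",
    "tighten_language", "remove_bloat", "restructure",
]

def optimize(evaluate, mutate, snapshot, restore,
             max_cycles=20, plateau_threshold=3):
    best = evaluate()                # score = count of passing criteria
    snapshot(0)                      # baseline snapshot (run-0000)
    stale_cycles = 0
    for cycle in range(1, max_cycles + 1):
        operator = OPERATORS[(cycle - 1) % len(OPERATORS)]  # fixed rotation
        mutate(operator)             # exactly one mutation this cycle
        score = evaluate()           # state re-read from disk, not memory
        if score > best:
            best = score
            stale_cycles = 0
            snapshot(cycle)          # improved: keep the changes
        else:
            restore()                # did not improve: revert to best-known state
            stale_cycles += 1
        if stale_cycles >= plateau_threshold:
            break                    # plateau reached
    return best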

Six mutation operators

Operators rotate in fixed order — one per cycle for clean attribution:

Operator              Strategy
add_constraint        Add a specific rule or requirement
add_negative_example  Show what NOT to do
add_counterexample    Before/after pair (wrong → right)
tighten_language      Replace vague with precise
remove_bloat          Cut redundant content
restructure           Reorder, reformat, change flow
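
As a purely illustrative example, add_counterexample applied to a style-guide rule might expand it into a wrong/right pair:

Before:  Keep titles short.

After:   Keep titles short.
         Wrong: "A Complete and Comprehensive Guide to Every Configuration Option"
         Right: "Configuration Guide"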

Two eval types

Command evals — Shell commands with exit code 0 = pass. Deterministic, free, fast.

- name: single_h1
  type: command
  check: '[ $(grep -c "^# " {file}) -eq 1 ]'

LLM judge evals — For criteria requiring understanding (readability, tone). The judge sees only the output + criterion, ensuring evaluation isolation.

- name: alt_text_quality
  type: llm_judge
  check: "Do ALL images have descriptive, specific alt text?"

Prefer command evals. Use LLM judges only when no shell command can express the criterion.
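
For example, the presence of alt text (though not its quality) is expressible as a command eval. A sketch of a criterion that fails whenever a Markdown image has empty alt text:

- name: images_have_alt_text
  type: command
  check: '! grep -q "!\[\](" {file}'

Judging whether that alt text is actually descriptive still requires the LLM judge above.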

Writing Criteria

Good criteria are:

  • Binary — pass or fail, no scales
  • Independent — each criterion tests one thing
  • Deterministic — same input → same result (especially command evals)
  • Actionable — a failure suggests what to fix

The {file} placeholder in command evals is replaced with the absolute path to each target file.
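
To debug a check outside the loop, substitute a path by hand in any POSIX shell. For example, the single_h1 check from above:

file=$(pwd)/docs/getting-started.md
[ "$(grep -c '^# ' "$file")" -eq 1 ] && echo PASS || echo FAIL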

State Management

All state lives in .autoresearch/:

.autoresearch/
  config.yaml       # Your optimization config
  state.json        # Current loop state (best score, cycle count)
  results.jsonl     # Full history of every cycle
  history/
    run-0000/       # Baseline snapshot
    run-0001/       # Snapshot after cycle 1
    ...

Snapshots enable diff and rollback. The loop is resumable — run it again and it picks up where it left off.
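
Assuming snapshots store plain copies of the target files, they can also be inspected with standard tools:

diff -r .autoresearch/history/run-0000 .autoresearch/history/run-0001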

Seven Principles

These are baked into the tool's design:

  1. Mutate from the best, not the latest — Failed attempts aren't starting points
  2. One mutation per cycle — Clean attribution of what helped
  3. Eval in isolation — Judges see only output + criterion, no optimization context
  4. Files are truth, not memory — State re-read from disk every cycle
  5. Binary only — YES/NO. Scales drift. Binary gates hold
  6. Fixed validation set + coverage-first sampling — Apples-to-apples comparison
  7. Command evals over LLM evals — Deterministic, free, reproducible

License

MIT
