claudlos/proof-of-autoresearch

Proof of Autoresearch

License: MIT | Status: Proof of Concept | Python 3.10+ | Rust

Proof of Autoresearch is a proof-of-concept for verified AI research mining.

The project explores a simple question:

Can a decentralized network reward people for discovering real improvements to AI training, while forcing every claim to be replayable and public?

This repository combines LLM-driven autoresearch, GPU experiment execution, content-addressed evidence bundles, statistical replay verification, and Bittensor-style miner/validator incentives. It is not a live subnet, not a mainnet-ready system, and not a profitability claim. It is a working research prototype intended to make the idea concrete enough for builders to inspect, run locally, critique, and extend.

Origin of the Idea

Modern AI research has a hard coordination problem. Useful training discoveries are expensive to find, easy to lose, and often hard to verify after the fact. Labs run thousands of private experiments, publish only a small fraction of the results, and leave the rest as invisible institutional memory. Independent researchers can have strong ideas but limited access to compute, incentives, and credible ways to prove that their results are real.

The immediate origin of this project is Karpathy's discussion of autoresearch in the No Priors interview on code agents and the "loopy era" of AI. His point was not just that an LLM can write code. It was that parts of research can be refactored so the human is no longer the step-by-step bottleneck: define an objective, define the metric, define the boundaries of what can change, then let workers try ideas.

In that framing, autoresearch starts to look like a research operating system. Ideas enter a queue. Workers pull an idea, turn it into a code change, run the experiment, read the metric, and keep only what improves the result. The output is not a paper or a chat transcript. The output is a sequence of commits that change training code and can be checked.

Karpathy then pushed the idea further: what if the workers are not trusted? What if they are just a large pool of people or agents on the internet contributing compute? In the transcript, he describes a system where "instead of blocks, you have commits" and where the hard work is searching through many experiments to find commits that actually improve the model. The missing piece is verification: an untrusted pool can search, but a trusted verification pool has to replay candidate commits and decide whether they are real.

Wei Dai connected that framing to proof-of-useful-work: can proof-of-work be built on top of autoresearch? It is plausible because autoresearch has the same asymmetry that proof-of-work relies on:

Discovery is expensive. Verification is cheaper once a candidate exists.

A miner may spend many GPU hours trying failed ideas. But after a promising diff exists, a validator can replay the same experiment under controlled conditions and check whether the metric really improved.

Proof of Autoresearch is an attempt to flesh out that missing system. What is the claim object? What evidence has to be attached? How do commits build on each other? How does a verifier replay work safely? How should statistical noise be handled? How could the results become a public archive instead of a private experiment log?

AutoResearch-Chain and Bittensor are possible infrastructure for exploring those questions. ARC supplies a verification vocabulary: frozen surfaces, searchable surfaces, evidence hashes, commit-reveal, and replayable claims. Bittensor supplies a miner/validator network shape that can score useful work. This repo is the experiment of making that idea concrete enough to inspect.

The Core Thesis

Most proof-of-work systems spend compute to produce scarce but otherwise meaningless artifacts. AI research already spends compute to search for useful artifacts: better training recipes, optimizer settings, architecture tweaks, data handling changes, and reproducible metric improvements.

The thesis is that a network can treat a verified training improvement as the unit of useful work.

| Role | What they do | What must be proven |
| --- | --- | --- |
| Miner | Searches for a code diff that improves a frozen training task | The diff, config, logs, environment, and metrics are all captured |
| Validator | Replays the submitted experiment independently | The claimed improvement survives replay and statistical tests |
| Network | Rewards verified improvements | Weights are based on replayed evidence, not trust |
| Public | Receives the output | A reusable archive of verified research artifacts |

If this works at scale, the output is not just emissions or leaderboard points. The output is an open corpus of training changes with evidence attached.

What This Repository Is

This repo is a public proof-of-concept for the protocol and subnet shape. The goal is to show the full logical path:

  1. Define a training domain with frozen files, searchable files, a metric, and a baseline.
  2. Let a miner propose and run a candidate training diff.
  3. Bundle the evidence as content-addressed artifacts.
  4. Verify a domain-bound commit-reveal proposal.
  5. Replay the claimed improvement.
  6. Score the result with statistical gates.
  7. Feed the verified score into validator weights.

The mocked smoke test exercises that whole path locally without GPU, IPFS, Docker, Bittensor, paid model APIs, or cloud compute providers. The production-facing code paths exist, but real multi-machine and testnet validation remain future work.

What Has Been Built So Far

| Area | Current state |
| --- | --- |
| Protocol model | ResearchProposal synapse with 16 fields, evidence hashes, replay fields, and domain-bound commit-reveal |
| Evidence model | Five BLAKE3 artifact hashes: diff, config, environment manifest, training log, and metrics output |
| Miner logic | LLM diff loop, training execution adapters, evidence bundling, proposal construction, and local/remote execution paths |
| Validator logic | Commit-reveal checks, evidence validation, replay hooks, score computation, and weight-update path |
| Reward model | 8-stage scoring pipeline with replay delta, minimum-delta gate, paired bootstrap p-value, tolerance check, transfer bonus, ancestry decay, and clamping |
| Rust ARC CLI | SQLite-backed local store, domain/frontier/block helpers, commit/reveal tracking, and BLAKE3 content-addressed store utilities |
| Domains | nanochat-val_bpb-v1 is the active PoC domain; nanogpt-val_loss-v1 is present as a stub and not yet calibrated |
| Providers | Generic interfaces for local model servers, remote model APIs, local GPUs, and rented GPU workers; live provider validation is intentionally not claimed |
| Tests | 647 Python tests and 13 Rust tests |
| Coverage | 71% total over arc_subnet/ and neurons/ |
| CI/public repo prep | GitHub Actions workflow, issue templates, PR template, Dependabot config, security policy, citation file, examples, and release docs |

What Is Not Proven Yet

This distinction matters. The project should be read as a research prototype, not as a finished subnet.

| Not proven | Current boundary |
| --- | --- |
| Public subnet | No public Bittensor testnet subnet has been launched from this repo |
| Live weight setting | Validator permit and weight-setting behavior still need testnet validation |
| Real artifact network | IPFS upload/fetch needs real multi-machine rehearsal |
| Real validator replay | Docker replay needs a GPU validator rehearsal with hardened flags |
| Paid mining economics | A remote-model plus rented-GPU run needs documented cost, runtime, and sanitized logs |
| Provider breadth | Specific model and compute vendors need production wiring or live testing before being advertised as working |
| More domains | nanochat is the only active real domain today |
| Sustained discoveries | The project has not yet produced a sustained public run with statistically validated improvements beyond the calibrated baseline |

See STATUS.md for the current release framing.

Limitations

This repository is an attempt to flesh out the idea, not a declaration that the system is solved. Karpathy's autoresearch work showed that the training-loop search process can be partially automated. This repo explores the surrounding protocol questions: what evidence should be captured, how claims could be replayed, how scoring might work, and what a public archive of verified improvements could look like.

Important limitations:

  • The smoke test is mocked. It proves the protocol shape, not real external operation.
  • The active domain is intentionally small. Success on a small training script does not imply useful discoveries on frontier-scale systems.
  • Reproducible GPU training is hard. Hardware variance, kernels, drivers, container images, and nondeterminism all need more work.
  • The economic model is speculative until real mining cost, replay cost, improvement frequency, and validator behavior are measured.
  • A live subnet could create bad incentives if scoring is weak, domains are too narrow, or validators cannot afford replay.
  • Provider integrations are abstractions at this stage. This README avoids treating any specific commercial model or GPU vendor as endorsed or proven.
  • The public-good claim depends on the archive being usable, reproducible, and open enough for researchers outside the network to benefit.

The point of publishing the PoC is to invite review of those assumptions.

How a Live Version Could Work

[Figure: verification flow diagram]

1. Public Research Domains

A domain defines the optimization task. For example:

  • seed codebase: a small training repository
  • frozen surface: files miners cannot modify, such as evaluation logic
  • search surface: files miners may modify, such as train.py
  • metric: a scalar target such as validation bits-per-byte
  • baseline: calibrated scores per hardware class

The domain is the public contract. Everyone knows what can change, what cannot change, and how success is measured.
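
As a rough illustration of that contract, a domain could be modeled as a small immutable record. This is a sketch under assumptions, not the repository's actual schema: the class name `DomainSpec`, its field names, the `allows` helper, the `evaluate.py` file name, and the baseline value are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DomainSpec:
    """Hypothetical shape of a research domain; all field names are illustrative."""

    domain_id: str
    seed_repo: str                      # small training repository
    frozen_paths: tuple[str, ...]       # files miners cannot modify
    search_paths: tuple[str, ...]       # files miners may modify
    metric: str                         # name of the scalar target
    lower_is_better: bool               # direction of improvement for the metric
    baselines: dict[str, float] = field(default_factory=dict)  # per hardware class

    def allows(self, changed_paths: list[str]) -> bool:
        """A diff is in bounds only if every touched file is on the search surface."""
        return all(path in self.search_paths for path in changed_paths)


domain = DomainSpec(
    domain_id="nanochat-val_bpb-v1",
    seed_repo="<seed-repo-url>",
    frozen_paths=("evaluate.py",),      # placeholder file name
    search_paths=("train.py",),
    metric="val_bpb",
    lower_is_better=True,               # bits-per-byte: smaller is better
    baselines={"<gpu-class>": 1.0},     # placeholder, not a measured baseline
)

print(domain.allows(["train.py"]))                 # diff stays on the search surface
print(domain.allows(["train.py", "evaluate.py"]))  # diff touches a frozen file
```

Checking changed paths against an allow-list, rather than a deny-list of frozen files, means a diff that adds a new file is rejected unless the domain explicitly permits it.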

2. Miner Search

A miner uses an LLM agent and GPU backend to search for improvements. A single iteration can look like this:

  1. Pull the current frontier for a domain.
  2. Ask an LLM to propose a training-code diff.
  3. Apply the diff to the searchable surface.
  4. Run training locally or on a rented GPU.
  5. Parse the metric.
  6. Keep promising results and discard failures.

Most attempts should fail. That is the point. The network pays for verified discoveries, not for every experiment someone runs.
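
The iteration above can be sketched as a loop that keeps only candidates beating the current frontier. The function `search` and its callback parameters are hypothetical names for illustration; the LLM and the GPU run are replaced by random stand-ins so the sketch runs anywhere, and the metric is assumed to be lower-is-better.

```python
import random
from typing import Callable, Optional


def search(
    propose: Callable[[float], str],
    run_experiment: Callable[[str], Optional[float]],
    frontier_metric: float,
    iterations: int,
) -> list[tuple[str, float]]:
    """Return (diff, metric) pairs that improved on the best result so far."""
    kept: list[tuple[str, float]] = []
    best = frontier_metric
    for _ in range(iterations):
        diff = propose(best)              # ask for a candidate diff
        metric = run_experiment(diff)     # apply the diff, train, parse the metric
        if metric is None:                # crashed or unparsable run: discard
            continue
        if metric < best:                 # keep only results that beat the frontier
            kept.append((diff, metric))
            best = metric
    return kept


# Stand-ins so the sketch runs without an LLM or a GPU.
rng = random.Random(0)
fake_propose = lambda best: f"diff-{rng.randrange(10**6)}"
fake_run = lambda diff: 1.0 + rng.uniform(-0.02, 0.08)

candidates = search(fake_propose, fake_run, frontier_metric=1.0, iterations=20)
print(f"kept {len(candidates)} of 20 attempts")
```

Note that a locally kept result is only a candidate: it has not yet survived replay or any statistical gate.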

3. Evidence Before Announcement

Before a miner announces a proposal, it bundles evidence:

  • the exact diff
  • experiment config
  • environment manifest
  • training log
  • metrics output

Each artifact is hashed with BLAKE3. In the intended live system, the artifacts are uploaded before the proposal is announced so validators can fetch and rehash them.
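
A minimal sketch of bundling and rehashing follows. It assumes the third-party `blake3` Python package; if that package is not installed, the sketch falls back to BLAKE2b purely so it stays runnable, and those digests would not match real BLAKE3 hashes. The artifact key names and the helpers `bundle_hashes` and `rehash_matches` are hypothetical.

```python
import hashlib

try:
    from blake3 import blake3 as _hasher   # third-party package: pip install blake3
except ImportError:
    # Stand-in so the sketch runs without the dependency. BLAKE2b is NOT BLAKE3;
    # digests produced here would differ from those in a real evidence bundle.
    def _hasher(data: bytes):
        return hashlib.blake2b(data, digest_size=32)

# Hypothetical key names for the five artifacts listed above.
ARTIFACTS = ("diff", "config", "environment", "training_log", "metrics")


def bundle_hashes(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Hash every required artifact; refuse incomplete bundles."""
    missing = [name for name in ARTIFACTS if name not in artifacts]
    if missing:
        raise ValueError(f"missing artifacts: {missing}")
    return {name: _hasher(artifacts[name]).hexdigest() for name in ARTIFACTS}


def rehash_matches(artifacts: dict[str, bytes], claimed: dict[str, str]) -> bool:
    """Validator-side check: fetched bytes must reproduce the claimed hashes."""
    return bundle_hashes(artifacts) == claimed


evidence = {
    "diff": b"--- a/train.py\n+++ b/train.py\n",
    "config": b'{"seed": 0}',
    "environment": b"python=3.10\n",
    "training_log": b"step 1 loss ...\n",
    "metrics": b'{"val_bpb": 0.99}',
}
claimed = bundle_hashes(evidence)
tampered = {**evidence, "metrics": b'{"val_bpb": 0.90}'}
print(rehash_matches(evidence, claimed), rehash_matches(tampered, claimed))
```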

4. Commit-Reveal

The miner commits before revealing the full proposal:

sha3_256(diff_hash + salt + domain_id)

Binding the commit to domain_id prevents a valid improvement from one domain being replayed against another domain's baseline. This was one of the important runtime mismatches corrected during public-release preparation.
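
The commit above can be sketched with the standard library. The formula is from this README; the encoding details are assumptions: fields are treated as UTF-8 strings concatenated in the order shown with no separator, and the helper names `make_commit` and `verify_reveal` are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_commit(diff_hash: str, salt: str, domain_id: str) -> str:
    # Assumption: UTF-8 strings concatenated in this order, no separator.
    payload = (diff_hash + salt + domain_id).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


def verify_reveal(commit: str, diff_hash: str, salt: str, domain_id: str) -> bool:
    """Recompute the commit from the revealed fields and compare."""
    expected = make_commit(diff_hash, salt, domain_id)
    return hmac.compare_digest(commit, expected)


salt = secrets.token_hex(16)   # fresh random salt per proposal
diff_hash = "ab" * 32          # placeholder 64-character hex digest
commit = make_commit(diff_hash, salt, "nanochat-val_bpb-v1")

# The same reveal checked against a different domain does not verify.
print(verify_reveal(commit, diff_hash, salt, "nanochat-val_bpb-v1"))
print(verify_reveal(commit, diff_hash, salt, "nanogpt-val_loss-v1"))
```

Concatenation without separators is only unambiguous when the leading fields have fixed lengths, as the hex digest and salt do in this sketch. A design with variable-length fields would need length prefixes or delimiters.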

5. Validator Replay

Validators fetch evidence, rehash artifacts, rebuild the experiment, and replay the candidate under controlled conditions. The PoC scoring path uses K=5 replays and a paired bootstrap significance test.

A proposal is only useful if the replayed result supports the claim.
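
One common form of a paired bootstrap test is sketched below. It is an illustration under assumptions, not necessarily the variant the PoC implements: replays are paired (for example by seed), the metric is lower-is-better, and the one-sided p-value is the smoothed fraction of resampled mean deltas that show no improvement. The function name, the resample count, and the sample numbers are hypothetical.

```python
import random


def paired_bootstrap_p(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value that the candidate does not improve on the baseline."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equal-length, non-empty paired samples")
    # Positive delta means the candidate scored lower (better) on that pair.
    deltas = [b - c for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    k = len(deltas)
    not_improved = 0
    for _ in range(n_resamples):
        sample_sum = sum(deltas[rng.randrange(k)] for _ in range(k))
        if sample_sum <= 0:               # resampled mean delta shows no improvement
            not_improved += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (not_improved + 1) / (n_resamples + 1)


# Illustrative numbers only, K=5 paired replays.
baseline_runs = [1.000, 1.002, 0.998, 1.001, 0.999]
candidate_runs = [0.990, 0.991, 0.989, 0.992, 0.988]
p_value = paired_bootstrap_p(baseline_runs, candidate_runs)
print(p_value < 0.05)
```

With only five pairs there are few distinct resamples, so the p-value is coarse. That is one reason the scoring path combines it with a minimum-delta gate and a tolerance check rather than relying on significance alone.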

6. Scoring and Weights

The validator score is not just "the metric improved." It passes through gates:

  1. replay delta must be positive
  2. improvement must clear the minimum-delta threshold
  3. bootstrap p-value must be significant
  4. claimed delta and replayed delta must be within tolerance
  5. score is scaled with diminishing returns
  6. transfer and ancestry logic can adjust reward
  7. final score is clamped into [0, 1]

In a live subnet, validator weights would be set from those verified scores.
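
The gates can be sketched as a single function that returns zero as soon as any check fails. Every constant here (`min_delta`, `alpha`, `tolerance`, `scale`, `decay`), the exponential diminishing-returns curve, the way transfer and ancestry adjust the score, and the sum-to-one normalization are assumptions for illustration; the repository's reward.py may use different forms and values.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplayResult:
    claimed_delta: float       # improvement the miner claimed
    replay_delta: float        # improvement the validator measured
    p_value: float             # from the bootstrap significance test
    transfer_bonus: float = 0.0
    ancestry_depth: int = 0


def score(
    r: ReplayResult,
    *,
    min_delta: float = 0.002,  # hypothetical threshold
    alpha: float = 0.05,       # hypothetical significance level
    tolerance: float = 0.5,    # hypothetical relative tolerance
    scale: float = 0.02,       # hypothetical diminishing-returns scale
    decay: float = 0.9,        # hypothetical per-generation ancestry decay
) -> float:
    if r.replay_delta <= 0:                     # gate 1: positive replay delta
        return 0.0
    if r.replay_delta < min_delta:              # gate 2: minimum-delta threshold
        return 0.0
    if r.p_value >= alpha:                      # gate 3: significance
        return 0.0
    if abs(r.claimed_delta - r.replay_delta) > tolerance * abs(r.claimed_delta):
        return 0.0                              # gate 4: claim vs replay tolerance
    base = 1.0 - math.exp(-r.replay_delta / scale)        # diminishing returns
    adjusted = base * (1.0 + r.transfer_bonus) * decay ** r.ancestry_depth
    return max(0.0, min(1.0, adjusted))         # clamp into [0, 1]


def normalize(scores: dict[int, float]) -> dict[int, float]:
    """Turn per-miner scores into weights that sum to 1 (or all zeros)."""
    total = sum(scores.values())
    if total <= 0:
        return {uid: 0.0 for uid in scores}
    return {uid: s / total for uid, s in scores.items()}


scores = {
    1: score(ReplayResult(claimed_delta=0.010, replay_delta=0.010, p_value=0.001)),
    2: score(ReplayResult(claimed_delta=0.050, replay_delta=0.010, p_value=0.001)),
    3: score(ReplayResult(claimed_delta=0.010, replay_delta=0.010, p_value=0.200)),
}
weights = normalize(scores)
print(weights)
```

In this sketch miner 2 is zeroed for overclaiming and miner 3 for failing significance, so all weight goes to miner 1.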

7. Public Research Archive

The long-term output should be a public archive of:

  • proposed diffs
  • evidence bundles
  • replay results
  • hardware class metadata
  • validator decisions
  • domain frontiers

That archive is what makes this more than a mining game.
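
To make the archive idea concrete, one entry could be stored as a JSON line. This record layout is purely hypothetical: the class `ArchiveRecord`, its fields, and the helper names are illustrative and are not taken from the repository's examples/ directory.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ArchiveRecord:
    """Hypothetical archive entry; field names are illustrative."""

    domain_id: str
    diff_hash: str
    evidence_hashes: dict[str, str]
    replay_deltas: list[float]
    hardware_class: str
    accepted: bool                          # the validator decision
    parent_diff_hash: Optional[str] = None  # link to the frontier it built on


def to_json_line(record: ArchiveRecord) -> str:
    # sort_keys gives a stable serialization for diffing and rehashing.
    return json.dumps(asdict(record), sort_keys=True)


def from_json_line(line: str) -> ArchiveRecord:
    return ArchiveRecord(**json.loads(line))


record = ArchiveRecord(
    domain_id="nanochat-val_bpb-v1",
    diff_hash="ab" * 32,                    # placeholder digest
    evidence_hashes={"diff": "ab" * 32},    # placeholder digest
    replay_deltas=[0.010, 0.011, 0.009, 0.012, 0.010],  # illustrative values
    hardware_class="<gpu-class>",
    accepted=True,
)
line = to_json_line(record)
print(from_json_line(line) == record)
```

Keeping a parent pointer on each record is what would let a reader reconstruct a domain frontier as a chain of accepted diffs.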

Why This Could Become a Global Public Good

If the system works, it could create public infrastructure for AI research in the same way package registries created public infrastructure for software.

Reproducibility by Default

Every rewarded claim would need artifacts, hashes, logs, configs, and replay results. That pushes the culture toward evidence-first research instead of unverifiable claims.

Open Access to Training Improvements

Small improvements often disappear inside private runs. A public network could turn those improvements into open records: code diffs with measured effects. Researchers, students, labs, and independent builders could inspect and reuse them.

Better Use of Global Compute

There is already a huge amount of undercoordinated GPU experimentation. A verified market could direct some of that compute toward reusable public research artifacts instead of private trial-and-error.

A Bridge for Independent Researchers

Independent researchers often lack the credibility layer that institutions provide. Replay verification can make the evidence portable: the work stands or falls by whether validators can reproduce it.

Shared Benchmarks for Useful Work

The system could help define a more rigorous version of proof-of-useful-work: not "I spent compute," but "I spent compute and produced a verified, reusable improvement."

A Public Dataset of Failed and Successful Search

The first reward path focuses on successful improvements. A mature system could also preserve failed experiments, near misses, and negative results. That would help future agents avoid rediscovering the same dead ends.

Not a Replacement for Science

This would not replace deep theory, peer review, or human taste. It would be a research substrate: a way to generate, verify, and publish small empirical training improvements continuously.

Run the Local Proof

This runs the mocked end-to-end path. It needs no GPU, no IPFS daemon, no Docker, no Bittensor node, and no paid API keys.

git clone https://github.com/claudlos/proof-of-autoresearch.git
cd proof-of-autoresearch

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

python scripts/smoke_test.py

Expected result:

ALL CHECKS PASSED

See docs/DEMO.md for the full walkthrough and exactly what is mocked.

Validation State

Current local validation for the public proof-of-concept candidate:

| Check | Status |
| --- | --- |
| Python package install | Passing |
| Ruff lint | Passing |
| Mypy | Passing on arc_subnet/ and neurons/ |
| Python tests | 647 passing |
| Python coverage | 71% total |
| Smoke test | 7/7 mocked checks passing |
| Fresh release-copy install | Passing with a new venv from a temporary copy |
| Rust format | Passing |
| Rust build | Passing |
| Rust tests | 13 passing |
| Rust clippy | Passing |
| Markdown local links | Passing |
| Example JSON | Passing |
| SVG XML | Passing |
| Public-candidate secret scan | No real secrets found |

Validation commands:

python -m ruff check arc_subnet/ neurons/ tests/ scripts/
python -m mypy arc_subnet/ neurons/ --ignore-missing-imports
python scripts/smoke_test.py
python -m pytest tests/ -q --tb=short
python -m pytest tests/ --cov=arc_subnet --cov=neurons --cov-report=term-missing -q --tb=short

cargo fmt --all --check
cargo build --release --workspace
cargo test --workspace
cargo clippy --workspace -- -D warnings

See docs/VALIDATION.md for the validation report.

Generic Integration Shapes

These commands are for future integration work. They are not proof of public testnet or mainnet readiness, and the placeholder backend names are not meant to advertise any specific vendor.

Remote Model API plus Rented GPU Worker

REMOTE_MODEL_API_KEY="<remote-model-api-key>" \
GPU_RENTAL_API_KEY="<gpu-rental-api-key>" \
ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
  --netuid <NETUID> \
  --wallet.name miner \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --miner.llm_backend <remote-model-backend> \
  --miner.train_backend <cloud-gpu-backend> \
  --miner.cloud_gpu "<gpu-class>" \
  --miner.cloud_max_price 0.35

Local LLM plus Local GPU

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
  --netuid <NETUID> \
  --wallet.name miner \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --miner.llm_backend local \
  --miner.llm_endpoint http://localhost:8000/v1 \
  --miner.train_backend local

Validator

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py validator \
  --netuid <NETUID> \
  --wallet.name validator \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --validator.timeout 1800 \
  --epoch_length 150

Repository Map

arc_subnet/
  protocol.py          ResearchProposal synapse and validation helpers
  reward.py            8-stage reward/scoring pipeline
  arc_client.py        Python wrapper around the Rust arc-cli
  llm_providers.py     LLM provider registry
  cloud_providers.py   GPU backend registry
  metrics.py           Prometheus metrics helpers

neurons/
  miner.py             Miner agent loop, training adapters, evidence bundling
  validator.py         Validator replay, scoring, and weight update logic
  production.py        Bittensor SDK wiring and CLI entry point
  domains/             Domain definitions and registry

crates/
  core/                Rust ARC store, SQLite DB, IPFS helpers, types
  cli/                 Rust arc-cli commands

docs/                  Public concept, demo, architecture, validation, FAQ
paper/                 Technical, economics, and verification papers
examples/              Proposal, evidence, and score JSON examples
assets/brand/          README banner, mark, and verification-flow diagram
scripts/               Smoke test and local helper scripts
tests/                 Python test suite

Roadmap

The short version:

  1. Clarify the idea, limitations, and validation boundary.
  2. Keep the mocked local proof easy to run.
  3. Define the real-world experiments needed to test the idea.
  4. Validate storage, replay, model, compute, and chain assumptions one at a time.
  5. Add another calibrated domain only after the first one is understood.
  6. Treat testnet planning as a later research milestone, not a launch promise.

See ROADMAP.md for the public roadmap.

Contributing

This project is best suited for builders who want to make the idea more real, not for users looking for a finished subnet.

Good first contributions:

  • run the fresh-clone rehearsal and report setup drift
  • validate one real integration path: artifact storage, Docker replay, rented GPU execution, remote model inference, or testnet weight-setting
  • compute and document frozen-surface hashes for the active nanochat domain
  • improve docs around adding a second real domain after nanochat
  • add tests before widening provider claims

The remaining integration gaps are listed in STATUS.md, ROADMAP.md, and .github/ISSUE_SEEDS.md. See CONTRIBUTING.md for development setup.

Security

This is a research prototype that executes training code and shells out to GPU, Docker, IPFS, and Bittensor tooling. Do not run it with production wallets, valuable hotkeys, or broad cloud permissions. See SECURITY.md for reporting guidance and operational cautions.

Acknowledgments

  • Andrej Karpathy - autoresearch and the No Priors discussion that framed autoresearch as workers searching for useful training-code commits.
  • Wei Dai - the proof-of-useful-work question that connected autoresearch to discovery/verification asymmetry.
  • Bittensor / Opentensor Foundation - the subnet and validator/miner incentive model this prototype targets.
  • AutoResearch-Chain (ARC) - the verification protocol that inspired the evidence and replay model.

License

MIT - see LICENSE.
