Proof of Autoresearch

Proof of Autoresearch is a proof-of-concept for verified AI research mining.

The project explores a simple question:

Can a decentralized network reward people for discovering real improvements to AI training, while forcing every claim to be replayable and public?

This repository combines LLM-driven autoresearch, GPU experiment execution, content-addressed evidence bundles, statistical replay verification, and Bittensor-style miner/validator incentives. It is not a live subnet, not a mainnet-ready system, and not a profitability claim. It is a working research prototype intended to make the idea concrete enough for builders to inspect, run locally, critique, and extend.

Origin of the Idea

Modern AI research has a hard coordination problem. Useful training discoveries are expensive to find, easy to lose, and often hard to verify after the fact. Labs run thousands of private experiments, publish only a small fraction of the results, and leave the rest as invisible institutional memory. Independent researchers can have strong ideas but limited access to compute, incentives, and credible ways to prove that their results are real.

The immediate origin of this project is Karpathy's discussion of autoresearch in the No Priors interview on code agents and the "loopy era" of AI. His point was not just that an LLM can write code. It was that parts of research can be refactored so the human is no longer the step-by-step bottleneck: define an objective, define the metric, define the boundaries of what can change, then let workers try ideas.

In that framing, autoresearch starts to look like a research operating system. Ideas enter a queue. Workers pull an idea, turn it into a code change, run the experiment, read the metric, and keep only what improves the result. The output is not a paper or a chat transcript. The output is a sequence of commits that change training code and can be checked.

Karpathy then pushed the idea further: what if the workers are not trusted? What if they are just a large pool of people or agents on the internet contributing compute? In the transcript, he describes a system where "instead of blocks, you have commits" and where the hard work is searching through many experiments to find commits that actually improve the model. The missing piece is verification: an untrusted pool can search, but a trusted verification pool has to replay candidate commits and decide whether they are real.

Wei Dai connected that framing to proof-of-useful-work: can proof-of-work be built on top of autoresearch? The reason it is plausible is the same asymmetry:

Discovery is expensive. Verification is cheaper once a candidate exists.

A miner may spend many GPU hours trying failed ideas. But after a promising diff exists, a validator can replay the same experiment under controlled conditions and check whether the metric really improved.

Proof of Autoresearch is an attempt to flesh out that missing system. What is the claim object? What evidence has to be attached? How do commits build on each other? How does a verifier replay work safely? How should statistical noise be handled? How could the results become a public archive instead of a private experiment log?

AutoResearch-Chain and Bittensor are possible infrastructure for exploring those questions. ARC supplies a verification vocabulary: frozen surfaces, searchable surfaces, evidence hashes, commit-reveal, and replayable claims. Bittensor supplies a miner/validator network shape that can score useful work. This repo is the experiment of making that idea concrete enough to inspect.

The Core Thesis

Most proof-of-work systems spend compute to produce scarce but otherwise meaningless artifacts. AI research already spends compute to search for useful artifacts: better training recipes, optimizer settings, architecture tweaks, data handling changes, and reproducible metric improvements.

The thesis is that a network can treat a verified training improvement as the unit of useful work.

Role	What they do	What must be proven
Miner	Searches for a code diff that improves a frozen training task	The diff, config, logs, environment, and metrics are all captured
Validator	Replays the submitted experiment independently	The claimed improvement survives replay and statistical tests
Network	Rewards verified improvements	Weights are based on replayed evidence, not trust
Public	Receives the output	A reusable archive of verified research artifacts

If this works at scale, the output is not just emissions or leaderboard points. The output is an open corpus of training changes with evidence attached.

What This Repository Is

This repo is a public proof-of-concept for the protocol and subnet shape. The goal is to show the full logical path:

Define a training domain with frozen files, searchable files, a metric, and a baseline.
Let a miner propose and run a candidate training diff.
Bundle the evidence as content-addressed artifacts.
Verify a domain-bound commit-reveal proposal.
Replay the claimed improvement.
Score the result with statistical gates.
Feed the verified score into validator weights.

The mocked smoke test exercises that whole path locally without GPU, IPFS, Docker, Bittensor, paid model APIs, or cloud compute providers. The production-facing code paths exist, but real multi-machine and testnet validation remain future work.

What Has Been Built So Far

Area	Current state
Protocol model	`ResearchProposal` synapse with 16 fields, evidence hashes, replay fields, and domain-bound commit-reveal
Evidence model	Five BLAKE3 artifact hashes: diff, config, environment manifest, training log, and metrics output
Miner logic	LLM diff loop, training execution adapters, evidence bundling, proposal construction, and local/remote execution paths
Validator logic	Commit-reveal checks, evidence validation, replay hooks, score computation, and weight-update path
Reward model	8-stage scoring pipeline with replay delta, minimum-delta gate, paired bootstrap p-value, tolerance check, transfer bonus, ancestry decay, and clamping
Rust ARC CLI	SQLite-backed local store, domain/frontier/block helpers, commit/reveal tracking, and BLAKE3 content-addressed store utilities
Domains	`nanochat-val_bpb-v1` is the active PoC domain; `nanogpt-val_loss-v1` is present as a stub and not yet calibrated
Providers	Generic interfaces for local model servers, remote model APIs, local GPUs, and rented GPU workers; live provider validation is intentionally not claimed
Tests	647 Python tests and 13 Rust tests
Coverage	71% total over `arc_subnet/` and `neurons/`
CI/public repo prep	GitHub Actions workflow, issue templates, PR template, Dependabot config, security policy, citation file, examples, and release docs

What Is Not Proven Yet

This distinction matters. The project should be read as a research prototype, not as a finished subnet.

Not proven	Current boundary
Public subnet	No public Bittensor testnet subnet has been launched from this repo
Live weight setting	Validator permit and weight-setting behavior still need testnet validation
Real artifact network	IPFS upload/fetch needs real multi-machine rehearsal
Real validator replay	Docker replay needs a GPU validator rehearsal with hardened flags
Paid mining economics	A remote-model plus rented-GPU run needs documented cost, runtime, and sanitized logs
Provider breadth	Specific model and compute vendors need production wiring or live testing before being advertised as working
More domains	nanochat is the only active real domain today
Sustained discoveries	The project has not yet produced a sustained public run with statistically validated improvements beyond the calibrated baseline

See STATUS.md for the current release framing.

Limitations

This repository fleshes out the idea; it does not declare the system solved. Karpathy's autoresearch work showed that the training-loop search process can be partially automated. This repo explores the surrounding protocol questions: what evidence should be captured, how claims could be replayed, how scoring might work, and what a public archive of verified improvements could look like.

Important limitations:

The smoke test is mocked. It proves the protocol shape, not real external operation.
The active domain is intentionally small. Success on a small training script does not imply useful discoveries on frontier-scale systems.
Reproducible GPU training is hard. Hardware variance, kernels, drivers, container images, and nondeterminism all need more work.
The economic model is speculative until real mining cost, replay cost, improvement frequency, and validator behavior are measured.
A live subnet could create bad incentives if scoring is weak, domains are too narrow, or validators cannot afford replay.
Provider integrations are abstractions at this stage. This README avoids treating any specific commercial model or GPU vendor as endorsed or proven.
The public-good claim depends on the archive being usable, reproducible, and open enough for researchers outside the network to benefit.

The point of publishing the PoC is to invite review of those assumptions.

How a Live Version Could Work

1. Public Research Domains

A domain defines the optimization task. For example:

seed codebase: a small training repository
frozen surface: files miners cannot modify, such as evaluation logic
search surface: files miners may modify, such as train.py
metric: a scalar target such as validation bits-per-byte
baseline: calibrated scores per hardware class

The domain is the public contract. Everyone knows what can change, what cannot change, and how success is measured.
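
As a rough illustration of that contract, a domain could be modeled as an immutable record plus a check that a diff only touches searchable files. This is a minimal sketch, not the repo's actual domain schema: the `ResearchDomain` class, its field names, the `eval.py` file name, the hardware-class label, and the baseline value are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResearchDomain:
    """Hypothetical domain contract; names are illustrative, not the repo's schema."""

    domain_id: str
    frozen_surface: frozenset[str]  # files miners must not modify
    search_surface: frozenset[str]  # files miners may modify
    metric: str                     # name of the scalar target
    lower_is_better: bool
    baselines: dict[str, float] = field(default_factory=dict)  # keyed by hardware class

    def diff_is_allowed(self, touched_files: list[str]) -> bool:
        """A diff is admissible only if it touches at least one file and
        every touched file belongs to the search surface."""
        touched = set(touched_files)
        return bool(touched) and touched <= self.search_surface


# Placeholder values: only the domain id, train.py, and the metric name come from this README.
domain = ResearchDomain(
    domain_id="nanochat-val_bpb-v1",
    frozen_surface=frozenset({"eval.py"}),
    search_surface=frozenset({"train.py"}),
    metric="val_bpb",
    lower_is_better=True,
    baselines={"example-gpu-class": 1.0},
)

print(domain.diff_is_allowed(["train.py"]))             # searchable file only
print(domain.diff_is_allowed(["train.py", "eval.py"]))  # touches the frozen surface
```

Making the record immutable mirrors the idea that the domain is a fixed public contract rather than something a miner or validator can adjust mid-run.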

2. Miner Search

A miner uses an LLM agent and GPU backend to search for improvements. A single iteration can look like this:

Pull the current frontier for a domain.
Ask an LLM to propose a training-code diff.
Apply the diff to the searchable surface.
Run training locally or on a rented GPU.
Parse the metric.
Keep promising results and discard failures.

Most attempts should fail. That is the point. The network pays for verified discoveries, not for every experiment someone runs.
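
One iteration of the loop above can be sketched with the LLM and the training run passed in as plain callables. This is an assumption-heavy outline, not the logic in `neurons/miner.py`: `search_step`, `propose_diff`, and `run_training` are hypothetical names, and the stub lambdas stand in for a real model call and a real GPU run.

```python
from typing import Callable, Optional


def search_step(
    frontier_metric: float,
    propose_diff: Callable[[], str],
    run_training: Callable[[str], Optional[float]],
    lower_is_better: bool = True,
) -> Optional[tuple[str, float]]:
    """Run one propose/train/compare iteration.

    Returns (diff, metric) only when the candidate beats the frontier,
    otherwise None so the caller can discard the attempt.
    """
    diff = propose_diff()
    metric = run_training(diff)  # None models a crashed or unparsable run
    if metric is None:
        return None
    if lower_is_better:
        improved = metric < frontier_metric
    else:
        improved = metric > frontier_metric
    return (diff, metric) if improved else None


# Stubbed usage: a fake proposal and fake training results.
kept = search_step(1.00, lambda: "diff-a", lambda d: 0.98)
dropped = search_step(1.00, lambda: "diff-b", lambda d: 1.02)
print(kept, dropped)
```

Keeping the model and the trainer behind callables is one way to let the same loop run against local or rented backends; whether the repo structures it this way is not claimed here.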

3. Evidence Before Announcement

Before a miner announces a proposal, it bundles evidence:

the exact diff
experiment config
environment manifest
training log
metrics output

Each artifact is hashed with BLAKE3. In the intended live system, the artifacts are uploaded before the proposal is announced so validators can fetch and rehash them.
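
The bundle-then-rehash idea can be sketched as follows. The artifact key names and the `bundle_evidence` / `verify_bundle` helpers are hypothetical. The sketch uses the third-party `blake3` Python package when it is installed; otherwise it falls back to BLAKE2b from the standard library purely so the example runs, and that fallback is not BLAKE3 and would not match digests produced by the real system.

```python
import hashlib

try:
    from blake3 import blake3  # third-party: pip install blake3

    def artifact_hash(data: bytes) -> str:
        return blake3(data).hexdigest()

except ImportError:
    # Stand-in only. BLAKE2b is a different hash function from BLAKE3.
    def artifact_hash(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=32).hexdigest()


# Hypothetical keys for the five artifacts listed above.
ARTIFACT_NAMES = ("diff", "config", "environment", "training_log", "metrics")


def bundle_evidence(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Hash every required artifact; refuse to bundle an incomplete set."""
    missing = [name for name in ARTIFACT_NAMES if name not in artifacts]
    if missing:
        raise ValueError(f"missing artifacts: {missing}")
    return {name: artifact_hash(artifacts[name]) for name in ARTIFACT_NAMES}


def verify_bundle(artifacts: dict[str, bytes], claimed: dict[str, str]) -> bool:
    """Validator side: rehash the fetched bytes and compare to the claimed hashes."""
    return bundle_evidence(artifacts) == claimed


artifacts = {name: f"example {name} bytes".encode() for name in ARTIFACT_NAMES}
claimed = bundle_evidence(artifacts)
tampered = dict(artifacts, metrics=b"edited after the fact")
print(verify_bundle(artifacts, claimed), verify_bundle(tampered, claimed))
```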

4. Commit-Reveal

The miner commits before revealing the full proposal:

sha3_256(diff_hash + salt + domain_id)

Binding the commit to domain_id prevents a valid improvement from one domain being replayed against another domain's baseline. This was one of the important runtime mismatches corrected during public-release preparation.
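
A small sketch of the commitment and its check, using `hashlib.sha3_256` from the standard library. The preimage encoding here (UTF-8 string concatenation in the order shown above) is an assumption; the repo may encode fields differently, and a hardened design would likely use fixed-length or length-prefixed fields so that concatenation is unambiguous. `make_commitment` and `verify_reveal` are hypothetical helper names.

```python
import hashlib
import hmac
import secrets


def make_commitment(diff_hash: str, salt: str, domain_id: str) -> str:
    """sha3_256(diff_hash + salt + domain_id), with the fields joined as UTF-8 text."""
    preimage = (diff_hash + salt + domain_id).encode("utf-8")
    return hashlib.sha3_256(preimage).hexdigest()


def verify_reveal(commitment: str, diff_hash: str, salt: str, domain_id: str) -> bool:
    """Recompute the commitment from the revealed fields and compare in constant time."""
    expected = make_commitment(diff_hash, salt, domain_id)
    return hmac.compare_digest(expected, commitment)


salt = secrets.token_hex(16)   # fresh random salt per proposal
diff_hash = "ab" * 32          # placeholder artifact hash
commitment = make_commitment(diff_hash, salt, "nanochat-val_bpb-v1")

# The reveal checks out for the committed domain and fails for any other one.
print(verify_reveal(commitment, diff_hash, salt, "nanochat-val_bpb-v1"))
print(verify_reveal(commitment, diff_hash, salt, "nanogpt-val_loss-v1"))
```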

5. Validator Replay

Validators fetch evidence, rehash artifacts, rebuild the experiment, and replay the candidate under controlled conditions. The PoC scoring path uses K=5 replays and a paired bootstrap significance test.

A proposal is only useful if the replayed result supports the claim.
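
To make the significance test concrete, here is one common form of a one-sided paired bootstrap over K paired replays. It is a sketch under assumptions, not the statistic implemented in `arc_subnet/reward.py`: the function name, the resample count, the add-one smoothing, and the example numbers are all illustrative.

```python
import random


def paired_bootstrap_pvalue(
    baseline: list[float],
    candidate: list[float],
    lower_is_better: bool = True,
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate how often resampled paired deltas show no improvement.

    Each index pairs a baseline replay with a candidate replay run under
    the same conditions. Deltas are oriented so that positive means better.
    """
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("need equal-length, non-empty paired samples")
    sign = 1.0 if lower_is_better else -1.0
    deltas = [sign * (b - c) for b, c in zip(baseline, candidate)]
    rng = random.Random(seed)
    k = len(deltas)
    not_improved = 0
    for _ in range(n_resamples):
        resample = [deltas[rng.randrange(k)] for _ in range(k)]
        if sum(resample) <= 0:
            not_improved += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (not_improved + 1) / (n_resamples + 1)


# Made-up K=5 replay metrics (lower is better).
base = [1.00, 1.01, 0.99, 1.00, 1.02]
cand = [0.95, 0.96, 0.94, 0.95, 0.97]
print(paired_bootstrap_pvalue(base, cand))
```

With only five pairs the resampled distribution is coarse, which is part of why a minimum-delta gate and a tolerance check are useful companions to the p-value.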

6. Scoring and Weights

The validator score is not just "the metric improved." It passes through gates:

replay delta must be positive
improvement must clear the minimum-delta threshold
bootstrap p-value must be significant
claimed delta and replayed delta must be within tolerance
score is scaled with diminishing returns
transfer and ancestry logic can adjust reward
final score is clamped into [0, 1]

In a live subnet, validator weights would be set from those verified scores.
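
The gates above can be read as a short pipeline that returns zero as soon as any gate fails. The sketch below is illustrative only: `score_proposal`, every default threshold, the exponential diminishing-returns curve, and the way the transfer bonus and ancestry decay are combined are assumptions, not the repo's reward function. Deltas are oriented so that positive means improvement.

```python
import math


def score_proposal(
    claimed_delta: float,
    replay_delta: float,
    p_value: float,
    *,
    min_delta: float = 0.001,     # hypothetical minimum-delta threshold
    alpha: float = 0.05,          # hypothetical significance level
    tolerance: float = 0.5,       # allowed relative gap between claim and replay
    scale: float = 0.01,          # controls how quickly returns diminish
    transfer_bonus: float = 0.0,
    ancestry_decay: float = 1.0,
) -> float:
    """Apply the gates in order; any failed gate yields a score of 0.0."""
    if replay_delta <= 0:
        return 0.0
    if replay_delta < min_delta:
        return 0.0
    if p_value >= alpha:
        return 0.0
    if abs(claimed_delta - replay_delta) > tolerance * abs(claimed_delta):
        return 0.0
    base = 1.0 - math.exp(-replay_delta / scale)   # diminishing returns
    adjusted = (base + transfer_bonus) * ancestry_decay
    return max(0.0, min(1.0, adjusted))            # clamp into [0, 1]


print(score_proposal(0.010, 0.010, p_value=0.001))  # passes every gate
print(score_proposal(0.010, 0.010, p_value=0.200))  # fails the significance gate
print(score_proposal(0.100, 0.010, p_value=0.001))  # claim far exceeds replay
```

Ordering the cheap checks first means a validator can reject most bad proposals without reaching the scaling and adjustment steps.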

7. Public Research Archive

The long-term output should be a public archive of:

proposed diffs
evidence bundles
replay results
hardware class metadata
validator decisions
domain frontiers

That archive is what makes this more than a mining game.

Why This Could Become a Global Public Good

If the system works, it could create public infrastructure for AI research in the same way package registries created public infrastructure for software.

Reproducibility by Default

Every rewarded claim would need artifacts, hashes, logs, configs, and replay results. That pushes the culture toward evidence-first research instead of unverifiable claims.

Open Access to Training Improvements

Small improvements often disappear inside private runs. A public network could turn those improvements into open records: code diffs with measured effects. Researchers, students, labs, and independent builders could inspect and reuse them.

Better Use of Global Compute

There is already a huge amount of undercoordinated GPU experimentation. A verified market could direct some of that compute toward reusable public research artifacts instead of private trial-and-error.

A Bridge for Independent Researchers

Independent researchers often lack the credibility layer that institutions provide. Replay verification can make the evidence portable: the work stands or falls by whether validators can reproduce it.

Shared Benchmarks for Useful Work

The system could help define a more rigorous version of proof-of-useful-work: not "I spent compute," but "I spent compute and produced a verified, reusable improvement."

A Public Dataset of Failed and Successful Search

The first reward path focuses on successful improvements. A mature system could also preserve failed experiments, near misses, and negative results. That would help future agents avoid rediscovering the same dead ends.

Not a Replacement for Science

This would not replace deep theory, peer review, or human taste. It would be a research substrate: a way to generate, verify, and publish small empirical training improvements continuously.

Run the Local Proof

This runs the mocked end-to-end path. It needs no GPU, no IPFS daemon, no Docker, no Bittensor node, and no paid API keys.

git clone https://github.com/claudlos/proof-of-autoresearch.git
cd proof-of-autoresearch

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

python scripts/smoke_test.py

Expected result:

ALL CHECKS PASSED

See docs/DEMO.md for the full walkthrough and exactly what is mocked.

Validation State

Current local validation for the public proof-of-concept candidate:

Check	Status
Python package install	Passing
Ruff lint	Passing
Mypy	Passing on `arc_subnet/` and `neurons/`
Python tests	647 passing
Python coverage	71% total
Smoke test	7/7 mocked checks passing
Fresh release-copy install	Passing with a new venv from a temporary copy
Rust format	Passing
Rust build	Passing
Rust tests	13 passing
Rust clippy	Passing
Markdown local links	Passing
Example JSON	Passing
SVG XML	Passing
Public-candidate secret scan	No real secrets found

Validation commands:

python -m ruff check arc_subnet/ neurons/ tests/ scripts/
python -m mypy arc_subnet/ neurons/ --ignore-missing-imports
python scripts/smoke_test.py
python -m pytest tests/ -q --tb=short
python -m pytest tests/ --cov=arc_subnet --cov=neurons --cov-report=term-missing -q --tb=short

cargo fmt --all --check
cargo build --release --workspace
cargo test --workspace
cargo clippy --workspace -- -D warnings

See docs/VALIDATION.md for the validation report.

Generic Integration Shapes

These commands are for future integration work. They are not proof of public testnet or mainnet readiness, and the placeholder backend names are not meant to advertise any specific vendor.

Remote Model API plus Rented GPU Worker

REMOTE_MODEL_API_KEY="<remote-model-api-key>" \
GPU_RENTAL_API_KEY="<gpu-rental-api-key>" \
ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
  --netuid <NETUID> \
  --wallet.name miner \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --miner.llm_backend <remote-model-backend> \
  --miner.train_backend <cloud-gpu-backend> \
  --miner.cloud_gpu "<gpu-class>" \
  --miner.cloud_max_price 0.35

Local LLM plus Local GPU

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
  --netuid <NETUID> \
  --wallet.name miner \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --miner.llm_backend local \
  --miner.llm_endpoint http://localhost:8000/v1 \
  --miner.train_backend local

Validator

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py validator \
  --netuid <NETUID> \
  --wallet.name validator \
  --subtensor.chain_endpoint ws://127.0.0.1:9946 \
  --validator.timeout 1800 \
  --epoch_length 150

Repository Map

arc_subnet/
  protocol.py          ResearchProposal synapse and validation helpers
  reward.py            8-stage reward/scoring pipeline
  arc_client.py        Python wrapper around the Rust arc-cli
  llm_providers.py     LLM provider registry
  cloud_providers.py   GPU backend registry
  metrics.py           Prometheus metrics helpers

neurons/
  miner.py             Miner agent loop, training adapters, evidence bundling
  validator.py         Validator replay, scoring, and weight update logic
  production.py        Bittensor SDK wiring and CLI entry point
  domains/             Domain definitions and registry

crates/
  core/                Rust ARC store, SQLite DB, IPFS helpers, types
  cli/                 Rust arc-cli commands

docs/                  Public concept, demo, architecture, validation, FAQ
paper/                 Technical, economics, and verification papers
examples/              Proposal, evidence, and score JSON examples
assets/brand/          README banner, mark, and verification-flow diagram
scripts/               Smoke test and local helper scripts
tests/                 Python test suite

Roadmap

The short version:

Clarify the idea, limitations, and validation boundary.
Keep the mocked local proof easy to run.
Define the real-world experiments needed to test the idea.
Validate storage, replay, model, compute, and chain assumptions one at a time.
Add another calibrated domain only after the first one is understood.
Treat testnet planning as a later research milestone, not a launch promise.

See ROADMAP.md for the public roadmap.

Contributing

This project is best suited for builders who want to make the idea more real, not for users looking for a finished subnet.

Good first contributions:

run the fresh-clone rehearsal and report setup drift
validate one real integration path: artifact storage, Docker replay, rented GPU execution, remote model inference, or testnet weight-setting
compute and document frozen-surface hashes for the active nanochat domain
improve docs around adding a second real domain after nanochat
add tests before widening provider claims

The remaining integration gaps are listed in STATUS.md, ROADMAP.md, and .github/ISSUE_SEEDS.md. See CONTRIBUTING.md for development setup.

Security

This is a research prototype that executes training code and shells out to GPU, Docker, IPFS, and Bittensor tooling. Do not run it with production wallets, valuable hotkeys, or broad cloud permissions. See SECURITY.md for reporting guidance and operational cautions.

Acknowledgments

Andrej Karpathy - autoresearch and the No Priors discussion that framed autoresearch as workers searching for useful training-code commits.
Wei Dai - the proof-of-useful-work question that connected autoresearch to discovery/verification asymmetry.
Bittensor / Opentensor Foundation - the subnet and validator/miner incentive model this prototype targets.
AutoResearch-Chain (ARC) - the verification protocol that inspired the evidence and replay model.

License

MIT - see LICENSE.
