Proof of Autoresearch is a proof-of-concept for verified AI research mining.
The project explores a simple question:
Can a decentralized network reward people for discovering real improvements to AI training, while forcing every claim to be replayable and public?
This repository combines LLM-driven autoresearch, GPU experiment execution, content-addressed evidence bundles, statistical replay verification, and Bittensor-style miner/validator incentives. It is not a live subnet, not a mainnet-ready system, and not a profitability claim. It is a working research prototype intended to make the idea concrete enough for builders to inspect, run locally, critique, and extend.
Modern AI research has a hard coordination problem. Useful training discoveries are expensive to find, easy to lose, and often hard to verify after the fact. Labs run thousands of private experiments, publish only a small fraction of the results, and leave the rest as invisible institutional memory. Independent researchers can have strong ideas but limited access to compute, incentives, and credible ways to prove that their results are real.
The immediate origin of this project is Karpathy's discussion of autoresearch in the No Priors interview on code agents and the "loopy era" of AI. His point was not just that an LLM can write code. It was that parts of research can be refactored so the human is no longer the step-by-step bottleneck: define an objective, define the metric, define the boundaries of what can change, then let workers try ideas.
In that framing, autoresearch starts to look like a research operating system. Ideas enter a queue. Workers pull an idea, turn it into a code change, run the experiment, read the metric, and keep only what improves the result. The output is not a paper or a chat transcript. The output is a sequence of commits that change training code and can be checked.
Karpathy then pushed the idea further: what if the workers are not trusted? What if they are just a large pool of people or agents on the internet contributing compute? In the transcript, he describes a system where "instead of blocks, you have commits" and where the hard work is searching through many experiments to find commits that actually improve the model. The missing piece is verification: an untrusted pool can search, but a trusted verification pool has to replay candidate commits and decide whether they are real.
Wei Dai connected that framing to proof-of-useful-work: can proof-of-work be built on top of autoresearch? The reason it is plausible is the same asymmetry:
Discovery is expensive. Verification is cheaper once a candidate exists.
A miner may spend many GPU hours trying failed ideas. But after a promising diff exists, a validator can replay the same experiment under controlled conditions and check whether the metric really improved.
Proof of Autoresearch is an attempt to flesh out that missing system. What is the claim object? What evidence has to be attached? How do commits build on each other? How does a verifier replay work safely? How should statistical noise be handled? How could the results become a public archive instead of a private experiment log?
AutoResearch-Chain and Bittensor are possible infrastructure for exploring those questions. ARC supplies a verification vocabulary: frozen surfaces, searchable surfaces, evidence hashes, commit-reveal, and replayable claims. Bittensor supplies a miner/validator network shape that can score useful work. This repo is the experiment of making that idea concrete enough to inspect.
Most proof-of-work systems spend compute to produce scarce but otherwise meaningless artifacts. AI research already spends compute to search for useful artifacts: better training recipes, optimizer settings, architecture tweaks, data handling changes, and reproducible metric improvements.
The thesis is that a network can treat a verified training improvement as the unit of useful work.
| Role | What they do | What must be proven |
|---|---|---|
| Miner | Searches for a code diff that improves a frozen training task | The diff, config, logs, environment, and metrics are all captured |
| Validator | Replays the submitted experiment independently | The claimed improvement survives replay and statistical tests |
| Network | Rewards verified improvements | Weights are based on replayed evidence, not trust |
| Public | Receives the output | A reusable archive of verified research artifacts |
If this works at scale, the output is not just emissions or leaderboard points. The output is an open corpus of training changes with evidence attached.
This repo is a public proof-of-concept for the protocol and subnet shape. The goal is to show the full logical path:
- Define a training domain with frozen files, searchable files, a metric, and a baseline.
- Let a miner propose and run a candidate training diff.
- Bundle the evidence as content-addressed artifacts.
- Verify a domain-bound commit-reveal proposal.
- Replay the claimed improvement.
- Score the result with statistical gates.
- Feed the verified score into validator weights.
The mocked smoke test exercises that whole path locally without GPU, IPFS, Docker, Bittensor, paid model APIs, or cloud compute providers. The production-facing code paths exist, but real multi-machine and testnet validation remain future work.
| Area | Current state |
|---|---|
| Protocol model | ResearchProposal synapse with 16 fields, evidence hashes, replay fields, and domain-bound commit-reveal |
| Evidence model | Five BLAKE3 artifact hashes: diff, config, environment manifest, training log, and metrics output |
| Miner logic | LLM diff loop, training execution adapters, evidence bundling, proposal construction, and local/remote execution paths |
| Validator logic | Commit-reveal checks, evidence validation, replay hooks, score computation, and weight-update path |
| Reward model | 8-stage scoring pipeline with replay delta, minimum-delta gate, paired bootstrap p-value, tolerance check, transfer bonus, ancestry decay, and clamping |
| Rust ARC CLI | SQLite-backed local store, domain/frontier/block helpers, commit/reveal tracking, and BLAKE3 content-addressed store utilities |
| Domains | nanochat-val_bpb-v1 is the active PoC domain; nanogpt-val_loss-v1 is present as a stub and not yet calibrated |
| Providers | Generic interfaces for local model servers, remote model APIs, local GPUs, and rented GPU workers; live provider validation is intentionally not claimed |
| Tests | 647 Python tests and 13 Rust tests |
| Coverage | 71% total over arc_subnet/ and neurons/ |
| CI/public repo prep | GitHub Actions workflow, issue templates, PR template, Dependabot config, security policy, citation file, examples, and release docs |
This distinction matters. The project should be read as a research prototype, not as a finished subnet.
| Not proven | Current boundary |
|---|---|
| Public subnet | No public Bittensor testnet subnet has been launched from this repo |
| Live weight setting | Validator permit and weight-setting behavior still need testnet validation |
| Real artifact network | IPFS upload/fetch needs real multi-machine rehearsal |
| Real validator replay | Docker replay needs a GPU validator rehearsal with hardened flags |
| Paid mining economics | A remote-model plus rented-GPU run needs documented cost, runtime, and sanitized logs |
| Provider breadth | Specific model and compute vendors need production wiring or live testing before being advertised as working |
| More domains | nanochat is the only active real domain today |
| Sustained discoveries | The project has not yet produced a sustained public run with statistically validated improvements beyond the calibrated baseline |
See STATUS.md for the current release framing.
This repository fleshes out the idea; it does not declare the system solved. Karpathy's autoresearch work showed that the training-loop search process can be partially automated. This repo explores the surrounding protocol questions: what evidence should be captured, how claims could be replayed, how scoring might work, and what a public archive of verified improvements could look like.
Important limitations:
- The smoke test is mocked. It proves the protocol shape, not real external operation.
- The active domain is intentionally small. Success on a small training script does not imply useful discoveries on frontier-scale systems.
- Reproducible GPU training is hard. Hardware variance, kernels, drivers, container images, and nondeterminism all need more work.
- The economic model is speculative until real mining cost, replay cost, improvement frequency, and validator behavior are measured.
- A live subnet could create bad incentives if scoring is weak, domains are too narrow, or validators cannot afford replay.
- Provider integrations are abstractions at this stage. This README avoids treating any specific commercial model or GPU vendor as endorsed or proven.
- The public-good claim depends on the archive being usable, reproducible, and open enough for researchers outside the network to benefit.
The point of publishing the PoC is to invite review of those assumptions.
A domain defines the optimization task. For example:
- seed codebase: a small training repository
- frozen surface: files miners cannot modify, such as evaluation logic
- search surface: files miners may modify, such as train.py
- metric: a scalar target such as validation bits-per-byte
- baseline: calibrated scores per hardware class
The domain is the public contract. Everyone knows what can change, what cannot change, and how success is measured.
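As a rough illustration of that contract, a domain can be modelled as a small immutable record plus a check that a diff only touches searchable files. This is a minimal sketch, not the project's actual schema: the `Domain` class, its field names, the `eval.py` file name, the hardware class label, and the baseline value are all hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass(frozen=True)
class Domain:
    """Hypothetical shape of a domain contract; field names are illustrative."""

    domain_id: str
    frozen_surface: tuple[str, ...]   # glob patterns miners may not modify
    search_surface: tuple[str, ...]   # glob patterns miners may modify
    metric: str                       # name of the scalar metric
    lower_is_better: bool             # direction of improvement
    baselines: dict[str, float] = field(default_factory=dict)  # hardware class -> score

    def diff_is_allowed(self, touched_files: list[str]) -> bool:
        """Allow a diff only if every touched file is searchable and not frozen."""
        for path in touched_files:
            if any(fnmatch(path, pattern) for pattern in self.frozen_surface):
                return False
            if not any(fnmatch(path, pattern) for pattern in self.search_surface):
                return False
        return True


# Example values are placeholders, not calibrated numbers from the project.
domain = Domain(
    domain_id="nanochat-val_bpb-v1",
    frozen_surface=("eval.py",),
    search_surface=("train.py",),
    metric="val_bpb",
    lower_is_better=True,
    baselines={"example-gpu-class": 1.0},
)
print(domain.diff_is_allowed(["train.py"]))   # touches only the search surface
print(domain.diff_is_allowed(["eval.py"]))    # touches the frozen surface
```

Rejecting anything outside the search surface, instead of only rejecting frozen files, keeps the contract closed by default.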
A miner uses an LLM agent and GPU backend to search for improvements. A single iteration can look like this:
- Pull the current frontier for a domain.
- Ask an LLM to propose a training-code diff.
- Apply the diff to the searchable surface.
- Run training locally or on a rented GPU.
- Parse the metric.
- Keep promising results and discard failures.
Most attempts should fail. That is the point. The network pays for verified discoveries, not for every experiment someone runs.
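The iteration above can be sketched as a loop that keeps only candidates that beat the current best metric. This is an illustrative sketch under stated assumptions: `mine`, `propose_diff`, and `run_training` are hypothetical names, the LLM and GPU calls are replaced by injected callables, and the metric is assumed to be lower-is-better (as with validation bits-per-byte).

```python
from typing import Callable, Optional


def mine(
    frontier_metric: float,
    propose_diff: Callable[[float], str],
    run_training: Callable[[str], Optional[float]],
    iterations: int,
) -> list[tuple[str, float]]:
    """Run a fixed number of search iterations and return only improving candidates."""
    kept: list[tuple[str, float]] = []
    best = frontier_metric
    for _ in range(iterations):
        diff = propose_diff(best)      # stand-in for the LLM proposing a diff
        metric = run_training(diff)    # stand-in for the GPU run; None means the run failed
        if metric is None or metric >= best:
            continue                   # failed or non-improving attempts are discarded
        best = metric
        kept.append((diff, metric))
    return kept


# Deterministic stubs in place of a real model and a real training run.
fake_results = iter([1.02, None, 0.98, 0.99, 0.97])
kept = mine(
    frontier_metric=1.00,
    propose_diff=lambda best: f"diff-against-{best}",
    run_training=lambda diff: next(fake_results),
    iterations=5,
)
print(kept)
```

In this stubbed run, three of the five attempts are thrown away, which matches the expectation that most experiments fail.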
Before a miner announces a proposal, it bundles evidence:
- the exact diff
- experiment config
- environment manifest
- training log
- metrics output
Each artifact is hashed with BLAKE3. In the intended live system, the artifacts are uploaded before the proposal is announced so validators can fetch and rehash them.
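A minimal sketch of bundling and rehashing follows. It assumes the third-party `blake3` Python package is installed; if it is not, the sketch falls back to BLAKE2b from the standard library purely so the example runs. BLAKE2b is a different hash and its digests will not match BLAKE3. The artifact keys and the function names `bundle_evidence` and `verify_evidence` are hypothetical, not the repository's API.

```python
import hashlib

try:
    from blake3 import blake3  # third-party package: pip install blake3

    def content_hash(data: bytes) -> str:
        return blake3(data).hexdigest()

except ImportError:
    # Fallback for illustration only: BLAKE2b is NOT BLAKE3 and yields different digests.
    def content_hash(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=32).hexdigest()


# The five artifact kinds named in the text; the key names are illustrative.
ARTIFACTS = ("diff", "config", "environment", "training_log", "metrics")


def bundle_evidence(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Hash every required artifact; refuse to build an incomplete bundle."""
    missing = [name for name in ARTIFACTS if name not in artifacts]
    if missing:
        raise ValueError(f"missing artifacts: {missing}")
    return {name: content_hash(artifacts[name]) for name in ARTIFACTS}


def verify_evidence(artifacts: dict[str, bytes], claimed: dict[str, str]) -> bool:
    """Validator side: rehash the fetched artifacts and compare with the claimed hashes."""
    return bundle_evidence(artifacts) == claimed


artifacts = {name: f"contents of {name}".encode("utf-8") for name in ARTIFACTS}
hashes = bundle_evidence(artifacts)
tampered = dict(artifacts, training_log=b"edited after the fact")
print(verify_evidence(artifacts, hashes), verify_evidence(tampered, hashes))
```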
The miner commits before revealing the full proposal:
sha3_256(diff_hash + salt + domain_id)
Binding the commit to domain_id prevents a valid improvement from one domain being replayed against another domain's baseline. This was one of the important runtime mismatches corrected during public-release preparation.
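The commit formula can be sketched with the standard library. The encoding is an assumption: this sketch concatenates the three fields as UTF-8 strings in the order shown in the formula, which may differ from the repository's byte layout. The function names and the placeholder diff hash are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_commit(diff_hash: str, salt: str, domain_id: str) -> str:
    """sha3_256(diff_hash + salt + domain_id), assuming UTF-8 string concatenation."""
    payload = (diff_hash + salt + domain_id).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


def check_reveal(commit: str, diff_hash: str, salt: str, domain_id: str) -> bool:
    """Recompute the commit from the revealed fields and compare in constant time."""
    return hmac.compare_digest(commit, make_commit(diff_hash, salt, domain_id))


diff_hash = "ab" * 32              # placeholder 64-character hex digest
salt = secrets.token_hex(16)       # fresh random salt, kept secret until reveal
commit = make_commit(diff_hash, salt, "nanochat-val_bpb-v1")

# The reveal checks out only against the domain it was committed to.
print(check_reveal(commit, diff_hash, salt, "nanochat-val_bpb-v1"))
print(check_reveal(commit, diff_hash, salt, "nanogpt-val_loss-v1"))
```

Plain concatenation is unambiguous only when the leading fields have a fixed length, as a hex digest and a fixed-size salt do here.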
Validators fetch evidence, rehash artifacts, rebuild the experiment, and replay the candidate under controlled conditions. The PoC scoring path uses K=5 replays and a paired bootstrap significance test.
A proposal is only useful if the replayed result supports the claim.
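One common way to run a paired bootstrap over K replays is to resample the per-replay deltas with replacement and count how often the resampled mean shows no improvement. The sketch below illustrates that approach and is not necessarily the test the repository implements. The function name, the resample count, and the example metric values are assumptions, and the metric is taken to be lower-is-better.

```python
import random


def paired_bootstrap_p_value(
    baseline: list[float],
    candidate: list[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for "candidate improves on baseline" over paired replays."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need equally sized, non-empty paired samples")
    deltas = [b - c for b, c in zip(baseline, candidate)]  # positive = improvement
    rng = random.Random(seed)                              # seeded for reproducibility
    k = len(deltas)
    not_improved = 0
    for _ in range(n_resamples):
        resample = [deltas[rng.randrange(k)] for _ in range(k)]
        if sum(resample) / k <= 0:
            not_improved += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (not_improved + 1) / (n_resamples + 1)


# Made-up metric values for K=5 paired replays.
baseline_runs = [1.000, 1.002, 0.999, 1.001, 1.000]
improved_runs = [0.990, 0.991, 0.989, 0.992, 0.990]
noisy_runs = [1.010, 0.990, 1.020, 0.980, 1.000]

p_improved = paired_bootstrap_p_value(baseline_runs, improved_runs)
p_noisy = paired_bootstrap_p_value(baseline_runs, noisy_runs)
print(p_improved, p_noisy)
```

With only five pairs the resampled distribution is coarse, which is one reason a significance gate is paired with a minimum-delta threshold instead of being used alone.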
The validator score is not just "metric went up." It passes through gates:
- replay delta must be positive
- improvement must clear the minimum-delta threshold
- bootstrap p-value must be significant
- claimed delta and replayed delta must be within tolerance
- score is scaled with diminishing returns
- transfer and ancestry logic can adjust reward
- final score is clamped into [0, 1]
In a live subnet, validator weights would be set from those verified scores.
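The gates listed above can be sketched as a chain of early exits followed by a saturating scale. Every threshold here (`min_delta`, `alpha`, `tolerance`, `scale`) is a hypothetical placeholder, the exponential curve is just one possible diminishing-returns function, and the transfer and ancestry adjustments are left out because the text does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplayResult:
    claimed_delta: float    # improvement the miner claimed
    replayed_delta: float   # improvement the validator measured on replay
    p_value: float          # bootstrap p-value from the replays


def score(
    result: ReplayResult,
    min_delta: float = 0.001,   # hypothetical minimum-delta threshold
    alpha: float = 0.05,        # hypothetical significance level
    tolerance: float = 0.5,     # hypothetical allowed relative gap between claim and replay
    scale: float = 0.01,        # hypothetical scale of the diminishing-returns curve
) -> float:
    """Apply the gates in order; any failed gate yields a score of zero."""
    if result.replayed_delta <= 0:
        return 0.0
    if result.replayed_delta < min_delta:
        return 0.0
    if result.p_value >= alpha:
        return 0.0
    gap = abs(result.claimed_delta - result.replayed_delta)
    if gap > tolerance * abs(result.claimed_delta):
        return 0.0
    raw = 1.0 - math.exp(-result.replayed_delta / scale)   # diminishing returns
    return max(0.0, min(1.0, raw))                         # clamp into [0, 1]


honest = score(ReplayResult(claimed_delta=0.01, replayed_delta=0.01, p_value=0.001))
regressed = score(ReplayResult(claimed_delta=0.01, replayed_delta=-0.002, p_value=0.001))
not_significant = score(ReplayResult(claimed_delta=0.01, replayed_delta=0.01, p_value=0.2))
overclaimed = score(ReplayResult(claimed_delta=0.05, replayed_delta=0.01, p_value=0.001))
print(honest, regressed, not_significant, overclaimed)
```

Ordering the cheap checks first means a validator can reject most bad proposals before spending any effort on scaling or adjustments.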
The long-term output should be a public archive of:
- proposed diffs
- evidence bundles
- replay results
- hardware class metadata
- validator decisions
- domain frontiers
That archive is what makes this more than a mining game.
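To make the archive contents concrete, one record per verified proposal could be serialized as a JSON line. This is a sketch of one possible shape, not the project's schema: the `ArchiveRecord` class, its field names, and all example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ArchiveRecord:
    """Hypothetical archive entry combining the items listed above."""

    domain_id: str
    diff_hash: str
    evidence_hashes: dict[str, str]
    replay_metrics: list[float]
    hardware_class: str
    accepted: bool
    score: float


def to_json_line(record: ArchiveRecord) -> str:
    """Serialize with sorted keys so identical records produce identical text."""
    return json.dumps(asdict(record), sort_keys=True)


record = ArchiveRecord(
    domain_id="nanochat-val_bpb-v1",
    diff_hash="ab" * 32,                       # placeholder digest
    evidence_hashes={"diff": "ab" * 32},       # truncated for brevity
    replay_metrics=[0.990, 0.991, 0.989, 0.992, 0.990],
    hardware_class="example-gpu-class",
    accepted=True,
    score=0.63,
)
line = to_json_line(record)
print(line)
```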
If the system works, it could create public infrastructure for AI research in the same way package registries created public infrastructure for software.
Every rewarded claim would need artifacts, hashes, logs, configs, and replay results. That pushes the culture toward evidence-first research instead of unverifiable claims.
Small improvements often disappear inside private runs. A public network could turn those improvements into open records: code diffs with measured effects. Researchers, students, labs, and independent builders could inspect and reuse them.
There is already a huge amount of undercoordinated GPU experimentation. A verified market could direct some of that compute toward reusable public research artifacts instead of private trial-and-error.
Independent researchers often lack the credibility layer that institutions provide. Replay verification can make the evidence portable: the work stands or falls by whether validators can reproduce it.
The system could help define a more rigorous version of proof-of-useful-work: not "I spent compute," but "I spent compute and produced a verified, reusable improvement."
The first reward path focuses on successful improvements. A mature system could also preserve failed experiments, near misses, and negative results. That would help future agents avoid rediscovering the same dead ends.
This would not replace deep theory, peer review, or human taste. It would be a research substrate: a way to generate, verify, and publish small empirical training improvements continuously.
This runs the mocked end-to-end path. It needs no GPU, no IPFS daemon, no Docker, no Bittensor node, and no paid API keys.
git clone https://github.com/claudlos/proof-of-autoresearch.git
cd proof-of-autoresearch
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
python scripts/smoke_test.py

Expected result:
ALL CHECKS PASSED
See docs/DEMO.md for the full walkthrough and exactly what is mocked.
Current local validation for the public proof-of-concept candidate:
| Check | Status |
|---|---|
| Python package install | Passing |
| Ruff lint | Passing |
| Mypy | Passing on arc_subnet/ and neurons/ |
| Python tests | 647 passing |
| Python coverage | 71% total |
| Smoke test | 7/7 mocked checks passing |
| Fresh release-copy install | Passing with a new venv from a temporary copy |
| Rust format | Passing |
| Rust build | Passing |
| Rust tests | 13 passing |
| Rust clippy | Passing |
| Markdown local links | Passing |
| Example JSON | Passing |
| SVG XML | Passing |
| Public-candidate secret scan | No real secrets found |
Validation commands:
python -m ruff check arc_subnet/ neurons/ tests/ scripts/
python -m mypy arc_subnet/ neurons/ --ignore-missing-imports
python scripts/smoke_test.py
python -m pytest tests/ -q --tb=short
python -m pytest tests/ --cov=arc_subnet --cov=neurons --cov-report=term-missing -q --tb=short
cargo fmt --all --check
cargo build --release --workspace
cargo test --workspace
cargo clippy --workspace -- -D warnings

See docs/VALIDATION.md for the validation report.
These commands are for future integration work. They are not proof of public testnet or mainnet readiness, and the placeholder backend names are not meant to advertise any specific vendor.
REMOTE_MODEL_API_KEY="<remote-model-api-key>" \
GPU_RENTAL_API_KEY="<gpu-rental-api-key>" \
ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
--netuid <NETUID> \
--wallet.name miner \
--subtensor.chain_endpoint ws://127.0.0.1:9946 \
--miner.llm_backend <remote-model-backend> \
--miner.train_backend <cloud-gpu-backend> \
--miner.cloud_gpu "<gpu-class>" \
--miner.cloud_max_price 0.35

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py miner \
--netuid <NETUID> \
--wallet.name miner \
--subtensor.chain_endpoint ws://127.0.0.1:9946 \
--miner.llm_backend local \
--miner.llm_endpoint http://localhost:8000/v1 \
--miner.train_backend local

ARC_STORE_DIR="$(pwd)/arc_store" \
python neurons/production.py validator \
--netuid <NETUID> \
--wallet.name validator \
--subtensor.chain_endpoint ws://127.0.0.1:9946 \
--validator.timeout 1800 \
--epoch_length 150

arc_subnet/
  protocol.py         ResearchProposal synapse and validation helpers
  reward.py           8-stage reward/scoring pipeline
  arc_client.py       Python wrapper around the Rust arc-cli
  llm_providers.py    LLM provider registry
  cloud_providers.py  GPU backend registry
  metrics.py          Prometheus metrics helpers
neurons/
  miner.py            Miner agent loop, training adapters, evidence bundling
  validator.py        Validator replay, scoring, and weight update logic
  production.py       Bittensor SDK wiring and CLI entry point
domains/              Domain definitions and registry
crates/
  core/               Rust ARC store, SQLite DB, IPFS helpers, types
  cli/                Rust arc-cli commands
docs/                 Public concept, demo, architecture, validation, FAQ
paper/                Technical, economics, and verification papers
examples/             Proposal, evidence, and score JSON examples
assets/brand/         README banner, mark, and verification-flow diagram
scripts/              Smoke test and local helper scripts
tests/                Python test suite
The short version:
- Clarify the idea, limitations, and validation boundary.
- Keep the mocked local proof easy to run.
- Define the real-world experiments needed to test the idea.
- Validate storage, replay, model, compute, and chain assumptions one at a time.
- Add another calibrated domain only after the first one is understood.
- Treat testnet planning as a later research milestone, not a launch promise.
See ROADMAP.md for the public roadmap.
This project is best suited for builders who want to make the idea more real, not for users looking for a finished subnet.
Good first contributions:
- run the fresh-clone rehearsal and report setup drift
- validate one real integration path: artifact storage, Docker replay, rented GPU execution, remote model inference, or testnet weight-setting
- compute and document frozen-surface hashes for the active nanochat domain
- improve docs around adding a second real domain after nanochat
- add tests before widening provider claims
The remaining integration gaps are listed in STATUS.md, ROADMAP.md, and .github/ISSUE_SEEDS.md. See CONTRIBUTING.md for development setup.
This is a research prototype that executes training code and shells out to GPU, Docker, IPFS, and Bittensor tooling. Do not run it with production wallets, valuable hotkeys, or broad cloud permissions. See SECURITY.md for reporting guidance and operational cautions.
- Demo - mocked smoke-test walkthrough
- Concept - short explanation of the verification asymmetry
- Vision - why the idea could matter as public research infrastructure
- Architecture - system components and extension notes
- Validation Report - commands and current results
- Status - current proof-of-concept state and known gaps
- Roadmap - public path for testing the idea beyond the PoC
- Examples - proposal, evidence bundle, and validator score shapes
- Resource Guide - related projects, papers, and tools
- Technical Paper - detailed implementation paper
- Economics Paper - cost model and incentive discussion
- Verification Protocol - ARC replay protocol details
- Andrej Karpathy - autoresearch and the No Priors discussion that framed autoresearch as workers searching for useful training-code commits.
- Wei Dai - the proof-of-useful-work question that connected autoresearch to discovery/verification asymmetry.
- Bittensor / Opentensor Foundation - the subnet and validator/miner incentive model this prototype targets.
- AutoResearch-Chain (ARC) - the verification protocol that inspired the evidence and replay model.
MIT - see LICENSE.
