Technical architecture document for the Proof of Autoresearch ARC-Bittensor proof of concept.
Implementation counts and live-readiness details can drift; see ../STATUS.md for the current release status.
Proof of Autoresearch is an ARC-Bittensor proof of concept for a Bittensor subnet where miners discover training recipe improvements through LLM-guided autoresearch, and validators verify those improvements through deterministic replay and statistical testing.
| Component | File | Lines | Purpose |
|---|---|---|---|
| Miner | neurons/miner.py | 2,155 | Autoresearch loop: LLM -> diff -> train -> evidence |
| Validator | neurons/validator.py | 1,059 | Docker replay, bootstrap test, scoring |
| Production | neurons/production.py | 663 | Bittensor v10 SDK wiring, CLI entry point |
| Protocol | arc_subnet/protocol.py | 359 | ResearchProposal synapse (16 fields, 64+ hw classes) |
| Reward | arc_subnet/reward.py | 405 | 8-stage scoring pipeline |
| Cloud Providers | arc_subnet/cloud_providers.py | 590 | Abstraction for 4 GPU cloud providers |
| LLM Providers | arc_subnet/llm_providers.py | 468 | Abstraction for 8 LLM backends |
| ARC Client | arc_subnet/arc_client.py | 408 | Rust arc-cli bridge for ARC protocol DB |
| Metrics | arc_subnet/metrics.py | 163 | Prometheus /metrics endpoint |
| Domain Base | neurons/domains/base.py | 139 | DomainBase ABC + MetricSpec |
| Domain Registry | neurons/domains/registry.py | 81 | Auto-discovery domain registry |
| Nanochat Domain | neurons/domains/nanochat_v1.py | 163 | nanochat-val_bpb-v1 config + calibration |
| NanoGPT Domain | neurons/domains/nanogpt_v1.py | 104 | nanogpt-val_loss-v1 config (stub) |
| Rust CLI | crates/cli/ | 702 | 12 subcommands, BLAKE3 CAS |
| Rust Core | crates/core/ | 923 | SQLite DB + blob store + IPFS |
┌──────────────────────┐
│ Model Backend │
│ (local or remote; │
│ registry broader) │
└──────────┬───────────┘
│ 1. Generate diff
▼
┌─────────────┐ extract ┌──────────────────┐ apply diff ┌─────────────────┐
│ BLAKE3 CAS │──────────────▶│ Seed Workspace │────────────────▶│ Modified │
│ (arc_store/)│ │ (nanochat/) │ │ Workspace │
└─────────────┘ └──────────────────┘ └────────┬────────┘
│
2. Train │
▼
┌────────────────┐
│ GPU Training │
│ local/rented │
│ + registry │
└────────┬────────┘
│
3. Parse │
metrics │
▼
┌─────────────────┐ 5. Submit ┌──────────────────┐ 4. Bundle ┌──────────────┐
│ Bittensor │◀──────────────│ ResearchProposal│◀──────────────│ Evidence │
│ (Subtensor) │ via axon │ (16 fields) │ BLAKE3 hash │ (diff, log, │
└─────────────────┘ └──────────────────┘ │ metrics, │
│ config, env)│
└──────────────┘
┌─────────────────┐ 1. Query ┌──────────────────┐
│ Validator │──────────────▶│ Miner (axon) │
│ (dendrite) │◀──────────────│ │
└────────┬─────────┘ proposal └──────────────────┘
│
│ 2. Verify commit-reveal
│ SHA3-256(diff_hash + salt + domain_id) == commit_hash
│
│ 3. Fetch & verify evidence
│ IPFS multi-gateway -> BLAKE3 re-hash -> compare
│
▼
┌────────────────────────────────────────────────────────┐
│ Docker Replay (K=5 runs) │
│ │
│ docker run --rm --gpus all --network=none │
│ --read-only --cap-drop=ALL --memory=16g │
│ --pids-limit=512 --security-opt=no-new-privileges │
│ -e CUBLAS_WORKSPACE_CONFIG=:4096:8 │
│ arc-replay:latest timeout 420 uv run train.py │
│ │
│ Runs baseline (or uses cache) + improved version │
│ Collects K pairs of (baseline_bpb, improved_bpb) │
└────────────────────┬───────────────────────────────────┘
│
│ 4. Paired bootstrap significance test
│ N=10000 resamples, p < 0.05
│
▼
┌────────────────────────────────────────────────────────┐
│ 8-Stage Scoring Pipeline │
│ │
│ 1. Replay gate: replay_delta > 0 │
│ 2. Min-delta gate: replay_delta > 0.0005 │
│ 3. Significance: p-value < 0.05 │
│ 4. Tolerance: claimed ~= replay (within 5%) │
│ 5. Base score: sqrt(delta / SCORE_CEILING) │
│ 6. Transfer bonus: up to 2x for larger-scale confirm │
│ 7. Ancestry decay: 0.95^depth │
│ 8. Clamp: [0.0, 1.0] │
└────────────────────┬───────────────────────────────────┘
│
│ 5. Set weights on chain
▼
┌──────────────┐
│ Subtensor │
│ Yuma │
│ Consensus │
└──────────────┘
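The eight scoring stages in the diagram above can be read as one short pure function. This is a minimal sketch, not the code in arc_subnet/reward.py: the function and argument names are hypothetical, "within 5%" is assumed to mean relative error against the replayed delta, and score_ceiling is a required argument because this document does not state the value of SCORE_CEILING.

```python
import math


def score_proposal(
    claimed_delta: float,
    replay_delta: float,
    p_value: float,
    depth: int,
    score_ceiling: float,
    transfer_multiplier: float = 1.0,
    min_delta: float = 0.0005,
    alpha: float = 0.05,
    tolerance: float = 0.05,
    ancestry_decay: float = 0.95,
) -> float:
    """Hypothetical walk through the 8 stages; returns a score in [0.0, 1.0]."""
    # 1. Replay gate: the replay must show an improvement at all.
    if replay_delta <= 0.0:
        return 0.0
    # 2. Min-delta gate: ignore improvements at or below the noise floor.
    if replay_delta <= min_delta:
        return 0.0
    # 3. Significance: bootstrap p-value must be below alpha.
    if p_value >= alpha:
        return 0.0
    # 4. Tolerance: claimed delta must match the replay
    #    (assumed: relative error against the replayed delta).
    if abs(claimed_delta - replay_delta) > tolerance * replay_delta:
        return 0.0
    # 5. Base score: square-root scaling against the ceiling.
    score = math.sqrt(replay_delta / score_ceiling)
    # 6. Transfer bonus: multiplier limited to the range [1, 2].
    score *= min(max(transfer_multiplier, 1.0), 2.0)
    # 7. Ancestry decay: 0.95 per level of depth.
    score *= ancestry_decay ** depth
    # 8. Clamp to [0.0, 1.0].
    return min(max(score, 0.0), 1.0)
```

Returning 0.0 at the first failed gate keeps the gates cheap to reason about: a proposal only reaches the scaling stages once every check has passed.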
┌──────────────────────────────────────────────────────────────────────┐
│ Bittensor Subtensor │
│ (blockchain: registrations, weights, emissions, Yuma Consensus) │
└───────────────────────────┬──────────────────────────────────────────┘
│
┌────────────┴────────────┐
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ Validator │ │ Miner │
│ │ │ │
│ • dendrite │─query──▶ • axon:8091 │
│ • Docker replay│◀─resp──│ • LLM backend │
│ • bootstrap │ │ • GPU backend │
│ • scoring │ │ • auto-tuner │
│ • weight-set │ │ • evidence │
│ │ │ │
│ :9102/metrics │ │ :9101/metrics │
└───────┬───────┘ └───────┬───────┘
│ │
│ │
┌───────┴───────┐ ┌───────┴───────┐
│ IPFS │ │ GPU Worker │
│ (evidence │ │ local/rented │
│ artifacts) │ │ registry │
│ │ │ wider │
└───────────────┘ └───────────────┘
neurons/production.py -- CLI entry point
├── neurons/miner.py -- ARCMiner class
│ ├── arc_subnet/protocol.py -- ResearchProposal synapse
│ ├── arc_subnet/llm_providers.py -- model backend abstraction
│ ├── arc_subnet/cloud_providers.py -- GPU backend abstraction
│ ├── arc_subnet/arc_client.py -- ARC protocol bridge
│ ├── arc_subnet/metrics.py -- Prometheus metrics
│ └── neurons/domains/registry.py -- domain auto-discovery
│ ├── neurons/domains/base.py -- DomainBase ABC
│ ├── neurons/domains/nanochat_v1.py -- nanochat domain
│ └── neurons/domains/nanogpt_v1.py -- nanoGPT domain
│
└── neurons/validator.py -- ARCValidator class
├── arc_subnet/protocol.py -- (shared)
├── arc_subnet/reward.py -- 8-stage scoring pipeline
├── arc_subnet/arc_client.py -- (shared)
├── arc_subnet/metrics.py -- (shared)
└── neurons/domains/registry.py -- (shared)
crates/cli/ -- Rust arc-cli (standalone binary)
└── crates/core/ -- BLAKE3 store, SQLite DB, IPFS client
Decision: Verify improvements by replaying training inside Docker containers rather than using cryptographic proofs of compute.
Why: No widely accepted cryptographic proof of learning exists as of 2026. Deterministic replay with CUBLAS_WORKSPACE_CONFIG=:4096:8 provides reproducibility at 5-20% throughput cost. This is the same approach used by Bittensor SN9 and SN62.
Trade-off: Validators must have GPUs. This increases validator cost but provides ground-truth verification.
Decision: Use a paired bootstrap test (K=5, N=10000, p<0.05) instead of simpler threshold-based scoring.
Why: Training has inherent randomness, so a single-run delta could be noise. The bootstrap test provides a principled way to distinguish signal from variance. The paired design (same seeds, baseline vs. improved) controls for run-to-run variance. Calibration on an RTX 3050 Ti showed a run-to-run stdev of 0.000180, roughly a third of the 0.0005 min-delta gate, which suggests the test has enough power to detect improvements at that threshold.
Trade-off: K=5 replays cost ~$0.11 per validation on a 4090. Lazy validation (caching baselines) reduces this to ~$0.055.
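As an illustration of the paired bootstrap, the sketch below resamples the K per-seed deltas with replacement and reports the fraction of resamples whose mean delta is not an improvement. This is one common one-sided formulation, not necessarily the exact statistic computed in neurons/validator.py; the function name and the fixed RNG seed are assumptions made for reproducibility.

```python
import random
from typing import Sequence


def paired_bootstrap_pvalue(
    baseline: Sequence[float],
    improved: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for 'improved is lower than baseline' (minimize metric)."""
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need equally sized, non-empty paired samples")
    # Pair i uses the same seed for both runs, so the difference
    # cancels most of the run-to-run variance.
    deltas = [b - i for b, i in zip(baseline, improved)]
    k = len(deltas)
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        resample_sum = sum(deltas[rng.randrange(k)] for _ in range(k))
        if resample_sum <= 0.0:  # mean delta <= 0 means no improvement
            not_better += 1
    return not_better / n_resamples
```

With K=5 there are only 5^5 = 3125 distinct resamples, so the p-value is coarse; a proposal passes the p < 0.05 gate only when nearly every resample shows a positive mean delta.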
Decision: Explicitly define which files miners can and cannot modify.
Why: If miners could modify the evaluation code (prepare.py), they could game the metric. The frozen surface (data prep, evaluation) is hash-locked; validators reject any diff that touches frozen files. The search surface (train.py hyperparameters and architecture) is where improvements happen.
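A minimal sketch of the surface check, assuming miners submit unified diffs with ---/+++ file headers. The helper names are hypothetical, and the real validator may parse diffs differently and additionally re-hash frozen files after applying the diff.

```python
from typing import Iterable, Mapping, Set


def touched_files(diff_text: str) -> Set[str]:
    """Collect the paths named in the ---/+++ headers of a unified diff."""
    paths = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t", 1)[0].strip()
            if path == "/dev/null":  # file creation or deletion marker
                continue
            if path.startswith(("a/", "b/")):  # git-style prefixes
                path = path[2:]
            paths.add(path)
    return paths


def diff_is_allowed(
    diff_text: str,
    search_surface: Iterable[str],
    frozen_surface: Mapping[str, str],
) -> bool:
    """Accept only diffs that stay inside the search surface."""
    allowed = set(search_surface)
    files = touched_files(diff_text)
    if not files:
        return False  # nothing recognisable to apply
    return all(f in allowed and f not in frozen_surface for f in files)
```

The check is deliberately conservative: a removed line that itself begins with "-- " looks like a file header and can cause a false rejection, which is the safer failure mode for a gate like this.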
Decision: Apply diffs using clean git apply -> whitespace-fix git apply -> fuzzy line-matching fallback.
Why: LLMs frequently produce diffs with wrong line numbers or minor whitespace differences. The fuzzy fallback finds lines by stripped content and preserves original indentation, giving a ~95% success rate for LLM-generated diffs.
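The last fallback stage can be sketched as a single-line replacement that ignores the line numbers in the diff. The function below is a hypothetical illustration of "match by stripped content, keep the original indentation", not the code in neurons/miner.py, and it handles one changed line at a time.

```python
def fuzzy_replace_line(source: str, old_line: str, new_line: str) -> str:
    """Replace the first line whose stripped content equals old_line.

    The replacement keeps the indentation found in the source, so a diff
    with wrong whitespace or wrong line numbers can still be applied.
    """
    target = old_line.strip()
    out = []
    replaced = False
    for line in source.splitlines():
        if not replaced and line.strip() == target:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + new_line.strip())
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise ValueError(f"no line matching {target!r}")
    trailing_newline = "\n" if source.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

For example, a diff line that arrives tab-indented still lands with the four-space indentation already used in the file.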
Decision: commit_hash = SHA3-256(diff_hash || salt || domain_id)
Why: Prevents front-running (commit before reveal) and cross-domain replay attacks (a valid nanochat improvement submitted against nanoGPT baseline). The domain_id binding was added after identifying this attack vector.
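A minimal sketch of the commit and reveal check using Python's hashlib. The byte encoding is an assumption: this document does not specify how diff_hash, salt, and domain_id are serialized before hashing, so the sketch feeds the UTF-8 hex string, the raw salt bytes, and the UTF-8 domain ID in that order. The function names are hypothetical.

```python
import hashlib
import hmac


def make_commit(diff_hash: str, salt: bytes, domain_id: str) -> str:
    """commit_hash = SHA3-256(diff_hash || salt || domain_id), hex-encoded."""
    h = hashlib.sha3_256()
    h.update(diff_hash.encode("utf-8"))  # assumed encoding
    h.update(salt)
    h.update(domain_id.encode("utf-8"))  # assumed encoding
    return h.hexdigest()


def verify_reveal(commit_hash: str, diff_hash: str, salt: bytes, domain_id: str) -> bool:
    """Recompute the commitment from the revealed values and compare."""
    expected = make_commit(diff_hash, salt, domain_id)
    return hmac.compare_digest(commit_hash, expected)
```

Because domain_id is part of the hashed input, a commitment made for nanochat-val_bpb-v1 fails verification when it is revealed against nanogpt-val_loss-v1.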
Decision: Kill experiments at 90s via power-law loss prediction and at 120s via direct measurement.
Why: ~60% of experiments are bad. Without early stopping, each takes 5 minutes. With it, bad experiments die in 90-120s, so each one costs roughly a third of a full run and more experiments fit in an epoch. The power-law predictor fits L(t) = a*t^(-alpha) + b to early loss values and predicts within ~1-2% of final loss.
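The power-law fit can be sketched with the standard library alone. For a fixed alpha the model L(t) = a*t^(-alpha) + b is linear in a and b, so the sketch scans a grid of alpha values and solves an ordinary least-squares line for each. The grid range and function names are assumptions, and the miner's actual fitting routine may differ.

```python
from typing import Optional, Sequence, Tuple


def fit_power_law(
    times: Sequence[float],
    losses: Sequence[float],
    alphas: Optional[Sequence[float]] = None,
) -> Tuple[float, float, float]:
    """Fit L(t) = a * t**(-alpha) + b; returns (a, alpha, b)."""
    if alphas is None:
        alphas = [i / 100 for i in range(5, 301)]  # assumed grid: 0.05 .. 3.00
    n = len(times)
    best = None
    for alpha in alphas:
        xs = [t ** (-alpha) for t in times]
        mean_x = sum(xs) / n
        mean_y = sum(losses) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0.0:
            continue
        # Closed-form least squares for the line y = a * x + b.
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses)) / sxx
        b = mean_y - a * mean_x
        sse = sum((a * x + b - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    if best is None:
        raise ValueError("need at least two distinct time points")
    _, a, alpha, b = best
    return a, alpha, b


def predict_loss(params: Tuple[float, float, float], t: float) -> float:
    """Extrapolate the fitted curve to time t."""
    a, alpha, b = params
    return a * t ** (-alpha) + b
```

A miner could fit on the losses logged during the first 90 seconds, call predict_loss at the full run length, and kill the run when the prediction is worse than the current baseline.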
Decision: Abstract domains behind DomainBase ABC with auto-discovery via registry.
Why: Different research tasks have different metrics, search surfaces, and baselines. The auto-discovery means adding a domain is just creating a new .py file in neurons/domains/. No registration boilerplate needed.
- Create a new file in neurons/domains/, e.g., neurons/domains/my_task_v1.py
- Implement the DomainBase interface:

      from neurons.domains.base import DomainBase, MetricSpec


      class MyTaskV1(DomainBase):
          domain_id = "mytask-metric_name-v1"
          metric = MetricSpec(
              name="metric_name",  # key in metrics.json
              direction="minimize",  # or "maximize"
              stdout_pattern=r"metric_name[=:](\d+\.?\d*)",  # regex to extract from stdout
          )
          seed_codebase_ref = "blake3hash..."  # BLAKE3 hash of seed tarball
          seed_scores = {
              "RTS-1-4090": 1.234,  # calibrated baseline per hardware class
          }
          frozen_surface = {
              "prepare.py": "blake3hash...",  # files that must NOT change
          }
          search_surface = ["train.py"]  # files miners CAN modify

- Stage the seed workspace:

      # Package your training codebase
      tar czf my_task_seed.tar.gz my_task/
      # Store in BLAKE3 CAS
      arc-cli store my_task_seed.tar.gz
      # Note the hash -- use it as seed_codebase_ref

- Calibrate baselines:

      # Current helper calibrates the nanochat seed on a local GPU.
      # New domains should add an equivalent domain-specific calibration helper.
      python scripts/calibrate_local.py --batch-size 4 --runs 3

- Compute frozen surface hashes:

      # BLAKE3 hash each frozen file
      arc-cli hash my_task/prepare.py

The domain will be auto-discovered by the registry on next startup.
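To show how the metric fields fit together, the sketch below extracts a metric from training stdout with stdout_pattern and turns direction into a signed improvement. MetricSpecSketch is a hypothetical stand-in that mirrors only the three fields shown above, not the real MetricSpec in neurons/domains/base.py, and taking the last match in stdout is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricSpecSketch:
    """Stand-in with the three MetricSpec fields used in this document."""
    name: str
    direction: str  # "minimize" or "maximize"
    stdout_pattern: str


def parse_metric(spec: MetricSpecSketch, stdout: str) -> Optional[float]:
    """Return the last value matched by stdout_pattern, or None."""
    matches = re.findall(spec.stdout_pattern, stdout)
    return float(matches[-1]) if matches else None


def improvement(spec: MetricSpecSketch, baseline: float, candidate: float) -> float:
    """Positive when the candidate is better than the baseline."""
    if spec.direction == "minimize":
        return baseline - candidate
    if spec.direction == "maximize":
        return candidate - baseline
    raise ValueError(f"unknown direction: {spec.direction!r}")
```

Normalizing the sign this way lets downstream gates such as replay_delta > 0 stay the same for minimize and maximize metrics.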
LLM providers live in arc_subnet/llm_providers.py. Each provider implements the same interface: take a prompt string, return a response string.
- Add a new class in llm_providers.py:

      class MyLLMBackend:
          def __init__(self, api_key: str, model: str = "default-model"):
              self.api_key = api_key
              self.model = model

          def generate(self, prompt: str, max_tokens: int = 4096) -> str:
              # Call your LLM API
              # Return the response text
              ...

- Register it in the provider factory (in the same file):

      # Assumes `import os` at the top of llm_providers.py
      LLM_BACKENDS["myllm"] = lambda: MyLLMBackend(
          api_key=os.environ.get("MYLLM_API_KEY", ""),
      )

- Use it:

      python neurons/production.py miner --miner.llm_backend myllm

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.
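For dry runs without network access, the same interface can be satisfied by an offline stand-in. The sketch below is hypothetical: LLMBackend, CannedBackend, BACKENDS, and get_backend are illustration names, not identifiers from arc_subnet/llm_providers.py.

```python
from typing import Callable, Dict, Protocol


class LLMBackend(Protocol):
    """The shared interface: prompt string in, response string out."""

    def generate(self, prompt: str, max_tokens: int = 4096) -> str: ...


class CannedBackend:
    """Offline stand-in that always returns the same response."""

    def __init__(self, response: str):
        self.response = response

    def generate(self, prompt: str, max_tokens: int = 4096) -> str:
        return self.response


# Factory registry keyed by backend name (hypothetical, mirrors LLM_BACKENDS).
BACKENDS: Dict[str, Callable[[], LLMBackend]] = {
    "canned": lambda: CannedBackend("--- a/train.py\n+++ b/train.py\n"),
}


def get_backend(name: str) -> LLMBackend:
    """Build a backend by name, with a readable error for typos."""
    try:
        factory = BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown LLM backend {name!r}; known: {sorted(BACKENDS)}") from None
    return factory()
```

Registering factories rather than instances means API keys are read only when a backend is actually selected.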
Cloud providers live in arc_subnet/cloud_providers.py. Each implements the GPU rental lifecycle: search -> create -> wait -> upload -> train -> download -> destroy.
- Add a new class in cloud_providers.py:

      class MyCloudProvider:
          def search_instances(self, gpu_model: str, max_price: float) -> list:
              """Search marketplace for available GPUs."""
              ...

          def create_instance(self, offer_id: str) -> str:
              """Rent an instance, return instance_id."""
              ...

          def wait_for_ssh(self, instance_id: str, timeout: int = 600) -> tuple:
              """Wait until SSH is ready, return (host, port)."""
              ...

          def run_training(self, instance_id: str, workspace: str) -> dict:
              """Upload workspace, run training, download results."""
              ...

          def destroy_instance(self, instance_id: str) -> None:
              """Tear down the instance."""
              ...

- Register it in the provider factory.
- Use it:

      python neurons/production.py miner --miner.train_backend mycloud

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.
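The rental lifecycle is easiest to get wrong on the failure path, where a crashed run can leave a paid instance running. The sketch below drives the five methods above and always destroys the instance. run_on_rented_gpu and FakeProvider are hypothetical names, and treating search_instances results as offer IDs is an assumption.

```python
from typing import Any, Dict, List


def run_on_rented_gpu(provider: Any, gpu_model: str, max_price: float, workspace: str) -> Dict:
    """search -> create -> wait -> train -> destroy, with guaranteed teardown."""
    offers = provider.search_instances(gpu_model, max_price)
    if not offers:
        raise RuntimeError(f"no {gpu_model} offers at or below {max_price}")
    instance_id = provider.create_instance(offers[0])  # assumes offers are offer IDs
    try:
        provider.wait_for_ssh(instance_id)
        return provider.run_training(instance_id, workspace)
    finally:
        # Runs on success and on any exception, so rentals never leak.
        provider.destroy_instance(instance_id)


class FakeProvider:
    """In-memory stand-in used only to exercise the lifecycle."""

    def __init__(self, fail_training: bool = False):
        self.fail_training = fail_training
        self.destroyed: List[str] = []

    def search_instances(self, gpu_model: str, max_price: float) -> list:
        return ["offer-1"]

    def create_instance(self, offer_id: str) -> str:
        return "inst-1"

    def wait_for_ssh(self, instance_id: str, timeout: int = 600) -> tuple:
        return ("127.0.0.1", 22)

    def run_training(self, instance_id: str, workspace: str) -> dict:
        if self.fail_training:
            raise RuntimeError("training crashed")
        return {"val_bpb": 1.234}

    def destroy_instance(self, instance_id: str) -> None:
        self.destroyed.append(instance_id)
```

Running the helper against FakeProvider(fail_training=True) raises the training error and still records the instance as destroyed.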
| Extension Point | Location | Description |
|---|---|---|
| New domain | neurons/domains/*.py | Add any ML training task |
| New LLM provider | arc_subnet/llm_providers.py | Add any LLM API |
| New cloud provider | arc_subnet/cloud_providers.py | Add any GPU rental service |
| New scoring gate | arc_subnet/reward.py | Add validation gates to scoring pipeline |
| New metric type | neurons/domains/base.py MetricSpec | Support maximize/minimize for any metric |
| Custom prompts | neurons/miner.py prompt construction | Modify LLM prompt strategy |
| Hardware classes | arc_subnet/protocol.py | Add GPU hardware class identifiers |
| ARC protocol | arc_subnet/arc_client.py + crates/ | Extend Rust protocol layer |
| Monitoring | arc_subnet/metrics.py | Add Prometheus counters/gauges |
┌────────────────────────────────────────────────────────────┐
│  Validator Docker Replay Container                         │
│                                                            │
│  --network=none                    No internet access      │
│  --read-only                       Immutable root FS       │
│  --tmpfs /tmp:size=2g              RAM-backed temp only    │
│  --cap-drop=ALL                    No Linux capabilities   │
│  --security-opt=no-new-privileges  No privilege escalation │
│  --memory=16g                      OOM protection          │
│  --pids-limit=512                  Fork bomb protection    │
│  NVIDIA_DRIVER_CAPABILITIES=compute only                   │
│  CUBLAS_WORKSPACE_CONFIG=:4096:8   Deterministic           │
│                                                            │
│  NVIDIA Container Toolkit >= 1.17.8                        │
│  (CVE-2025-23266 patched, CVSS 9.0)                        │
└────────────────────────────────────────────────────────────┘
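The sandbox flags can be assembled as an argument list for subprocess, which avoids shell quoting problems. This is a sketch under stated assumptions: replay_argv is a hypothetical helper, the tmpfs size uses Docker's size= mount option, and no workspace mount is included because this document does not show how the workspace reaches the container.

```python
from typing import List


def replay_argv(image: str = "arc-replay:latest", timeout_s: int = 420) -> List[str]:
    """Build the docker run command for one sandboxed replay."""
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--network=none",                      # no internet access
        "--read-only",                         # immutable root filesystem
        "--tmpfs", "/tmp:size=2g",             # RAM-backed scratch space
        "--cap-drop=ALL",                      # no Linux capabilities
        "--security-opt=no-new-privileges",    # no privilege escalation
        "--memory=16g",                        # OOM protection
        "--pids-limit=512",                    # fork bomb protection
        "-e", "NVIDIA_DRIVER_CAPABILITIES=compute",
        "-e", "CUBLAS_WORKSPACE_CONFIG=:4096:8",
        image,
        "timeout", str(timeout_s), "uv", "run", "train.py",
    ]
```

The list can be passed directly to subprocess.run(replay_argv(), check=True) on a host with Docker and the NVIDIA Container Toolkit installed.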
Evidence integrity chain:
Miner trains -> BLAKE3 hash artifacts -> commit SHA3-256(diff+salt+domain)
-> upload to IPFS -> reveal CIDs + salt
-> validator fetches from IPFS (4-gateway fallback)
-> re-hash with BLAKE3 -> compare against declared hashes
-> reject on ANY mismatch
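The fetch and compare steps of the chain can be sketched as follows. Gateways are modeled as plain callables and the hash function is injected, so the example runs without network access or third-party packages. SHA-256 is used here only as a stand-in for BLAKE3, which is not in the Python standard library, and the function names are hypothetical.

```python
import hashlib
from typing import Callable, Mapping, Optional, Sequence

Gateway = Callable[[str], bytes]  # takes a CID, returns the artifact bytes


def fetch_with_fallback(cid: str, gateways: Sequence[Gateway]) -> Optional[bytes]:
    """Try each gateway in order; return None when all of them fail."""
    for fetch in gateways:
        try:
            return fetch(cid)
        except OSError:
            continue  # gateway unreachable, try the next one
    return None


def verify_evidence(
    declared: Mapping[str, str],
    gateways: Sequence[Gateway],
    hash_hex: Callable[[bytes], str],
) -> bool:
    """Re-hash every artifact; reject on any missing blob or mismatch."""
    for cid, expected_hash in declared.items():
        blob = fetch_with_fallback(cid, gateways)
        if blob is None or hash_hex(blob) != expected_hash:
            return False
    return True


def sha256_hex(data: bytes) -> str:
    """Stand-in hash for the example; the document's chain uses BLAKE3."""
    return hashlib.sha256(data).hexdigest()
```

Fallback applies only to fetch failures: once a gateway returns bytes, a hash mismatch rejects the proposal rather than trying another gateway.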