Proof of Autoresearch Architecture

Technical architecture document for the Proof of Autoresearch ARC-Bittensor proof of concept.

Implementation counts and live-readiness details can drift; see ../STATUS.md for the current release status.

System Overview

Proof of Autoresearch is an ARC-Bittensor proof of concept for a Bittensor subnet in which miners discover training-recipe improvements through LLM-guided autoresearch and validators verify those improvements through deterministic replay and statistical testing.

Component Summary

Component	File	Lines	Purpose
Miner	neurons/miner.py	2,155	Autoresearch loop: LLM -> diff -> train -> evidence
Validator	neurons/validator.py	1,059	Docker replay, bootstrap test, scoring
Production	neurons/production.py	663	Bittensor v10 SDK wiring, CLI entry point
Protocol	arc_subnet/protocol.py	359	ResearchProposal synapse (16 fields, 64+ hw classes)
Reward	arc_subnet/reward.py	405	8-stage scoring pipeline
Cloud Providers	arc_subnet/cloud_providers.py	590	Abstraction for 4 GPU cloud providers
LLM Providers	arc_subnet/llm_providers.py	468	Abstraction for 8 LLM backends
ARC Client	arc_subnet/arc_client.py	408	Rust arc-cli bridge for ARC protocol DB
Metrics	arc_subnet/metrics.py	163	Prometheus /metrics endpoint
Domain Base	neurons/domains/base.py	139	DomainBase ABC + MetricSpec
Domain Registry	neurons/domains/registry.py	81	Auto-discovery domain registry
Nanochat Domain	neurons/domains/nanochat_v1.py	163	nanochat-val_bpb-v1 config + calibration
NanoGPT Domain	neurons/domains/nanogpt_v1.py	104	nanogpt-val_loss-v1 config (stub)
Rust CLI	crates/cli/	702	12 subcommands, BLAKE3 CAS
Rust Core	crates/core/	923	SQLite DB + blob store + IPFS

Data Flow

Mining Flow (Miner)

                          ┌──────────────────────┐
                          │   Model Backend      │
                          │   (local or remote;  │
                          │    registry broader) │
                          └──────────┬───────────┘
                                     │ 1. Generate diff
                                     ▼
┌─────────────┐    extract    ┌──────────────────┐    apply diff    ┌─────────────────┐
│ BLAKE3 CAS  │──────────────▶│  Seed Workspace  │────────────────▶│ Modified        │
│ (arc_store/)│               │  (nanochat/)     │                 │ Workspace       │
└─────────────┘               └──────────────────┘                 └────────┬────────┘
                                                                            │
                                                              2. Train      │
                                                                            ▼
                                                                   ┌────────────────┐
                                                                   │  GPU Training   │
                                                                   │  local/rented  │
                                                                   │  + registry    │
                                                                   └────────┬────────┘
                                                                            │
                                                              3. Parse      │
                                                                 metrics    │
                                                                            ▼
┌─────────────────┐   5. Submit   ┌──────────────────┐   4. Bundle   ┌──────────────┐
│  Bittensor      │◀──────────────│  ResearchProposal│◀──────────────│ Evidence     │
│  (Subtensor)    │   via axon    │  (16 fields)     │   BLAKE3 hash │ (diff, log,  │
└─────────────────┘               └──────────────────┘               │  metrics,    │
                                                                     │  config, env)│
                                                                     └──────────────┘

Validation Flow (Validator)

┌─────────────────┐   1. Query    ┌──────────────────┐
│  Validator       │──────────────▶│  Miner (axon)    │
│  (dendrite)      │◀──────────────│                  │
└────────┬─────────┘   proposal    └──────────────────┘
         │
         │ 2. Verify commit-reveal
         │    SHA3-256(diff_hash + salt + domain_id) == commit_hash
         │
         │ 3. Fetch & verify evidence
         │    IPFS multi-gateway -> BLAKE3 re-hash -> compare
         │
         ▼
┌────────────────────────────────────────────────────────┐
│  Docker Replay (K=5 runs)                              │
│                                                        │
│  docker run --rm --gpus all --network=none             │
│    --read-only --cap-drop=ALL --memory=16g             │
│    --pids-limit=512 --security-opt=no-new-privileges   │
│    -e CUBLAS_WORKSPACE_CONFIG=:4096:8                  │
│    arc-replay:latest timeout 420 uv run train.py       │
│                                                        │
│  Runs baseline (or uses cache) + improved version      │
│  Collects K pairs of (baseline_bpb, improved_bpb)      │
└────────────────────┬───────────────────────────────────┘
                     │
                     │ 4. Paired bootstrap significance test
                     │    N=10000 resamples, p < 0.05
                     │
                     ▼
┌────────────────────────────────────────────────────────┐
│  8-Stage Scoring Pipeline                              │
│                                                        │
│  1. Replay gate:     replay_delta > 0                  │
│  2. Min-delta gate:  replay_delta > 0.0005             │
│  3. Significance:    p-value < 0.05                    │
│  4. Tolerance:       claimed ~= replay (within 5%)     │
│  5. Base score:      sqrt(delta / SCORE_CEILING)       │
│  6. Transfer bonus:  up to 2x for larger-scale confirm │
│  7. Ancestry decay:  0.95^depth                        │
│  8. Clamp:           [0.0, 1.0]                        │
└────────────────────┬───────────────────────────────────┘
                     │
                     │ 5. Set weights on chain
                     ▼
              ┌──────────────┐
              │  Subtensor    │
              │  Yuma         │
              │  Consensus    │
              └──────────────┘
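
The eight scoring stages in the diagram can be sketched as a single function. This is an illustration only, not the code in arc_subnet/reward.py: the function name, the SCORE_CEILING value of 0.01, the reading of "within 5%" as a relative error, and the transfer bonus expressed as a multiplier in [1, 2] are all assumptions.

```python
import math

SCORE_CEILING = 0.01   # assumed value; the real constant lives in arc_subnet/reward.py
MIN_DELTA = 0.0005     # stage 2 threshold from the diagram
ALPHA = 0.05           # stage 3 significance level
TOLERANCE = 0.05       # stage 4, read here as 5% relative error (assumption)
ANCESTRY_DECAY = 0.95  # stage 7 per-generation decay


def score_proposal(claimed_delta, replay_delta, p_value, depth=0, transfer_multiplier=1.0):
    """Return a score in [0.0, 1.0]; any failed gate yields 0.0."""
    if replay_delta <= 0:                                 # 1. replay gate
        return 0.0
    if replay_delta <= MIN_DELTA:                         # 2. min-delta gate
        return 0.0
    if p_value >= ALPHA:                                  # 3. significance gate
        return 0.0
    if abs(claimed_delta - replay_delta) > TOLERANCE * abs(replay_delta):
        return 0.0                                        # 4. tolerance gate
    score = math.sqrt(replay_delta / SCORE_CEILING)       # 5. base score
    score *= min(max(transfer_multiplier, 1.0), 2.0)      # 6. transfer bonus, capped at 2x
    score *= ANCESTRY_DECAY ** depth                      # 7. ancestry decay
    return min(max(score, 0.0), 1.0)                      # 8. clamp
```

Under these assumed constants, a replayed delta of 0.0025 that matches the claim and passes the significance gate scores approximately 0.5 at depth 0. The square root rewards small verified gains more than a linear scale would.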

Full System Data Flow

┌──────────────────────────────────────────────────────────────────────┐
│                        Bittensor Subtensor                           │
│  (blockchain: registrations, weights, emissions, Yuma Consensus)     │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               │                         │
       ┌───────┴───────┐        ┌───────┴───────┐
       │   Validator    │        │    Miner       │
       │                │        │                │
       │ • dendrite     │─query──▶ • axon:8091    │
       │ • Docker replay│◀─resp──│ • LLM backend  │
       │ • bootstrap    │        │ • GPU backend  │
       │ • scoring      │        │ • auto-tuner   │
       │ • weight-set   │        │ • evidence     │
       │                │        │                │
       │ :9102/metrics  │        │ :9101/metrics  │
       └───────┬───────┘        └───────┬───────┘
               │                         │
               │                         │
       ┌───────┴───────┐        ┌───────┴───────┐
       │  IPFS          │        │  GPU Worker    │
       │  (evidence     │        │  local/rented  │
       │   artifacts)   │        │  registry      │
       │                │        │  wider         │
       └───────────────┘        └───────────────┘

Module Dependency Graph

neurons/production.py          -- CLI entry point
  ├── neurons/miner.py         -- ARCMiner class
  │     ├── arc_subnet/protocol.py       -- ResearchProposal synapse
  │     ├── arc_subnet/llm_providers.py  -- model backend abstraction
  │     ├── arc_subnet/cloud_providers.py -- GPU backend abstraction
  │     ├── arc_subnet/arc_client.py     -- ARC protocol bridge
  │     ├── arc_subnet/metrics.py        -- Prometheus metrics
  │     └── neurons/domains/registry.py  -- domain auto-discovery
  │           ├── neurons/domains/base.py          -- DomainBase ABC
  │           ├── neurons/domains/nanochat_v1.py   -- nanochat domain
  │           └── neurons/domains/nanogpt_v1.py    -- nanoGPT domain
  │
  └── neurons/validator.py     -- ARCValidator class
        ├── arc_subnet/protocol.py       -- (shared)
        ├── arc_subnet/reward.py         -- 8-stage scoring pipeline
        ├── arc_subnet/arc_client.py     -- (shared)
        ├── arc_subnet/metrics.py        -- (shared)
        └── neurons/domains/registry.py  -- (shared)

crates/cli/                    -- Rust arc-cli (standalone binary)
  └── crates/core/             -- BLAKE3 store, SQLite DB, IPFS client

Key Design Decisions

1. Deterministic Replay over Cryptographic Proofs

Decision: Verify improvements by replaying training inside Docker containers rather than using cryptographic proofs of compute.

Why: No widely accepted cryptographic proof of learning exists as of 2026. Deterministic replay with CUBLAS_WORKSPACE_CONFIG=:4096:8 provides reproducibility at a 5-20% throughput cost. This is the same approach used by Bittensor SN9 and SN62.

Trade-off: Validators must have GPUs. This increases validator cost but provides ground-truth verification.

2. Paired Bootstrap Significance Testing

Decision: Use a paired bootstrap test (K=5, N=10000, p<0.05) instead of simpler threshold-based scoring.

Why: Training has inherent randomness, so a single-run delta could be noise. The bootstrap test provides a principled way to distinguish signal from variance, and the paired design (same seeds, baseline vs. improved) controls for run-to-run variance. Calibration on an RTX 3050 Ti showed stdev=0.000180, well below the 0.0005 minimum-delta gate, which suggests that the test has sufficient power.

Trade-off: K=5 replays cost ~$0.11 per validation on a 4090. Lazy validation (caching baselines) reduces this to ~$0.055.
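
A paired bootstrap over the K replay pairs can be sketched with the standard library alone. This is a minimal sketch, not the validator's implementation: the function name, the one-sided formulation, and the fixed RNG seed are assumptions. The metric is assumed to be minimized, as with val_bpb.

```python
import random


def paired_bootstrap_p_value(baseline, improved, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap for a minimized metric such as val_bpb.

    Returns the fraction of resamples whose mean paired delta
    (baseline - improved) is <= 0, i.e. shows no improvement.
    """
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need K >= 1 matched (baseline, improved) pairs")
    # Pairing: each delta compares two runs that share a seed.
    deltas = [b - i for b, i in zip(baseline, improved)]
    rng = random.Random(seed)
    k = len(deltas)
    not_improved = 0
    for _ in range(n_resamples):
        # Resample K deltas with replacement and test the resampled mean.
        resample = [deltas[rng.randrange(k)] for _ in range(k)]
        if sum(resample) / k <= 0:
            not_improved += 1
    return not_improved / n_resamples


baseline = [1.2340, 1.2342, 1.2339, 1.2341, 1.2343]
improved = [1.2330, 1.2331, 1.2329, 1.2332, 1.2333]
p = paired_bootstrap_p_value(baseline, improved)
```

Here every paired delta is positive, so no resample can have a non-positive mean and p is 0.0. With K=5 there are only 5^5 = 3125 distinct ordered resamples, so the p-value has limited resolution regardless of N.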

3. Frozen/Search Surface Separation

Decision: Explicitly define which files miners can and cannot modify.

Why: If miners could modify the evaluation code (prepare.py), they could game the metric. The frozen surface (data prep, evaluation) is hash-locked; validators reject any diff that touches frozen files. The search surface (train.py hyperparameters and architecture) is where improvements happen.
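
The path check on a submitted diff could look like the sketch below. The function names are hypothetical, and this covers only the "diff touches a frozen file" rejection; re-hashing the frozen files against their locked BLAKE3 hashes is a separate step not shown here.

```python
def touched_files(diff_text):
    """Collect file paths named in the ---/+++ headers of a unified diff."""
    paths = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":
                continue  # side of a file creation or deletion
            # Strip the a/ and b/ prefixes that git adds.
            if path.startswith(("a/", "b/")):
                path = path[2:]
            paths.add(path)
    return paths


def check_surfaces(diff_text, frozen_surface, search_surface):
    """Return (ok, reason). Rejects diffs touching frozen or unlisted files."""
    for path in sorted(touched_files(diff_text)):
        if path in frozen_surface:
            return False, f"touches frozen file: {path}"
        if path not in search_surface:
            return False, f"outside search surface: {path}"
    return True, "ok"
```

The header scan is deliberately conservative: a removed line whose content begins with "-- " also matches the "--- " prefix, and in that case the check errs toward rejection rather than acceptance.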

4. Three-Tier Diff Application

Decision: Apply diffs in three tiers: a clean git apply, then git apply with whitespace fixing, then a fuzzy line-matching fallback.

Why: LLMs frequently produce diffs with wrong line numbers or minor whitespace differences. The fuzzy fallback finds lines by stripped content and preserves original indentation, giving a ~95% success rate for LLM-generated diffs.
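
The third tier can be illustrated for a single-line change. This is a simplified sketch with a hypothetical function name; the miner's fallback handles whole hunks, and returning None on an ambiguous match is an assumption about how such cases are treated.

```python
def fuzzy_replace_line(source_lines, old_line, new_line):
    """Replace the unique line whose stripped content equals old_line.strip().

    Keeps the indentation of the matched source line. Returns a new list,
    or None when there is no match or the match is ambiguous.
    """
    target = old_line.strip()
    matches = [i for i, line in enumerate(source_lines) if line.strip() == target]
    if len(matches) != 1:
        return None
    idx = matches[0]
    original = source_lines[idx]
    # Everything before the first non-whitespace character is the indent.
    indent = original[: len(original) - len(original.lstrip())]
    patched = list(source_lines)
    patched[idx] = indent + new_line.strip()
    return patched


source = ["def train():", "    lr = 0.001", "    return lr"]
# The LLM emitted the right content with the wrong indentation.
patched = fuzzy_replace_line(source, "lr = 0.001", "  lr = 0.003")
```

Matching on stripped content ignores both the line number and the leading whitespace in the LLM's diff, which are the two failure modes named above.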

5. Commit-Reveal with Domain Binding

Decision: commit_hash = SHA3-256(diff_hash || salt || domain_id)

Why: Prevents front-running (commit before reveal) and cross-domain replay attacks (a valid nanochat improvement submitted against the nanoGPT baseline). The domain_id binding was added after identifying this attack vector.
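
The commitment can be computed and checked with Python's hashlib. This is a sketch under stated assumptions: the function names are hypothetical, and the byte encoding (UTF-8 strings concatenated with no separator) is a guess at the wire format rather than the protocol's definition.

```python
import hashlib
import hmac


def make_commit(diff_hash: str, salt: str, domain_id: str) -> str:
    """SHA3-256 over diff_hash || salt || domain_id, hex-encoded."""
    payload = (diff_hash + salt + domain_id).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


def verify_reveal(commit_hash: str, diff_hash: str, salt: str, domain_id: str) -> bool:
    """Recompute the commitment from the revealed values and compare."""
    expected = make_commit(diff_hash, salt, domain_id)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, commit_hash)


diff_hash = "ab" * 32  # placeholder for a 64-hex-character BLAKE3 digest
commit = make_commit(diff_hash, "example-salt", "nanochat-val_bpb-v1")
same_domain = verify_reveal(commit, diff_hash, "example-salt", "nanochat-val_bpb-v1")
other_domain = verify_reveal(commit, diff_hash, "example-salt", "nanogpt-val_loss-v1")
```

The reveal verifies only against the domain it was committed to, so the same diff and salt cannot be replayed under another domain_id. Because plain concatenation leaves the salt/domain_id boundary ambiguous, a fixed-length salt or a length prefix would be a reasonable hardening step.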

6. Early Stopping (Power-Law + Direct)

Decision: Kill experiments at 90s via power-law loss prediction and at 120s via direct measurement.

Why: ~60% of experiments are bad. Without early stopping, each takes 5 minutes. With it, bad experiments die in 90-120s, enabling ~3x more experiments per epoch. The power-law predictor fits L(t) = a*t^(-alpha) + b to early loss values and predicts within ~1-2% of final loss.
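
The power-law predictor can be approximated without a numerical library. For a fixed alpha the model is linear in a and b, so a grid search over alpha with closed-form least squares is enough for a sketch. The function names, the alpha grid, and the 300-second horizon (taken from the 5-minute run length above) are assumptions, not the miner's actual code.

```python
def fit_power_law(times, losses, alphas=None):
    """Fit L(t) = a * t**(-alpha) + b by grid search over alpha.

    For each candidate alpha the model is linear in (a, b), so ordinary
    least squares has a closed form. Returns (a, alpha, b) with lowest SSE.
    """
    if alphas is None:
        alphas = [i / 100 for i in range(5, 301)]  # 0.05 .. 3.00
    n = len(times)
    best = None
    for alpha in alphas:
        xs = [t ** (-alpha) for t in times]
        mean_x = sum(xs) / n
        mean_y = sum(losses) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # all time points identical; slope is undefined
        a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses)) / sxx
        b = mean_y - a * mean_x
        sse = sum((a * x + b - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    if best is None:
        raise ValueError("need at least two distinct time points")
    _, a, alpha, b = best
    return a, alpha, b


def predict_loss(params, t):
    """Evaluate the fitted curve at time t (seconds)."""
    a, alpha, b = params
    return a * t ** (-alpha) + b


def should_kill(times, losses, baseline_loss, final_t=300.0):
    """Kill when the extrapolated final loss would not beat the baseline."""
    return predict_loss(fit_power_law(times, losses), final_t) >= baseline_loss
```

A caller would collect (time, loss) pairs for the first 90 seconds, call should_kill against the domain's calibrated baseline, and fall back to the direct 120-second measurement when the fit is inconclusive.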

7. Pluggable Domain System with Auto-Discovery

Decision: Abstract domains behind DomainBase ABC with auto-discovery via registry.

Why: Different research tasks have different metrics, search surfaces, and baselines. With auto-discovery, adding a domain only requires creating a new .py file in neurons/domains/; no registration boilerplate is needed.

How to Add a New Domain

Create a new file in neurons/domains/, e.g., neurons/domains/my_task_v1.py
Implement the DomainBase interface:

from neurons.domains.base import DomainBase, MetricSpec

class MyTaskV1(DomainBase):
    domain_id = "mytask-metric_name-v1"

    metric = MetricSpec(
        name="metric_name",           # key in metrics.json
        direction="minimize",          # or "maximize"
        stdout_pattern=r"metric_name[=:](\d+\.?\d*)",  # regex to extract from stdout
    )

    seed_codebase_ref = "blake3hash..."  # BLAKE3 hash of seed tarball

    seed_scores = {
        "RTS-1-4090": 1.234,           # calibrated baseline per hardware class
    }

    frozen_surface = {
        "prepare.py": "blake3hash...", # files that must NOT change
    }

    search_surface = ["train.py"]      # files miners CAN modify

Stage the seed workspace:

# Package your training codebase
tar czf my_task_seed.tar.gz my_task/
# Store in BLAKE3 CAS
arc-cli store my_task_seed.tar.gz
# Note the hash -- use it as seed_codebase_ref

Calibrate baselines:

# Current helper calibrates the nanochat seed on a local GPU.
# New domains should add an equivalent domain-specific calibration helper.
python scripts/calibrate_local.py --batch-size 4 --runs 3

Compute frozen surface hashes:

# BLAKE3 hash each frozen file
arc-cli hash my_task/prepare.py

The domain will be auto-discovered by the registry on next startup.

How to Add a New LLM Provider

LLM providers live in arc_subnet/llm_providers.py. Each provider implements the same interface: take a prompt string, return a response string.

Add a new class in llm_providers.py:

class MyLLMBackend:
    def __init__(self, api_key: str, model: str = "default-model"):
        self.api_key = api_key
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4096) -> str:
        # Call your LLM API
        # Return the response text
        ...

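# Register the backend under the name passed to --miner.llm_backend.
# Assumes `os` is imported and LLM_BACKENDS is the registry dict in llm_providers.py.
LLM_BACKENDS["myllm"] = lambda: MyLLMBackend(
    api_key=os.environ.get("MYLLM_API_KEY", ""),
)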

Use it:

python neurons/production.py miner --miner.llm_backend myllm

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.

How to Add a New Cloud Provider

Cloud providers live in arc_subnet/cloud_providers.py. Each implements the GPU rental lifecycle: search -> create -> wait -> upload -> train -> download -> destroy.

Add a new class in cloud_providers.py:

class MyCloudProvider:
    def search_instances(self, gpu_model: str, max_price: float) -> list:
        """Search marketplace for available GPUs."""
        ...

    def create_instance(self, offer_id: str) -> str:
        """Rent an instance, return instance_id."""
        ...

    def wait_for_ssh(self, instance_id: str, timeout: int = 600) -> tuple:
        """Wait until SSH is ready, return (host, port)."""
        ...

    def run_training(self, instance_id: str, workspace: str) -> dict:
        """Upload workspace, run training, download results."""
        ...

    def destroy_instance(self, instance_id: str) -> None:
        """Tear down the instance."""
        ...

Register it in the provider factory.
Use it:

python neurons/production.py miner --miner.train_backend mycloud

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.
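
The rental lifecycle described above can be driven by a small helper that guarantees teardown. This is a sketch, not code from cloud_providers.py: the helper name is hypothetical, and it assumes each offer returned by search_instances is a dict with an "id" key.

```python
def run_on_rented_gpu(provider, gpu_model, max_price, workspace):
    """Drive search -> create -> wait -> train -> destroy for one job."""
    offers = provider.search_instances(gpu_model, max_price)
    if not offers:
        raise RuntimeError(f"no {gpu_model} offers at or below {max_price}")
    # Assumption: each offer is a dict with an "id" key.
    instance_id = provider.create_instance(offers[0]["id"])
    try:
        provider.wait_for_ssh(instance_id)
        return provider.run_training(instance_id, workspace)
    finally:
        # Always release the rental, even if SSH or training fails.
        provider.destroy_instance(instance_id)
```

Putting destroy_instance in a finally block means a failed or timed-out run cannot leave a billed instance behind.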

Extension Points

Extension Point	Location	Description
New domain	`neurons/domains/*.py`	Add any ML training task
New LLM provider	`arc_subnet/llm_providers.py`	Add any LLM API
New cloud provider	`arc_subnet/cloud_providers.py`	Add any GPU rental service
New scoring gate	`arc_subnet/reward.py`	Add validation gates to scoring pipeline
New metric type	`neurons/domains/base.py` MetricSpec	Support maximize/minimize for any metric
Custom prompts	`neurons/miner.py` prompt construction	Modify LLM prompt strategy
Hardware classes	`arc_subnet/protocol.py`	Add GPU hardware class identifiers
ARC protocol	`arc_subnet/arc_client.py` + `crates/`	Extend Rust protocol layer
Monitoring	`arc_subnet/metrics.py`	Add Prometheus counters/gauges

Security Architecture

┌──────────────────────────────────────────────────────────────┐
│  Validator Docker Replay Container                           │
│                                                              │
│  --network=none                    No internet access        │
│  --read-only                       Immutable root FS         │
│  --tmpfs /tmp:size=2g              RAM-backed temp only      │
│  --cap-drop=ALL                    No Linux capabilities     │
│  --security-opt=no-new-privileges  No privilege escalation   │
│  --memory=16g                      OOM protection            │
│  --pids-limit=512                  Fork bomb protection      │
│  NVIDIA_DRIVER_CAPABILITIES=compute only                     │
│  CUBLAS_WORKSPACE_CONFIG=:4096:8   Deterministic             │
│                                                              │
│  NVIDIA Container Toolkit >= 1.17.8                          │
│  (CVE-2025-23266 patched, CVSS 9.0)                          │
└──────────────────────────────────────────────────────────────┘

Evidence integrity chain:

Miner trains -> BLAKE3 hash artifacts -> commit SHA3-256(diff+salt+domain)
  -> upload to IPFS -> reveal CIDs + salt
    -> validator fetches from IPFS (4-gateway fallback)
      -> re-hash with BLAKE3 -> compare against declared hashes
        -> reject on ANY mismatch
