Proof of Autoresearch Architecture

Technical architecture document for the Proof of Autoresearch ARC-Bittensor proof of concept.

Implementation counts and live-readiness details can drift; see ../STATUS.md for the current release status.


System Overview

Proof of Autoresearch is an ARC-Bittensor proof of concept for a Bittensor subnet where miners discover training recipe improvements through LLM-guided autoresearch, and validators verify those improvements through deterministic replay and statistical testing.

Component Summary

| Component | File | Lines | Purpose |
|---|---|---|---|
| Miner | neurons/miner.py | 2,155 | Autoresearch loop: LLM -> diff -> train -> evidence |
| Validator | neurons/validator.py | 1,059 | Docker replay, bootstrap test, scoring |
| Production | neurons/production.py | 663 | Bittensor v10 SDK wiring, CLI entry point |
| Protocol | arc_subnet/protocol.py | 359 | ResearchProposal synapse (16 fields, 64+ hw classes) |
| Reward | arc_subnet/reward.py | 405 | 8-stage scoring pipeline |
| Cloud Providers | arc_subnet/cloud_providers.py | 590 | Abstraction for 4 GPU cloud providers |
| LLM Providers | arc_subnet/llm_providers.py | 468 | Abstraction for 8 LLM backends |
| ARC Client | arc_subnet/arc_client.py | 408 | Rust arc-cli bridge for ARC protocol DB |
| Metrics | arc_subnet/metrics.py | 163 | Prometheus /metrics endpoint |
| Domain Base | neurons/domains/base.py | 139 | DomainBase ABC + MetricSpec |
| Domain Registry | neurons/domains/registry.py | 81 | Auto-discovery domain registry |
| Nanochat Domain | neurons/domains/nanochat_v1.py | 163 | nanochat-val_bpb-v1 config + calibration |
| NanoGPT Domain | neurons/domains/nanogpt_v1.py | 104 | nanogpt-val_loss-v1 config (stub) |
| Rust CLI | crates/cli/ | 702 | 12 subcommands, BLAKE3 CAS |
| Rust Core | crates/core/ | 923 | SQLite DB + blob store + IPFS |

Data Flow

Mining Flow (Miner)

                          ┌──────────────────────┐
                          │   Model Backend      │
                          │   (local or remote;  │
                          │    registry broader) │
                          └──────────┬───────────┘
                                     │ 1. Generate diff
                                     ▼
┌─────────────┐    extract    ┌──────────────────┐    apply diff    ┌─────────────────┐
│ BLAKE3 CAS  │──────────────▶│  Seed Workspace  │────────────────▶│ Modified        │
│ (arc_store/)│               │  (nanochat/)     │                 │ Workspace       │
└─────────────┘               └──────────────────┘                 └────────┬────────┘
                                                                            │
                                                              2. Train      │
                                                                            ▼
                                                                   ┌────────────────┐
                                                                   │  GPU Training   │
                                                                   │  local/rented  │
                                                                   │  + registry    │
                                                                   └────────┬────────┘
                                                                            │
                                                              3. Parse      │
                                                                 metrics    │
                                                                            ▼
┌─────────────────┐   5. Submit   ┌──────────────────┐   4. Bundle   ┌──────────────┐
│  Bittensor      │◀──────────────│  ResearchProposal│◀──────────────│ Evidence     │
│  (Subtensor)    │   via axon    │  (16 fields)     │   BLAKE3 hash │ (diff, log,  │
└─────────────────┘               └──────────────────┘               │  metrics,    │
                                                                     │  config, env)│
                                                                     └──────────────┘

Validation Flow (Validator)

┌─────────────────┐   1. Query    ┌──────────────────┐
│  Validator       │──────────────▶│  Miner (axon)    │
│  (dendrite)      │◀──────────────│                  │
└────────┬─────────┘   proposal    └──────────────────┘
         │
         │ 2. Verify commit-reveal
         │    SHA3-256(diff_hash + salt + domain_id) == commit_hash
         │
         │ 3. Fetch & verify evidence
         │    IPFS multi-gateway -> BLAKE3 re-hash -> compare
         │
         ▼
┌────────────────────────────────────────────────────────┐
│  Docker Replay (K=5 runs)                              │
│                                                        │
│  docker run --rm --gpus all --network=none             │
│    --read-only --cap-drop=ALL --memory=16g             │
│    --pids-limit=512 --security-opt=no-new-privileges   │
│    -e CUBLAS_WORKSPACE_CONFIG=:4096:8                  │
│    arc-replay:latest timeout 420 uv run train.py       │
│                                                        │
│  Runs baseline (or uses cache) + improved version      │
│  Collects K pairs of (baseline_bpb, improved_bpb)      │
└────────────────────┬───────────────────────────────────┘
                     │
                     │ 4. Paired bootstrap significance test
                     │    N=10000 resamples, p < 0.05
                     │
                     ▼
┌────────────────────────────────────────────────────────┐
│  8-Stage Scoring Pipeline                              │
│                                                        │
│  1. Replay gate:     replay_delta > 0                  │
│  2. Min-delta gate:  replay_delta > 0.0005             │
│  3. Significance:    p-value < 0.05                    │
│  4. Tolerance:       claimed ~= replay (within 5%)     │
│  5. Base score:      sqrt(delta / SCORE_CEILING)       │
│  6. Transfer bonus:  up to 2x for larger-scale confirm │
│  7. Ancestry decay:  0.95^depth                        │
│  8. Clamp:           [0.0, 1.0]                        │
└────────────────────┬───────────────────────────────────┘
                     │
                     │ 5. Set weights on chain
                     ▼
              ┌──────────────┐
              │  Subtensor    │
              │  Yuma         │
              │  Consensus    │
              └──────────────┘
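
The eight scoring stages in the diagram above can be read as one function of the replay results. The following is a minimal sketch, not the code in arc_subnet/reward.py. It assumes that delta means baseline minus improved (positive is better for a minimized metric), that the 5% tolerance is relative to the replayed delta, and that the transfer bonus arrives as a multiplier between 1x and 2x. The value of SCORE_CEILING is not given in this document, so the constant below is a placeholder.

```python
import math

MIN_DELTA = 0.0005      # stage 2 threshold from the diagram
P_THRESHOLD = 0.05      # stage 3 threshold from the diagram
TOLERANCE = 0.05        # stage 4: claimed within 5% of replay (assumed relative)
ANCESTRY_DECAY = 0.95   # stage 7 base from the diagram
SCORE_CEILING = 0.01    # ASSUMPTION: placeholder, the real value is not stated


def score_proposal(claimed_delta, replay_delta, p_value,
                   transfer_multiplier=1.0, ancestry_depth=0):
    """Return a score in [0, 1]; 0.0 means one of the gates rejected it."""
    if replay_delta <= 0:                          # 1. replay gate
        return 0.0
    if replay_delta <= MIN_DELTA:                  # 2. min-delta gate
        return 0.0
    if p_value >= P_THRESHOLD:                     # 3. significance gate
        return 0.0
    if abs(claimed_delta - replay_delta) > TOLERANCE * replay_delta:
        return 0.0                                 # 4. tolerance gate
    base = math.sqrt(replay_delta / SCORE_CEILING)           # 5. base score
    bonus = min(max(transfer_multiplier, 1.0), 2.0)          # 6. at most 2x
    decayed = base * bonus * ANCESTRY_DECAY ** ancestry_depth  # 7. decay
    return min(max(decayed, 0.0), 1.0)                       # 8. clamp
```

The gates run before any arithmetic, so a proposal that fails replay, min-delta, significance, or tolerance scores exactly zero regardless of bonus or ancestry.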

Full System Data Flow

┌──────────────────────────────────────────────────────────────────────┐
│                        Bittensor Subtensor                           │
│  (blockchain: registrations, weights, emissions, Yuma Consensus)     │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
               ┌────────────┴────────────┐
               │                         │
       ┌───────┴───────┐        ┌───────┴───────┐
       │   Validator    │        │    Miner       │
       │                │        │                │
       │ • dendrite     │─query──▶ • axon:8091    │
       │ • Docker replay│◀─resp──│ • LLM backend  │
       │ • bootstrap    │        │ • GPU backend  │
       │ • scoring      │        │ • auto-tuner   │
       │ • weight-set   │        │ • evidence     │
       │                │        │                │
       │ :9102/metrics  │        │ :9101/metrics  │
       └───────┬───────┘        └───────┬───────┘
               │                         │
               │                         │
       ┌───────┴───────┐        ┌───────┴───────┐
       │  IPFS          │        │  GPU Worker    │
       │  (evidence     │        │  local/rented  │
       │   artifacts)   │        │  registry      │
       │                │        │  wider         │
       └───────────────┘        └───────────────┘

Module Dependency Graph

neurons/production.py          -- CLI entry point
  ├── neurons/miner.py         -- ARCMiner class
  │     ├── arc_subnet/protocol.py       -- ResearchProposal synapse
  │     ├── arc_subnet/llm_providers.py  -- model backend abstraction
  │     ├── arc_subnet/cloud_providers.py -- GPU backend abstraction
  │     ├── arc_subnet/arc_client.py     -- ARC protocol bridge
  │     ├── arc_subnet/metrics.py        -- Prometheus metrics
  │     └── neurons/domains/registry.py  -- domain auto-discovery
  │           ├── neurons/domains/base.py          -- DomainBase ABC
  │           ├── neurons/domains/nanochat_v1.py   -- nanochat domain
  │           └── neurons/domains/nanogpt_v1.py    -- nanoGPT domain
  │
  └── neurons/validator.py     -- ARCValidator class
        ├── arc_subnet/protocol.py       -- (shared)
        ├── arc_subnet/reward.py         -- 8-stage scoring pipeline
        ├── arc_subnet/arc_client.py     -- (shared)
        ├── arc_subnet/metrics.py        -- (shared)
        └── neurons/domains/registry.py  -- (shared)

crates/cli/                    -- Rust arc-cli (standalone binary)
  └── crates/core/             -- BLAKE3 store, SQLite DB, IPFS client

Key Design Decisions

1. Deterministic Replay over Cryptographic Proofs

Decision: Verify improvements by replaying training inside Docker containers rather than using cryptographic proofs of compute.

Why: No widely accepted cryptographic proof of learning exists as of 2026. Deterministic replay with CUBLAS_WORKSPACE_CONFIG=:4096:8 provides reproducibility at 5-20% throughput cost. This is the same approach used by Bittensor SN9 and SN62.

Trade-off: Validators must have GPUs. This increases validator cost but provides ground-truth verification.

2. Paired Bootstrap Significance Testing

Decision: Use a paired bootstrap test (K=5, N=10000, p<0.05) instead of simpler threshold-based scoring.

Why: Training has inherent randomness, so a single-run delta could be noise. The bootstrap test provides a principled way to distinguish signal from variance. The paired design (same seeds, baseline vs. improved) controls for run-to-run variance. Calibration on an RTX 3050 Ti showed a run-to-run stdev of 0.000180, roughly a third of the 0.0005 min-delta gate, which suggests the test has enough power to detect improvements at that threshold.

Trade-off: K=5 replays cost ~$0.11 per validation on a 4090. Lazy validation (caching baselines) reduces this to ~$0.055.
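
As an illustration of the test described above, here is a minimal percentile-style paired bootstrap using only the standard library. It is a sketch under stated assumptions, not the validator's implementation: the metric is lower-is-better, so each paired delta is baseline minus improved, and the p-value is taken as the fraction of resampled mean deltas that are not positive. The function name and the fixed seed are illustrative.

```python
import random


def paired_bootstrap_p(baseline, improved, n_resamples=10_000, seed=0):
    """One-sided p-value that the mean paired improvement is <= 0.

    Assumes a lower-is-better metric, so delta = baseline - improved.
    """
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need equal-length, non-empty paired samples")
    deltas = [b - i for b, i in zip(baseline, improved)]
    k = len(deltas)
    rng = random.Random(seed)  # fixed seed keeps the result repeatable
    not_better = 0
    for _ in range(n_resamples):
        resample = rng.choices(deltas, k=k)  # draw K pairs with replacement
        if sum(resample) / k <= 0:
            not_better += 1
    return not_better / n_resamples
```

With K=5 pairs, a proposal passes at p < 0.05 only when nearly every resample still shows a positive mean delta, which in practice requires the improvement to be consistent across the replay runs.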

3. Frozen/Search Surface Separation

Decision: Explicitly define which files miners can and cannot modify.

Why: If miners could modify the evaluation code (prepare.py), they could game the metric. The frozen surface (data prep, evaluation) is hash-locked; validators reject any diff that touches frozen files. The search surface (train.py hyperparameters and architecture) is where improvements happen.
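
A path-level version of this check can be sketched as follows. It only inspects the file headers of a unified diff and assumes that anything outside the search surface is rejected; the hash-lock comparison of frozen files is a separate step and is not shown. Function names are hypothetical.

```python
def touched_paths(diff_text):
    """Collect file paths named in the ---/+++ headers of a unified diff.

    A hunk line that happens to start with '--- ' or '+++ ' would be read
    as a header too; that errs on the side of rejecting the diff.
    """
    paths = set()
    for line in diff_text.splitlines():
        if line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t")[0].strip()
            if path == "/dev/null":          # file creation or deletion marker
                continue
            if path.startswith(("a/", "b/")):  # git-style prefixes
                path = path[2:]
            paths.add(path)
    return paths


def violates_frozen_surface(diff_text, frozen_surface, search_surface):
    """Return the sorted list of paths the diff is not allowed to touch."""
    allowed = set(search_surface)
    bad = [p for p in touched_paths(diff_text)
           if p in frozen_surface or p not in allowed]
    return sorted(bad)
```

An empty result means the diff stays inside the search surface; any returned path is grounds for rejecting the proposal before replay.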

4. Three-Tier Diff Application

Decision: Apply diffs using clean git apply -> whitespace-fix git apply -> fuzzy line-matching fallback.

Why: LLMs frequently produce diffs with wrong line numbers or minor whitespace differences. The fuzzy fallback finds lines by stripped content and preserves original indentation, giving a ~95% success rate for LLM-generated diffs.
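
The first two tiers are plain git invocations (git apply, then git apply --whitespace=fix). The third tier is the interesting one, and a minimal sketch of it follows. This is an illustration of matching by stripped content while keeping the file's indentation, not the miner's actual fallback; it handles one replaced block at a time, assumes space indentation, and the function names are hypothetical.

```python
def _indent_width(line):
    """Number of leading whitespace characters on a line."""
    return len(line) - len(line.lstrip())


def fuzzy_replace(file_lines, old_block, new_block):
    """Replace old_block with new_block, matching lines by stripped content.

    Line numbers are ignored entirely. Returns the new list of lines, or
    None if no run of lines in the file matches old_block.
    """
    wanted = [line.strip() for line in old_block]
    n = len(wanted)
    if n == 0:
        return None
    for start in range(len(file_lines) - n + 1):
        window = file_lines[start:start + n]
        if [line.strip() for line in window] != wanted:
            continue
        # Indentation actually used in the file at the match position.
        file_indent = window[0][:_indent_width(window[0])]
        # Indentation the diff used for the same line (possibly wrong).
        diff_indent = _indent_width(old_block[0])
        replacement = []
        for line in new_block:
            if not line.strip():
                replacement.append("")
                continue
            extra = max(_indent_width(line) - diff_indent, 0)
            replacement.append(file_indent + " " * extra + line.lstrip())
        return file_lines[:start] + replacement + file_lines[start + n:]
    return None
```

Re-indenting relative to the first matched line keeps nested blocks intact even when the LLM emitted the hunk with no leading whitespace at all.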

5. Commit-Reveal with Domain Binding

Decision: commit_hash = SHA3-256(diff_hash || salt || domain_id)

Why: Prevents front-running (commit before reveal) and cross-domain replay attacks (a valid nanochat improvement submitted against nanoGPT baseline). The domain_id binding was added after identifying this attack vector.
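
A minimal sketch of the commit and verify steps, using hashlib from the standard library. The byte encoding of the three fields is not specified in this document; the sketch assumes UTF-8 strings concatenated without a separator, and the function names are illustrative.

```python
import hashlib
import hmac


def make_commit(diff_hash: str, salt: str, domain_id: str) -> str:
    """commit_hash = SHA3-256(diff_hash || salt || domain_id), hex-encoded."""
    # ASSUMPTION: fields are UTF-8 strings joined with no separator.
    payload = (diff_hash + salt + domain_id).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


def verify_reveal(commit_hash: str, diff_hash: str, salt: str,
                  domain_id: str) -> bool:
    """Recompute the commitment from the revealed values and compare."""
    expected = make_commit(diff_hash, salt, domain_id)
    return hmac.compare_digest(expected, commit_hash)
```

Because domain_id is part of the hashed payload, a commitment made for one domain fails verification when the same diff and salt are revealed against another domain.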

6. Early Stopping (Power-Law + Direct)

Decision: Kill experiments at 90s via power-law loss prediction and at 120s via direct measurement.

Why: ~60% of experiments are bad. Without early stopping, each takes 5 minutes. With it, bad experiments die in 90-120s, enabling ~3x more experiments per epoch. The power-law predictor fits L(t) = a*t^(-alpha) + b to early loss values and predicts within ~1-2% of final loss.
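
To make the power-law predictor concrete, here is a dependency-free sketch that fits L(t) = a*t^(-alpha) + b. It is not the miner's fitting code: it assumes a simple grid search over alpha, using the fact that for a fixed alpha the model is linear in a and b and can be solved by ordinary least squares. The grid range and function names are assumptions.

```python
def fit_power_law(times, losses, alphas=None):
    """Fit L(t) = a * t**(-alpha) + b by grid search over alpha.

    For each candidate alpha, a and b come from closed-form least squares
    on x = t**(-alpha); the alpha with the smallest squared error wins.
    Times must be positive. Returns (a, alpha, b).
    """
    if alphas is None:
        alphas = [i / 100 for i in range(5, 201)]  # 0.05 .. 2.00 (assumed)
    n = len(times)
    mean_y = sum(losses) / n
    best = None
    for alpha in alphas:
        xs = [t ** (-alpha) for t in times]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
        a = sxy / sxx
        b = mean_y - a * mean_x
        sse = sum((a * x + b - y) ** 2 for x, y in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, a, alpha, b)
    if best is None:
        raise ValueError("need at least two distinct time points")
    return best[1], best[2], best[3]


def predict_final_loss(times, losses, t_final):
    """Extrapolate the fitted curve to t_final seconds."""
    a, alpha, b = fit_power_law(times, losses)
    return a * t_final ** (-alpha) + b
```

A miner could call predict_final_loss on the loss values logged during the first 90 seconds and kill the run if the extrapolated value at the end of training is no better than the baseline.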

7. Pluggable Domain System with Auto-Discovery

Decision: Abstract domains behind DomainBase ABC with auto-discovery via registry.

Why: Different research tasks have different metrics, search surfaces, and baselines. The auto-discovery means adding a domain is just creating a new .py file in neurons/domains/. No registration boilerplate needed.
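
One common way to implement this kind of auto-discovery is to import every module in the package and collect the concrete subclasses of the base class. The sketch below shows that pattern with pkgutil and importlib; it is an assumption about how such a registry can work, not a copy of neurons/domains/registry.py, and discover_domains is a hypothetical name.

```python
import importlib
import inspect
import pkgutil


def discover_domains(package_name, base_class):
    """Map domain_id -> class for each concrete subclass of base_class
    defined in the direct submodules of package_name."""
    package = importlib.import_module(package_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        for _, obj in inspect.getmembers(module, inspect.isclass):
            if (issubclass(obj, base_class)
                    and obj is not base_class
                    and not inspect.isabstract(obj)
                    # Skip classes merely imported into this module.
                    and obj.__module__ == module.__name__):
                found[obj.domain_id] = obj
    return found
```

With this pattern, dropping a new file into the package is enough: the next call to discover_domains imports it and picks up its domain_id.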


How to Add a New Domain

  1. Create a new file in neurons/domains/, e.g., neurons/domains/my_task_v1.py

  2. Implement the DomainBase interface:

from neurons.domains.base import DomainBase, MetricSpec

class MyTaskV1(DomainBase):
    domain_id = "mytask-metric_name-v1"

    metric = MetricSpec(
        name="metric_name",           # key in metrics.json
        direction="minimize",          # or "maximize"
        stdout_pattern=r"metric_name[=:](\d+\.?\d*)",  # regex to extract from stdout
    )

    seed_codebase_ref = "blake3hash..."  # BLAKE3 hash of seed tarball

    seed_scores = {
        "RTS-1-4090": 1.234,           # calibrated baseline per hardware class
    }

    frozen_surface = {
        "prepare.py": "blake3hash...", # files that must NOT change
    }

    search_surface = ["train.py"]      # files miners CAN modify

  3. Stage the seed workspace:

# Package your training codebase
tar czf my_task_seed.tar.gz my_task/
# Store in BLAKE3 CAS
arc-cli store my_task_seed.tar.gz
# Note the hash -- use it as seed_codebase_ref

  4. Calibrate baselines:

# Current helper calibrates the nanochat seed on a local GPU.
# New domains should add an equivalent domain-specific calibration helper.
python scripts/calibrate_local.py --batch-size 4 --runs 3

  5. Compute frozen surface hashes:

# BLAKE3 hash each frozen file
arc-cli hash my_task/prepare.py

The domain will be auto-discovered by the registry on next startup.
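
It is worth checking a new stdout_pattern against sample training output before calibrating. The sketch below applies the example pattern from step 2 with the re module; taking the last match as the final value is an assumption, and extract_metric is a hypothetical helper.

```python
import re

# Pattern copied from the MetricSpec example above.
STDOUT_PATTERN = r"metric_name[=:](\d+\.?\d*)"


def extract_metric(stdout, pattern=STDOUT_PATTERN):
    """Return the last matching metric value in stdout, or None."""
    # ASSUMPTION: the final reported value is the one that counts.
    matches = re.findall(pattern, stdout)
    return float(matches[-1]) if matches else None
```

Note that the example pattern does not allow whitespace after the separator, a sign, or scientific notation, so a log line such as "metric_name: 1.2" would not match. Adjust the regex to the exact format your train.py prints.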


How to Add a New LLM Provider

LLM providers live in arc_subnet/llm_providers.py. Each provider implements the same interface: take a prompt string, return a response string.

  1. Add a new class in llm_providers.py:
class MyLLMBackend:
    def __init__(self, api_key: str, model: str = "default-model"):
        self.api_key = api_key
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 4096) -> str:
        # Call your LLM API
        # Return the response text
        ...

  2. Register it in the provider factory (in the same file):

LLM_BACKENDS["myllm"] = lambda: MyLLMBackend(
    api_key=os.environ.get("MYLLM_API_KEY", ""),
)

  3. Use it:

python neurons/production.py miner --miner.llm_backend myllm

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.


How to Add a New Cloud Provider

Cloud providers live in arc_subnet/cloud_providers.py. Each implements the GPU rental lifecycle: search -> create -> wait -> upload -> train -> download -> destroy.

  1. Add a new class in cloud_providers.py:
class MyCloudProvider:
    def search_instances(self, gpu_model: str, max_price: float) -> list:
        """Search marketplace for available GPUs."""
        ...

    def create_instance(self, offer_id: str) -> str:
        """Rent an instance, return instance_id."""
        ...

    def wait_for_ssh(self, instance_id: str, timeout: int = 600) -> tuple:
        """Wait until SSH is ready, return (host, port)."""
        ...

    def run_training(self, instance_id: str, workspace: str) -> dict:
        """Upload workspace, run training, download results."""
        ...

    def destroy_instance(self, instance_id: str) -> None:
        """Tear down the instance."""
        ...

  2. Register it in the provider factory.

  3. Use it:

python neurons/production.py miner --miner.train_backend mycloud

Provider identifiers are implementation details until a backend has a public rehearsal report. Public docs should describe backend categories rather than advertising specific vendors.
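
The lifecycle above matters most on the failure path: a rented instance that is never destroyed keeps billing. A caller of the interface might therefore look like the following sketch, which wraps the middle steps in try/finally. The driver function is hypothetical, and it assumes each offer returned by search_instances is a dict with an "id" key, which the interface above does not specify.

```python
def run_on_rented_gpu(provider, gpu_model, max_price, workspace):
    """Drive search -> create -> wait -> train -> destroy for one experiment."""
    offers = provider.search_instances(gpu_model, max_price)
    if not offers:
        raise RuntimeError(f"no {gpu_model} offers at or below {max_price}")
    # ASSUMPTION: offers are dicts with an "id" key.
    instance_id = provider.create_instance(offers[0]["id"])
    try:
        provider.wait_for_ssh(instance_id)
        # run_training covers upload, train, and download per the interface.
        return provider.run_training(instance_id, workspace)
    finally:
        # Always release the instance, even if SSH or training failed.
        provider.destroy_instance(instance_id)
```

Keeping the teardown in a finally block means a new provider class only has to make destroy_instance safe to call once per created instance.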


Extension Points

| Extension Point | Location | Description |
|---|---|---|
| New domain | neurons/domains/*.py | Add any ML training task |
| New LLM provider | arc_subnet/llm_providers.py | Add any LLM API |
| New cloud provider | arc_subnet/cloud_providers.py | Add any GPU rental service |
| New scoring gate | arc_subnet/reward.py | Add validation gates to scoring pipeline |
| New metric type | neurons/domains/base.py (MetricSpec) | Support maximize/minimize for any metric |
| Custom prompts | neurons/miner.py (prompt construction) | Modify LLM prompt strategy |
| Hardware classes | arc_subnet/protocol.py | Add GPU hardware class identifiers |
| ARC protocol | arc_subnet/arc_client.py + crates/ | Extend Rust protocol layer |
| Monitoring | arc_subnet/metrics.py | Add Prometheus counters/gauges |

Security Architecture

┌──────────────────────────────────────────────────┐
│  Validator Docker Replay Container                │
│                                                   │
│  --network=none         No internet access        │
│  --read-only            Immutable root FS         │
│  --tmpfs /tmp:size=2g   RAM-backed temp only      │
│  --cap-drop=ALL         No Linux capabilities     │
│  --security-opt=no-new-privileges  No escalation  │
│  --memory=16g           OOM protection             │
│  --pids-limit=512       Fork bomb protection       │
│  NVIDIA_DRIVER_CAPABILITIES=compute only          │
│  CUBLAS_WORKSPACE_CONFIG=:4096:8  Deterministic   │
│                                                   │
│  NVIDIA Container Toolkit >= 1.17.8               │
│  (CVE-2025-23266 patched — CVSS 9.0)             │
└──────────────────────────────────────────────────┘
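
For reference, the replay invocation from the Validation Flow diagram can be assembled as an argument list, as in the sketch below. It mirrors only the flags shown in that command; the tmpfs mount and NVIDIA_DRIVER_CAPABILITIES setting listed in the box would be added the same way. The function name and its parameters are illustrative.

```python
def build_replay_command(image="arc-replay:latest", timeout_s=420):
    """Return the argv list for one sandboxed replay run."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",
        "--network=none",                     # no internet access
        "--read-only",                        # immutable root filesystem
        "--cap-drop=ALL",                     # no Linux capabilities
        "--memory=16g",                       # OOM protection
        "--pids-limit=512",                   # fork bomb protection
        "--security-opt=no-new-privileges",   # no privilege escalation
        "-e", "CUBLAS_WORKSPACE_CONFIG=:4096:8",  # deterministic cuBLAS
        image,
        "timeout", str(timeout_s), "uv", "run", "train.py",
    ]
```

Passing this list to subprocess.run without shell=True avoids shell interpolation of anything derived from miner-supplied input.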

Evidence integrity chain:

Miner trains -> BLAKE3 hash artifacts -> commit SHA3-256(diff+salt+domain)
  -> upload to IPFS -> reveal CIDs + salt
    -> validator fetches from IPFS (4-gateway fallback)
      -> re-hash with BLAKE3 -> compare against declared hashes
        -> reject on ANY mismatch
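
The validator half of this chain (gateway fallback, re-hash, compare) can be sketched as below. The fetch and hash functions are injected so the example stays independent of any HTTP or BLAKE3 library; sha256_hex is a standard-library stand-in used only for demonstration and is not the hash the pipeline uses. All names here are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Stand-in hash for the demo; the real pipeline uses BLAKE3."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(cid, declared_hash, gateways, fetch, hash_fn):
    """Try each gateway in order; return bytes only if the re-hash matches.

    fetch(gateway, cid) -> bytes and hash_fn(data) -> hex str are injected.
    """
    errors = []
    for gateway in gateways:
        try:
            data = fetch(gateway, cid)
        except Exception as exc:  # gateway down, timeout, and so on
            errors.append(f"{gateway}: {exc}")
            continue
        if hash_fn(data) != declared_hash:
            # Reject on any mismatch rather than trying another gateway.
            raise ValueError(f"hash mismatch for {cid} via {gateway}")
        return data
    raise ConnectionError("all gateways failed: " + "; ".join(errors))
```

Unreachable gateways are skipped, but a successful fetch whose hash differs from the declared value ends the attempt immediately, matching the reject-on-any-mismatch rule above.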