This directory contains the complete architecture for integrating MetaHarness Darwin Mode with RuVector's benchmark suite, enabling autonomous parameter optimization against public leaderboards (ANN-Benchmarks, BEIR, VectorDBBench, MTEB).
File: docs/METAHARNESS-ARCHITECTURE-SUMMARY.md
Length: 500 lines, ~20 min read
What it covers: Entire project overview, 3 ADRs, 5 phases, effort estimate, success criteria
File: docs/adr/ADR-265-ruvector-comprehensive-benchmark-suite.md
Length: 280 lines
What it covers:
- What we measure (5 categories: ANN, compression, latency, streaming, embedding quality)
- How we score configs (4-component function: recall, QPS, memory, latency)
- Baseline anchors and mutable surfaces
- Why these datasets (SIFT1M, GIST1M, GloVe, BEIR, MTEB)
File: docs/adr/ADR-266-metaharness-darwin-integration.md
Length: 350 lines
What it covers:
- How Darwin Mode evolves configs (32 mutation surfaces, genetic algorithm)
- ADR-150 compliance (graceful degradation if MetaHarness missing)
- Scoring policy implementation (TypeScript code)
- Evolution loop with checkpoint strategy
- CI/CD workflow (weekly evolution runs)
File: docs/adr/ADR-267-sota-validation-protocol.md
Length: 400 lines
What it covers:
- 3-tier validation (Tier 1: daily smoke, Tier 2: weekly full, Tier 3: publication audit)
- Witness signing with Ed25519 (cryptographic audit trails)
- Regression detection and SOTA claim rules
- File structure for manifests and replications
File: docs/metaharness-implementation-plan.md
Length: 500 lines, detailed code sketches and CI/CD configs
What it covers:
- All 5 phases with deliverables and success gates
- File structure (21 TypeScript, 3 Rust files)
- Effort breakdown (16 weeks, 8 agents)
- Rollout timeline (June 21 - Oct 11, 2026)
- Risk mitigation
Phase 1 (4w): ANN-Benchmarks loader + smoke test
Phase 2 (3w): Parameter sweep + Pareto frontier
Phase 3 (4w): BEIR + VectorDBBench integration
Phase 4 (3w): Darwin Mode evolution loop
Phase 5 (2w): MTEB embedding quality validation
─────────────────────────────────────────────
Total: 16 weeks, ~12K LOC
score = 0.4 * recall@10_norm
+ 0.3 * log(QPS/baseline_QPS)
+ 0.2 * (1 - min(1, memory/baseline_memory))
+ 0.1 * (1 - min(1, p99_ms/baseline_p99_ms))
All integration respects 4 invariants:
- ✅ Removable:
npm ls --without-deps @metaharness/*still works - ✅ Optional: Only in
optionalDependencies+peerDependencies - ✅ Graceful degradation: Every Darwin call wrapped in try-catch → fallback to grid search
- ✅ CI gate: Daily smoke test runs WITHOUT MetaHarness
Example graceful degradation:
async function initDarwinMode() {
try {
return await import("@metaharness/darwin");
} catch (e) {
if (e.code === "MODULE_NOT_FOUND") {
console.warn("[darwin] MetaHarness missing, using grid search");
return null;
}
throw e;
}
}ruvector/
├── docs/adr/
│ ├── ADR-265-ruvector-comprehensive-benchmark-suite.md (280 lines)
│ ├── ADR-266-metaharness-darwin-integration.md (350 lines)
│ └── ADR-267-sota-validation-protocol.md (400 lines)
│
├── docs/metaharness-implementation-plan.md (500 lines)
├── docs/METAHARNESS-ARCHITECTURE-SUMMARY.md (500 lines)
├── METAHARNESS-README.md (this file)
│
├── scripts/benchmark/ (21 TypeScript files, ~7.5K LOC)
│ ├── ann-datasets.ts (400 lines)
│ ├── single-dataset-harness.ts (600 lines)
│ ├── sweep-harness.ts (800 lines)
│ ├── darwin-harness.ts (600 lines)
│ ├── beir-loader.ts (500 lines)
│ ├── retrieval-harness.ts (700 lines)
│ ├── mteb-harness.ts (400 lines)
│ └── ... 14 more files
│
├── crates/ruvector-bench/
│ └── src/
│ ├── hdf5_loader.rs (350 lines)
│ ├── grid_search.rs (500 lines)
│ └── retrieval.rs (600 lines)
│
├── .github/workflows/
│ ├── benchmark-smoke.yml (daily)
│ ├── benchmark-sweep.yml (weekly)
│ ├── benchmark-beir.yml (weekly)
│ └── darwin-evolution.yml (weekly)
│
└── docs/validation/
├── smoke-baseline-2026-06.json (baseline)
├── manifests/ (signed per-release)
├── tier3-replications/ (publication audits)
├── witness-public-key.pem (Ed25519)
└── witness-manifest-index.json
Phase 1: SIFT1M in <30s, smoke test ±1% accuracy Phase 2: 10-15 Pareto configs, grid sweep <2h Phase 3: BEIR NDCG@10 ≥0.45 on NQ, VectorDBBench 5K QPS Phase 4: Darwin evolves 3+ metric improvement Phase 5: MTEB <10h, all-MiniLM ≥0.45 NDCG@10
Post-MVP: Signed Tier 3 manifests, ANN-Benchmarks submission
- SIFT1M: Industry standard, well-understood
- BEIR: Retrieval ground truth, 11 diverse datasets
- MTEB: Embedding quality, 170K sentences
- Not specialized leaderboards: Maintain reproducibility
- Manual grid search is O(n^k) in parameter space
- Darwin intelligently samples via genetic algorithm + simulated annealing
- Expected: beat baseline on 3+ metrics in 10 generations (~20 hours)
- SOTA claims need cryptographic proof (tamper-evidence)
- Enables third-party verification
- Required for publication credibility
- This week: Review & approve 3 ADRs
- Next 4 weeks: Phase 1 (HDF5 loader, smoke test)
- Ongoing: Weekly sync on completion, ADR-150 compliance audit
- MetaHarness Architect: Claude Code
- Phase 1 Lead: (TBD)
- Darwin Integration Lead: (TBD)
- Validation Protocol Lead: (TBD)
- ADR-150: MetaHarness Integration Surfaces (upstream)
- ADR-103: Witness Chain (upstream)
- ADR-128: SOTA Gap Implementations (context)
- ANN-Benchmarks: https://github.com/erikbern/ann-benchmarks
- BEIR: https://github.com/beir-cellar/beir
- VectorDBBench: https://github.com/zilliztech/VectorDBBench
- MTEB: https://github.com/embeddings-benchmark/mteb
Status: Ready for Phase 1 Kickoff
Last Updated: 2026-06-21
Prepared by: Claude Code MetaHarness Architect