We need a benchmark matrix that measures where GPU acceleration is actually beneficial versus CPU SIMD or CPU-local execution.
Background
The docs now define selective acceleration boundaries, but we still need evidence-based break-even thresholds.
Goals
Create a benchmark plan that compares:
- CPU SIMD
- CPU scalar / baseline
- GPU dispatch
- mixed execution paths
Benchmark Areas
ANN Frontdoor
- HNSW CPU vs GPU-assisted vector search
- batch size sensitivity
- top-k size sensitivity
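A minimal sketch of how the ANN sweep could be enumerated as a grid, so that both search paths see the same batch-size and top-k values. The `AnnScenario` type, the path labels, and the sweep values are placeholders for illustration, not agreed settings.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels for the two search paths compared in this area.
ANN_PATHS = ("hnsw_cpu", "gpu_assisted")


@dataclass(frozen=True)
class AnnScenario:
    path: str
    batch_size: int
    top_k: int


def build_ann_grid(batch_sizes, top_ks):
    """Cross product of search path, batch size, and top-k."""
    return [
        AnnScenario(path, batch, k)
        for path, batch, k in product(ANN_PATHS, batch_sizes, top_ks)
    ]


# Placeholder sweep values; the real ranges still need to be chosen.
grid = build_ann_grid(batch_sizes=(1, 32, 1024), top_ks=(10, 100))
print(len(grid))  # 2 paths * 3 batch sizes * 2 top-k values
```

Keeping the grid explicit makes it easy to confirm that no path is benchmarked on a sweep point the other path skipped.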
Tensor Mid-Layer
- CPU SIMD tensor similarity
- GPU contraction / refinement
- mmap-backed artifact access vs device upload
- summary generation and shard relevance scoring
Graph Paths
- bounded graph kernels on GPU
- frontier expansion vs CPU traversal
- neighborhood exploration vs CPU metadata-local traversal
- synchronization-heavy paths vs CPU execution
Cross-cutting
- host↔device transfer costs
- SSD / mmap influence
- cross-shard transfer amplification
- quantized vs non-quantized artifacts
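To make the role of transfer costs concrete, here is a rough sketch of a linear cost model. It assumes CPU time grows as `n * cpu_per_item`, while the GPU path pays a fixed dispatch overhead plus per-item transfer and compute. The model and all parameter names are assumptions for illustration; real measurements may not be linear.

```python
import math


def break_even_batch(cpu_per_item_us, gpu_per_item_us,
                     fixed_overhead_us, transfer_per_item_us):
    """Smallest batch size at which the modelled GPU path is no slower
    than the CPU path, or None if the GPU path never catches up.

    Model (assumed, not measured):
        cpu(n) = n * cpu_per_item_us
        gpu(n) = fixed_overhead_us + n * (gpu_per_item_us + transfer_per_item_us)
    """
    saving_per_item = cpu_per_item_us - (gpu_per_item_us + transfer_per_item_us)
    if saving_per_item <= 0:
        # Transfer plus GPU compute costs at least as much per item as CPU.
        return None
    return max(1, math.ceil(fixed_overhead_us / saving_per_item))


# Illustrative numbers only.
compute_friendly = break_even_batch(2.0, 0.25, 100.0, 0.5)
transfer_bound = break_even_batch(1.0, 0.5, 100.0, 0.75)
print(compute_friendly, transfer_bound)
```

A `None` result marks a workload as acceleration-hostile under this model: no batch size recovers the transfer cost.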
Tensor-Graph extension: dynamic tensor update benchmark track
This issue should explicitly include dynamic tensor-update benchmarks in addition to retrieval-path benchmarks.
Additional benchmark areas
Commit path overhead
- baseline RocksDB transaction
- RocksDB transaction + tensor delta logging
- RocksDB transaction + tensor delta logging + manifest invalidation
Tensor update worker
- small delta patch path
- partial refit path
- full snapshot rebuild path
- rank growth sensitivity
- residual / approximation error tracking
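For residual tracking, one common choice is the relative Frobenius error between the exact tensor slice and its approximation. The sketch below assumes dense row-major nested lists and is meant only to pin down the metric, not the storage format. The function name is hypothetical.

```python
import math


def relative_residual(exact, approx):
    """Relative Frobenius error ||exact - approx||_F / ||exact||_F
    for two equally shaped 2-D nested lists."""
    diff_sq = 0.0
    norm_sq = 0.0
    for row_exact, row_approx in zip(exact, approx):
        for e, a in zip(row_exact, row_approx):
            diff_sq += (e - a) ** 2
            norm_sq += e ** 2
    if norm_sq == 0.0:
        # A zero reference has no meaningful relative error unless both match.
        return 0.0 if diff_sq == 0.0 else math.inf
    return math.sqrt(diff_sq) / math.sqrt(norm_sq)


exact = [[3.0, 4.0], [0.0, 0.0]]
approx = [[3.0, 0.0], [0.0, 0.0]]
residual = relative_residual(exact, approx)
print(residual)
```

Recording this value after each patch, partial refit, and rebuild would let the benchmark compare the three paths on quality as well as latency.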
Planner impact
- summary-first vs direct exact fetch
- fallback frequency to exact graph path
- fan-out reduction
- false-negative risk in routing
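The planner metrics above could be computed from per-query routing records, as in this sketch. The record fields (`routed`, `relevant`, `fell_back`) and the definition of fan-out reduction relative to a full-shard scan are assumptions, not an agreed schema.

```python
def routing_quality(queries, total_shards):
    """Aggregate routing metrics over a list of per-query records.

    Each record is a dict with:
      routed    - set of shard ids the planner chose
      relevant  - set of shard ids that actually held results
      fell_back - True if the query used the exact graph path
    """
    relevant_total = sum(len(q["relevant"]) for q in queries)
    missed = sum(len(q["relevant"] - q["routed"]) for q in queries)
    mean_fan_out = sum(len(q["routed"]) for q in queries) / len(queries)
    return {
        "false_negative_rate": missed / relevant_total if relevant_total else 0.0,
        "fallback_rate": sum(q["fell_back"] for q in queries) / len(queries),
        # Reduction relative to querying every shard.
        "fan_out_reduction": 1.0 - mean_fan_out / total_shards,
    }


# Illustrative records only.
records = [
    {"routed": {1, 2}, "relevant": {1, 2}, "fell_back": False},
    {"routed": {3}, "relevant": {3, 4}, "fell_back": True},
]
quality = routing_quality(records, total_shards=8)
print(quality)
```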
CPU/GPU break-even for tensor maintenance
- batch size sweep
- tensor density sweep
- rank sweep
- host↔device transfer overhead
- CPU SIMD vs GPU for bounded tensor update windows
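Once a batch-size sweep has been measured, the break-even point could be extracted as in the sketch below. It reports the smallest batch size from which the GPU path stays faster for every larger measured size, so a single noisy win is not mistaken for a crossover. The tuple layout and function name are assumptions, and GPU timings are assumed to include host↔device transfer.

```python
def empirical_break_even(samples):
    """samples: iterable of (batch_size, cpu_ms, gpu_ms_with_transfer).

    Returns the smallest batch size after which the GPU path is faster
    at every larger measured size, or None if there is no stable crossover.
    """
    threshold = None
    for batch_size, cpu_ms, gpu_ms in sorted(samples):
        if gpu_ms < cpu_ms:
            if threshold is None:
                threshold = batch_size
        else:
            # GPU lost at this size, so any earlier win was not stable.
            threshold = None
    return threshold


# Illustrative measurements only.
sweep = [(256, 16.0, 5.0), (1, 0.1, 0.9), (64, 4.0, 2.0), (16, 1.0, 1.1)]
stable = empirical_break_even(sweep)
never = empirical_break_even([(1, 0.1, 0.9), (16, 1.0, 1.1)])
print(stable, never)
```

The same function would apply unchanged to the density and rank sweeps by substituting the swept parameter for `batch_size`.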
Additional deliverables
- benchmark binary plan (bench_tensor_*)
- ctest smoke coverage for benchmark executables
- performance baselines for:
  - commit latency
  - update throughput
  - rebuild latency
  - routing quality
  - exact fallback rate
Additional acceptance criteria
- benchmark matrix covers both query-time and update-time tensor workloads
- benchmark scenarios include dynamic tensor-update maintenance
- benchmark suite can evaluate patch vs partial refit vs rebuild
- benchmark outputs are usable by planner and lifecycle decisions
Deliverables
- benchmark matrix document
- scenario catalog
- metrics list:
  - latency
  - throughput
  - memory / VRAM usage
  - transfer overhead
  - recall / quality impact where applicable
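As a starting point for the latency and throughput metrics, a per-scenario summary might look like the following sketch. It uses the nearest-rank percentile definition. The field names and the choice of p50/p99 are assumptions rather than settled reporting requirements.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(latencies_ms, items_per_run):
    """Summarize repeated runs of one scenario.

    latencies_ms  - wall-clock latency of each run, in milliseconds
    items_per_run - number of items (queries, deltas, ...) processed per run
    """
    ordered = sorted(latencies_ms)
    total_ms = sum(ordered)
    return {
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
        "throughput_items_per_s": len(ordered) * items_per_run * 1000 / total_ms,
    }


# Illustrative measurements only.
summary = summarize([4.0, 1.0, 90.0, 2.0, 3.0], items_per_run=64)
print(summary)
```

Reporting a tail percentile next to the median matters here because synchronization-heavy GPU paths can show good medians with poor tails.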
Acceptance Criteria
- includes CPU vs GPU comparisons for ANN, Tensor, and bounded Graph paths
- defines break-even metrics
- distinguishes acceleration-friendly vs acceleration-hostile workloads
- produces inputs usable by roadmap and planner decisions