Define benchmark matrix for CPU SIMD vs GPU dispatch in ANN, Tensor, and Graph paths #5466

@makr-code

We need a benchmark matrix that measures where GPU acceleration is actually beneficial versus CPU SIMD or CPU-local execution.

Background

The docs now define selective acceleration boundaries, but we still need evidence-based break-even thresholds.

Goals

Create a benchmark plan that compares:

  • CPU SIMD
  • CPU scalar / baseline
  • GPU dispatch
  • mixed execution paths
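
One way to make the comparison concrete is to enumerate the matrix as the cross product of workload path, execution mode, and sweep parameter. The sketch below is only an illustration: the identifiers (`cpu_scalar`, `gpu_dispatch`, and so on) and the batch sizes are placeholders chosen for this example, not names or thresholds defined by the project.

```python
from itertools import product

# Hypothetical labels: the issue names these categories but not identifiers.
EXECUTION_MODES = ["cpu_scalar", "cpu_simd", "gpu_dispatch", "mixed"]
PATHS = ["ann", "tensor", "graph"]
BATCH_SIZES = [1, 32, 1024]  # placeholder sweep values, not measured thresholds


def build_matrix(paths, modes, batch_sizes):
    """Return one scenario dict per (path, mode, batch size) combination."""
    return [
        {"path": p, "mode": m, "batch_size": b}
        for p, m, b in product(paths, modes, batch_sizes)
    ]


matrix = build_matrix(PATHS, EXECUTION_MODES, BATCH_SIZES)
print(len(matrix))  # 3 paths * 4 modes * 3 batch sizes
print(matrix[0])
```

Keeping the matrix as plain data makes it easy to add further sweep dimensions (top-k, rank, density) without changing the benchmark drivers.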

Benchmark Areas

ANN Frontdoor

  • HNSW CPU vs GPU-assisted vector search
  • batch size sensitivity
  • top-k size sensitivity

Tensor Mid-Layer

  • CPU SIMD tensor similarity
  • GPU contraction / refinement
  • mmap-backed artifact access vs device upload
  • summary generation and shard relevance scoring

Graph Paths

  • bounded graph kernels on GPU
  • frontier expansion vs CPU traversal
  • neighborhood exploration vs CPU metadata-local traversal
  • synchronization-heavy paths vs CPU execution

Cross-cutting

  • host↔device transfer costs
  • SSD / mmap influence
  • cross-shard transfer amplification
  • quantized vs non-quantized artifacts
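
The transfer-cost item can be reasoned about with a simple additive model: GPU dispatch only pays off when fixed dispatch overhead, host↔device copy time, and kernel time together stay below the CPU time. The sketch below assumes this linear model, and all numbers in the usage lines are illustrative. Real transfers may overlap with compute or be dominated by SSD / mmap page faults, which this model ignores.

```python
def gpu_total_seconds(bytes_moved, kernel_seconds,
                      bandwidth_bytes_per_s, fixed_overhead_s):
    """Estimated end-to-end GPU time under a simple additive cost model."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_seconds = bytes_moved / bandwidth_bytes_per_s
    return fixed_overhead_s + transfer_seconds + kernel_seconds


def gpu_is_beneficial(cpu_seconds, bytes_moved, kernel_seconds,
                      bandwidth_bytes_per_s, fixed_overhead_s):
    """True when the modeled GPU path is strictly faster than the CPU path."""
    gpu_seconds = gpu_total_seconds(
        bytes_moved, kernel_seconds, bandwidth_bytes_per_s, fixed_overhead_s
    )
    return gpu_seconds < cpu_seconds


# Illustrative inputs only: 64 MiB moved at an assumed 16 GB/s,
# a 1 ms kernel, and 20 microseconds of fixed dispatch overhead.
args = (64 * 1024 * 1024, 0.001, 16e9, 20e-6)
print(gpu_is_beneficial(0.004, *args))  # CPU takes 4 ms
print(gpu_is_beneficial(0.010, *args))  # CPU takes 10 ms
```

In a model like this, the transfer term alone can exceed the CPU time for small or memory-bound workloads, which is the case the benchmark matrix is meant to expose.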

Tensor-Graph extension: dynamic tensor update benchmark track

This issue should explicitly include dynamic tensor-update benchmarks in addition to retrieval-path benchmarks.

Additional benchmark areas

Commit path overhead

  • baseline RocksDB transaction
  • RocksDB transaction + tensor delta logging
  • RocksDB transaction + tensor delta logging + manifest invalidation
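
Because the three commit variants are cumulative, the cost of each added stage can be read off as the difference between adjacent variants. The sketch below assumes per-commit latency samples are already collected and compares medians. The function name and the sample values are made up for illustration and are not measurements.

```python
from statistics import median


def incremental_commit_overhead(baseline, with_delta_log, with_log_and_invalidation):
    """Split commit overhead into per-stage costs using median latencies."""
    base = median(baseline)
    logged = median(with_delta_log)
    invalidated = median(with_log_and_invalidation)
    return {
        "baseline": base,
        "delta_logging": logged - base,
        "manifest_invalidation": invalidated - logged,
        "total_overhead_ratio": invalidated / base,
    }


# Made-up latency samples in milliseconds, for illustration only.
result = incremental_commit_overhead(
    baseline=[1.0, 1.2, 1.1],
    with_delta_log=[1.5, 1.4, 1.6],
    with_log_and_invalidation=[1.9, 2.0, 1.8],
)
print(result)
```

Medians are used here to reduce sensitivity to outliers such as compaction stalls. A tail percentile may be the more relevant figure for commit-latency baselines.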

Tensor update worker

  • small delta patch path
  • partial refit path
  • full snapshot rebuild path
  • rank growth sensitivity
  • residual / approximation error tracking
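
The issue does not specify how approximation error should be tracked. One common choice, assumed here, is the relative Frobenius residual between an exact matrix slice and its patched or refitted approximation. The sketch uses plain nested lists to stay dependency-free.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def relative_residual(exact, approx):
    """||exact - approx||_F / ||exact||_F for equally shaped matrices."""
    if len(exact) != len(approx) or any(
        len(a) != len(b) for a, b in zip(exact, approx)
    ):
        raise ValueError("matrices must have the same shape")
    diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(exact, approx)]
    denom = frobenius_norm(exact)
    if denom == 0.0:
        # An all-zero reference is only matched by an all-zero approximation.
        return 0.0 if frobenius_norm(diff) == 0.0 else math.inf
    return frobenius_norm(diff) / denom


exact = [[3.0, 0.0], [0.0, 4.0]]
approx = [[3.0, 0.0], [0.0, 0.0]]  # pretend a rank-1 patch dropped one component
print(relative_residual(exact, approx))
```

Tracking this value after each patch or partial refit would give the update worker a measurable trigger for escalating to a full snapshot rebuild, under whatever threshold the benchmarks end up justifying.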

Planner impact

  • summary-first vs direct exact fetch
  • fallback frequency to exact graph path
  • fan-out reduction
  • false-negative risk in routing
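
The planner-impact items can be turned into numbers once each benchmark query records which shards were routed to, which shards were actually relevant, and whether the exact graph path was used as a fallback. The record layout and metric definitions below are assumptions for illustration, not an existing schema.

```python
def routing_metrics(queries):
    """Aggregate planner-routing metrics over per-query records.

    Each record is a dict with hypothetical keys:
      routed        set of shard ids the planner selected
      relevant      set of shard ids that truly hold matching data
      total_shards  number of shards a non-routed query would touch
      fell_back     True if the exact graph path had to be used
    """
    if not queries:
        raise ValueError("need at least one query record")
    fallbacks = sum(1 for q in queries if q["fell_back"])
    relevant = sum(len(q["relevant"]) for q in queries)
    missed = sum(len(q["relevant"] - q["routed"]) for q in queries)
    routed = sum(len(q["routed"]) for q in queries)
    total = sum(q["total_shards"] for q in queries)
    return {
        "fallback_rate": fallbacks / len(queries),
        "false_negative_rate": missed / relevant if relevant else 0.0,
        "fan_out_reduction": 1.0 - routed / total if total else 0.0,
    }


queries = [
    {"routed": {0, 1}, "relevant": {0}, "total_shards": 8, "fell_back": False},
    {"routed": {2}, "relevant": {2, 3}, "total_shards": 8, "fell_back": True},
]
print(routing_metrics(queries))
```

Here the false-negative rate counts relevant shards that routing skipped, so it should be read together with the fallback rate: a low fan-out is only useful if missed shards stay rare.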

CPU/GPU break-even for tensor maintenance

  • batch size sweep
  • tensor density sweep
  • rank sweep
  • host↔device transfer overhead
  • CPU SIMD vs GPU for bounded tensor update windows
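
Given a measured sweep, an empirical break-even point can be defined as the smallest batch size from which the GPU stays faster at every larger measured size. That definition is an assumption made for this sketch. It deliberately ignores isolated GPU wins at small sizes that are later reversed. The sample timings are invented.

```python
def break_even_batch_size(samples):
    """Return the smallest batch size from which GPU beats CPU at all
    larger measured sizes, or None if there is no stable crossover.

    samples: iterable of (batch_size, cpu_seconds, gpu_seconds).
    """
    candidate = None
    for batch, cpu_s, gpu_s in sorted(samples):
        if gpu_s < cpu_s:
            if candidate is None:
                candidate = batch  # first size of the current winning streak
        else:
            candidate = None  # GPU lost again, so any earlier win was not stable
    return candidate


# Invented timings for illustration only.
sweep = [(1, 1.0, 5.0), (16, 4.0, 5.5), (256, 40.0, 9.0), (4096, 600.0, 60.0)]
print(break_even_batch_size(sweep))
```

The same function can be reused for the density and rank sweeps by substituting the swept parameter for the batch size.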

Additional deliverables

  • benchmark binary plan (bench_tensor_*)
  • ctest smoke coverage for benchmark executables
  • performance baselines for:
    • commit latency
    • update throughput
    • rebuild latency
    • routing quality
    • exact fallback rate

Additional acceptance criteria

  • benchmark matrix covers both query-time and update-time tensor workloads
  • benchmark scenarios include dynamic tensor-update maintenance
  • benchmark suite can evaluate patch vs partial refit vs rebuild
  • benchmark outputs are usable by planner and lifecycle decisions

Deliverables

  • benchmark matrix document
  • scenario catalog
  • metrics list:
    • latency
    • throughput
    • memory / VRAM usage
    • transfer overhead
    • recall / quality impact where applicable
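
Two of the listed metrics have standard definitions that are easy to pin down early: recall@k against an exact top-k result, and tail latency as a percentile. The sketch below uses the nearest-rank percentile method. The choice of method and the helper names are assumptions, not project conventions.

```python
import math


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k ids that the approximate top-k recovered."""
    exact = set(exact_ids[:k])
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(approx_ids[:k]) & exact) / len(exact)


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile (0 < pct <= 100) of a non-empty sample list."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))
print(percentile_nearest_rank([5, 1, 3, 2, 4], 50))
print(percentile_nearest_rank([5, 1, 3, 2, 4], 99))
```

Reporting recall next to latency for every GPU and quantized scenario keeps a speedup from being counted as a win when it comes at the cost of result quality.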

Acceptance Criteria

  • includes CPU vs GPU comparisons for ANN, Tensor, and bounded Graph paths
  • defines break-even metrics
  • distinguishes acceleration-friendly vs acceleration-hostile workloads
  • produces inputs usable by roadmap and planner decisions

Labels: performance (auto-created by issue manager)
