TurboHybrid Benchmark Suite

This suite is the benchmark plan for the turbohybrid branch. It deliberately uses three benchmark layers, because a single latency benchmark cannot validate a hybrid retrieval index.

Layers

1. IR quality benchmarks

Use these to prove that the hybrid ranker is good, not just fast.

  • BEIR/MTEB: FIQA, NFCorpus, SciFact, HotpotQA, and additional text retrieval tasks as coverage grows.
  • MS MARCO and TREC-DL: web passage ranking, MRR@10, nDCG@10, Recall@100.
  • MIRACL: multilingual retrieval and tokenizer/stemming sensitivity.
  • LoTTE: long-tail entity and forum retrieval.
  • BRIGHT: harder reasoning-oriented retrieval.
  • Small RAG/end-to-end set: retrieval context recall plus answer metrics.

All quality runs should emit standard TREC run files:

qid Q0 docid rank score method

Score them with:

python3 benchmarks/turbohybrid/suite.py score-run \
  --qrels path/to/qrels.trec \
  --run path/to/run.trec \
  --k 10,100

The scorer reports Recall, nDCG, MRR, and MAP at each requested cutoff. Dataset definitions live in config/datasets.json.
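As a rough illustration of what the cutoff metrics mean, the sketch below parses qrels and run lines in the TREC layout shown above and computes Recall@k, MRR@k, and nDCG@k. It is not the code in suite.py: the function names are hypothetical, it assumes linear-gain nDCG, it treats any qrels grade above zero as relevant, and it leaves out MAP for brevity.

```python
import math
from collections import defaultdict


def parse_qrels(lines):
    """Parse 'qid 0 docid grade' lines into {qid: {docid: grade}}."""
    qrels = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue
        qid, _, docid, grade = line.split()
        qrels[qid][docid] = int(grade)
    return qrels


def parse_run(lines):
    """Parse 'qid Q0 docid rank score method' lines into {qid: [docid, ...]}."""
    scored = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        qid, _, docid, _rank, score, _method = line.split()
        scored[qid].append((float(score), docid))
    # Order by descending score; break ties by docid so results are deterministic.
    return {
        qid: [docid for _, docid in sorted(rows, key=lambda r: (-r[0], r[1]))]
        for qid, rows in scored.items()
    }


def score_run(qrels, run, k):
    """Return mean Recall@k, MRR@k, and nDCG@k over all queries that have qrels."""
    recall = mrr = ndcg = 0.0
    for qid, grades in qrels.items():
        ranked = run.get(qid, [])[:k]
        relevant = {d for d, g in grades.items() if g > 0}
        hits = sum(1 for d in ranked if d in relevant)
        recall += hits / len(relevant) if relevant else 0.0
        first = next((i for i, d in enumerate(ranked, 1) if d in relevant), None)
        mrr += 1.0 / first if first else 0.0
        dcg = sum(max(grades.get(d, 0), 0) / math.log2(i + 1)
                  for i, d in enumerate(ranked, 1))
        ideal = sorted((g for g in grades.values() if g > 0), reverse=True)[:k]
        idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, 1))
        ndcg += dcg / idcg if idcg else 0.0
    n = len(qrels) or 1
    return {"recall": recall / n, "mrr": mrr / n, "ndcg": ndcg / n}
```

Queries that appear in the qrels but not in the run contribute zero to every metric here, which penalizes methods that return nothing for a query.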

2. Systems benchmarks

Use these to prove that the implementation is fast and operationally sane:

  • hot and cold latency
  • p50, p95, p99
  • QPS
  • build time
  • index size
  • WAL generated by build, insert, delete, and vacuum
  • insert/delete/vacuum cost
  • filter selectivity behavior at unfiltered, 10%, and 1% selectivity
  • Postgres memory settings recorded with each run
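To make the latency summary concrete, here is a minimal sketch that turns a list of per-query latencies into p50/p95/p99 and a serial QPS figure. It assumes nearest-rank percentiles and a single client issuing queries back to back; the function names are hypothetical and the suite may compute these values differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(latencies_ms):
    """Summarize hot-run latencies, given in milliseconds, for one serial client."""
    total_s = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        # Serial QPS: queries completed per second of summed query time.
        "qps": len(latencies_ms) / total_s if total_s > 0 else float("inf"),
    }
```

Under this definition, a run with 30 samples reports the slowest sample as p99, so tail percentiles from short runs should be read as a rough bound rather than a stable estimate.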

Branch-local smoke run:

PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH" \
python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
  --database contrib_regression \
  --rows 10000 \
  --dimensions 64 \
  --runs 30 \
  --cold-runs 3 \
  --methods turbohybrid,postgres_sql_rrf \
  --output /tmp/turbohybrid_system.json

--cold-runs without --cold-command uses only fresh psql sessions, which does not clear the Postgres buffer cache or the OS page cache. For real cold-cache measurements, provide an explicit command that restarts Postgres or flushes the target environment:

python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
  --cold-command "brew services restart postgresql@16 && sleep 2"
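The cold-run behavior described above can be pictured with a short sketch: run the reset command, if one is given, before each timed query. This is an illustration under assumptions, not the runner's actual code; `timed_cold_runs` and `run_query` are hypothetical names.

```python
import subprocess
import time


def timed_cold_runs(run_query, cold_runs, cold_command=None):
    """Time run_query() after optionally resetting the environment before each run."""
    timings_ms = []
    for _ in range(cold_runs):
        if cold_command:
            # shell=True because the command may chain steps with &&,
            # as in the restart-and-sleep example above.
            subprocess.run(cold_command, shell=True, check=True)
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms
```

With check=True, a failed restart aborts the run instead of silently recording a warm measurement as cold.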

Use a stable dedicated host for publishable numbers. The synthetic runner is useful for branch regression and method comparison; larger public results should pin dataset, hardware, Postgres settings, extension commit, and query files.

3. Reference-tool comparisons

Compare TurboHybrid against these baselines:

  • Postgres FTS + pgvector SQL-level RRF: in-process reference baseline.
  • Pyserini/Anserini/Lucene BM25: sparse IR reference.
  • Lucene BM25 plus dense run fusion: external hybrid reference.
  • ParadeDB BM25: PostgreSQL BM25 systems baseline.
  • Optional external hybrid systems: Elastic/OpenSearch, Qdrant, Weaviate.

ParadeDB's public benchmark suite is mostly a single-query latency benchmark: it runs dataset query files, records hot/cold measurements, waits for stable timing, uses pg_stat_statements, and defaults to a Stack Overflow dataset. That makes it a good model for the systems harness, but it is not enough to validate hybrid RAG quality. Treat ParadeDB as a performance baseline and keep BEIR/MS MARCO/MIRACL/LoTTE/BRIGHT/RAG quality scores separate.

Commands

List configured datasets and methods:

python3 benchmarks/turbohybrid/suite.py list

Emit the benchmark matrix as JSON:

python3 benchmarks/turbohybrid/suite.py plan

Run the built-in systems benchmark:

python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
  --database contrib_regression \
  --rows 1000 \
  --runs 10 \
  --methods turbohybrid,postgres_sql_rrf

Run the real FIQA RAG benchmark with cached text-embedding vectors:

PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH" \
python3 benchmarks/turbohybrid/suite.py run-real-rag \
  --database contrib_regression \
  --dataset .cache/beir/fiqa \
  --dataset-name fiqa-openai \
  --methods postgres_sql_rrf,turbohybrid,turbohybrid_exact_storage_off \
  --dense-k 400 \
  --bm25-k 400 \
  --final-k 10 \
  --output-json benchmarks/turbohybrid/results/real_rag_fiqa.json \
  --output-md benchmarks/turbohybrid/results/real_rag_fiqa.md

The three methods are:

  • postgres_sql_rrf: the standard pgvector baseline, with HNSW over the OpenAI embedding column, PostgreSQL full-text search over the same documents, and SQL-level reciprocal rank fusion.
  • turbohybrid: one hybrid index over the same vector and tsvector columns, with full-precision exact vectors retained for rescoring.
  • turbohybrid_exact_storage_off: the same hybrid index shape with tq_exact_storage = off, so the index stores only quantized codes for the dense path.

The runner copies benchmark queries into Postgres and measures latency inside a server-side loop with clock_timestamp() around each retrieval query. Use --max-docs and --max-queries for smoke runs, but publishable numbers should use the full FIQA split.
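For readers unfamiliar with reciprocal rank fusion, the sketch below fuses a dense candidate list and a BM25 candidate list the way RRF is usually defined: each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here; this README does not state which constant the SQL baseline uses, and `rrf_fuse` is a hypothetical name.

```python
def rrf_fuse(ranked_lists, final_k, k=60):
    """Fuse ranked docid lists with reciprocal rank fusion."""
    scores = {}
    for ranked in ranked_lists:
        for rank, docid in enumerate(ranked, start=1):
            scores[docid] = scores.get(docid, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by docid for deterministic output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [docid for docid, _ in ordered[:final_k]]


dense = ["a", "b", "c"]   # dense candidates, best first
bm25 = ["b", "d", "a"]    # BM25 candidates, best first
print(rrf_fuse([dense, bm25], final_k=3))  # ['b', 'a', 'd']
```

Because RRF only looks at ranks, the dense and BM25 candidate depths (--dense-k and --bm25-k above) decide which documents can be fused at all, while --final-k only truncates the fused list.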

Run the SIMD hot-path profile:

python3 benchmarks/turbohybrid/suite.py run-simd-profile \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 768 \
  --runs 50 \
  --warmup-runs 10 \
  --dense-k 100 \
  --bm25-k 100 \
  --final-k 20 \
  --output /tmp/turbohybrid_simd_100k_768.json

The SIMD profile rebuilds the synthetic TurboHybrid index, runs dense-only, BM25-only, hybrid, no-lexical-match, and delta-heavy query shapes, and records hybrid_last_scan_stats(), tq_last_scan_stats(), tq_last_simd_stats(), tq_simd_capabilities(), EXPLAIN (ANALYZE, BUFFERS), p50/p95/p99, build time, WAL, and index size. Use --set to compare forced scalar versus auto dispatch:

python3 benchmarks/turbohybrid/suite.py run-simd-profile \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --runs 30 \
  --set "hnsw.tq_simd_force=scalar" \
  --set "hybrid.bm25_simd_force=scalar" \
  --output /tmp/turbohybrid_simd_scalar.json

For engine timing, prefer the in-backend profiler. It keeps one PostgreSQL backend session open, runs each query shape in a PL/pgSQL loop, and reports microsecond latency from clock_timestamp() separately from optional legacy psql subprocess timing:

python3 benchmarks/turbohybrid/suite.py profile-inbackend \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --k 100 \
  --final-k 10 \
  --runs 1000 \
  --warmup-runs 10 \
  --cli-runs 10 \
  --output /tmp/turbohybrid_profile_inbackend.json
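The in-backend approach described above can be approximated by generating a PL/pgSQL DO block that brackets each execution with clock_timestamp(). The helper below is a hypothetical sketch, not the profiler's actual SQL: it reports each run through RAISE NOTICE, whereas the real profiler may collect timings differently, and it expects a single SELECT statement as input.

```python
def build_inbackend_timing_sql(query_sql, runs, warmup_runs=0):
    """Return a PL/pgSQL DO block that times query_sql inside one backend session."""
    if runs < 1 or warmup_runs < 0:
        raise ValueError("runs must be >= 1 and warmup_runs must be >= 0")
    query = query_sql.strip().rstrip(";")
    if "$bench$" in query:
        raise ValueError("query must not contain the $bench$ dollar-quote tag")
    return f"""DO $bench$
DECLARE
  started timestamptz;
  elapsed_us double precision;
BEGIN
  -- Warmup iterations are executed but not reported.
  FOR i IN 1..{int(warmup_runs)} LOOP
    PERFORM * FROM ({query}) AS q;
  END LOOP;
  FOR i IN 1..{int(runs)} LOOP
    started := clock_timestamp();
    PERFORM * FROM ({query}) AS q;
    elapsed_us := extract(epoch FROM clock_timestamp() - started) * 1000000;
    RAISE NOTICE 'run=% latency_us=%', i, elapsed_us;
  END LOOP;
END
$bench$;
"""
```

clock_timestamp() is used rather than now() because now() returns the transaction start time and would report zero elapsed time inside a single DO block.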

On Linux hosts, --perf-command can wrap each measured psql query, for example:

python3 benchmarks/turbohybrid/suite.py run-simd-profile \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --runs 20 \
  --perf-command "perf stat -e cycles,instructions,cache-misses,branches,branch-misses"

Focused SIMD microbenchmarks are available when the full profile is too broad:

python3 benchmarks/turbohybrid/suite.py bench-bm25-decode \
  --database contrib_regression \
  --rows 1000000 \
  --dimensions 768 \
  --encoding auto \
  --runs 30 \
  --output /tmp/bm25_decode_offset16.json

python3 benchmarks/turbohybrid/suite.py bench-bm25-score \
  --database contrib_regression \
  --rows 1000000 \
  --dimensions 768 \
  --query-shape common-term \
  --bm25-k 100 \
  --precompute-tf-norm on \
  --set "hybrid.bm25_simd_force=auto" \
  --output /tmp/bm25_score_precomputed.json

python3 benchmarks/turbohybrid/suite.py bench-exact-rescore \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --metric cosine \
  --rescore-count 400 \
  --set "hnsw.tq_exact_simd_force=auto" \
  --output /tmp/exact_rescore_auto.json

python3 benchmarks/turbohybrid/suite.py bench-weighted-tq \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --tq-bits 4 \
  --weighted on \
  --set "hnsw.tq_graph_avx512_weighted=off" \
  --output /tmp/weighted_tq_avx2.json

python3 benchmarks/turbohybrid/suite.py bench-weighted-tq \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --tq-bits 4 \
  --weighted on \
  --set "hnsw.tq_graph_avx512_weighted=on" \
  --output /tmp/weighted_tq_avx512.json

python3 benchmarks/turbohybrid/suite.py bench-dense-batch \
  --database contrib_regression \
  --rows 100000 \
  --dimensions 1536 \
  --dense-k 400 \
  --set "hnsw.tq_graph_batch_scoring=on" \
  --output /tmp/dense_batch_on.json

bm25_precompute_tf_norm is off by default. When enabled, suitable BM25 chunks use the offset16 layout plus Q16 precomputed term-frequency normalization, which lets the AVX2 and NEON score kernels avoid per-posting document-length gathers. Delta documents still use scalar BM25 scoring until they are compacted into the base postings.
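The idea behind precomputed term-frequency normalization can be sketched as follows. The standard BM25 saturation term depends on document length, so storing it per posting removes the document-length lookup at query time. This sketch assumes that "Q16" means a 16-bit quantized value scaled over the range of the saturation term, and it uses the common defaults k1 = 1.2 and b = 0.75; both are assumptions, and the extension's real encoding may differ.

```python
K1 = 1.2       # common BM25 default, assumed here
B = 0.75       # common BM25 default, assumed here
Q16_MAX = 65535


def tf_norm(tf, doc_len, avg_doc_len, k1=K1, b=B):
    """BM25 saturation term: tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))."""
    return tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))


def quantize_tf_norm(norm, k1=K1):
    """Map a saturation term in [0, k1 + 1) onto an integer in 0..65535 at build time."""
    return min(Q16_MAX, max(0, round(norm / (k1 + 1.0) * Q16_MAX)))


def score_posting(idf, q16_norm, k1=K1):
    """Query-time score for one posting; no document-length lookup is needed."""
    return idf * (q16_norm / Q16_MAX) * (k1 + 1.0)
```

In this sketch the quantization error per posting is at most half a step, that is idf * (k1 + 1) / (2 * 65535), which is the kind of bound a scalar-versus-SIMD parity test would need to tolerate.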

Run a scalar-vs-auto SIMD matrix and emit both machine-readable and Markdown reports:

python3 benchmarks/turbohybrid/suite.py run-simd-matrix \
  --database contrib_regression \
  --rows 10000,100000 \
  --dimensions 768,1536 \
  --candidate-budgets 50,100,400 \
  --runs 30 \
  --warmup-runs 10 \
  --output-json /tmp/turbohybrid_simd_matrix.json \
  --output-md /tmp/turbohybrid_simd_matrix.md

Build modes are controlled through SIMD_BUILD:

make clean && make SIMD_BUILD=portable
make clean && make SIMD_BUILD=native
make clean && make SIMD_BUILD=none

portable is the default and avoids global -march=native; architecture kernels must use target attributes plus runtime dispatch. native is for local benchmarking only. none defines TQ_DISABLE_SIMD=1 and keeps scalar fallback paths available for parity testing.

Run the development acceptance gate and write JSON plus Markdown reports:

python3 benchmarks/turbohybrid/suite.py acceptance \
  --database contrib_regression \
  --output-dir benchmarks/turbohybrid/results

The default acceptance profile is dev: 10k rows, 16 dimensions, one hot run, concurrency 1, all in-process methods, all query/mutation shapes, and fast validation. This is the intended branch-level loop after installcheck and the targeted SQL stress scripts. The fast validation mode keeps correctness probes small: it checks deterministic hybrid top-k and one WAND-vs-DAAT rare/intersection query instead of replaying the expensive high-df validation set.

Use smoke during tight C edit loops:

python3 benchmarks/turbohybrid/suite.py acceptance \
  --profile smoke \
  --database contrib_regression \
  --output-dir /tmp/turbohybrid_smoke

smoke runs only TurboHybrid on 1k rows with the smallest shape set that still checks dense, BM25, hybrid, delta insert, compaction behavior, and the fast validation probes.

Use full only before publishing numbers or merging a performance-sensitive change:

python3 benchmarks/turbohybrid/suite.py acceptance \
  --profile full \
  --database contrib_regression \
  --output-dir benchmarks/turbohybrid/results

full runs 10k and 100k rows, 64 dimensions, 30 measured runs, warmup, and concurrency 1,4,16. It intentionally takes much longer because it builds and mutates TurboHybrid, TurboQuant dense, and Postgres SQL RRF baselines. It also uses standard validation, which includes the high-df WAND-vs-DAAT checks that are too expensive for the default edit loop. Latency thresholds are enforced for full runs and for explicit full-equivalent runs with at least 100k rows, 30 measured runs, and concurrency 1,4,16.

Override validation independently when needed:

python3 benchmarks/turbohybrid/suite.py acceptance \
  --database contrib_regression \
  --validation none \
  --output-dir /tmp/turbohybrid_timing_only

Use --validation none for timing-only experiments after a recent correctness gate. Use --validation standard before relying on the results for review, and reserve the full profile for publishable numbers.

Default acceptance thresholds live in benchmarks/turbohybrid/config/acceptance_thresholds.json. Override them with --thresholds path/to/thresholds.json when running on a different host class. The generated JSON and Markdown include the commit SHA, PostgreSQL version, host/RAM information, method DDL, measured GUCs, build/storage/WAL numbers, latency percentiles, scan stats, mutation/compaction counters, and threshold warnings.
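As an illustration of how threshold warnings could be derived, the sketch below compares measured values against upper limits. The schema of acceptance_thresholds.json is not shown in this README, so the flat metric-to-limit mapping and the metric names used here are hypothetical.

```python
def threshold_warnings(measured, thresholds):
    """Return one warning string per metric that is missing or above its limit."""
    warnings = []
    for metric, limit in sorted(thresholds.items()):
        value = measured.get(metric)
        if value is None:
            warnings.append(f"{metric}: no measurement recorded")
        elif value > limit:
            warnings.append(f"{metric}: {value} exceeds limit {limit}")
    return warnings


# Hypothetical metric names, for illustration only.
measured = {"hybrid_p95_ms": 12.0, "build_seconds": 3.0}
limits = {"hybrid_p95_ms": 10.0, "build_seconds": 5.0, "wal_bytes": 100}
for warning in threshold_warnings(measured, limits):
    print(warning)
```

Treating a missing measurement as a warning keeps a renamed or skipped metric from passing the gate unnoticed.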

To produce 10k-row and 100k-row baseline reports with custom run settings, use:

python3 benchmarks/turbohybrid/suite.py engine-speed \
  --database contrib_regression \
  --profile full \
  --rows-list 10000,100000 \
  --dimensions 1536 \
  --k 100 \
  --final-k 100 \
  --runs 30 \
  --output-dir benchmarks/turbohybrid/results

engine-speed writes engine_speed_<commit>_<timestamp>.json and .md reports. Its primary latency numbers are measured inside one PostgreSQL backend with clock_timestamp() loops, while optional --cli-runs records legacy psql subprocess latency separately. The report includes BM25 strategy/impact diagnostics, dense graph timer breakdowns, and build/WAL/storage tables.

Run the explicit top-N fusion smoke test for the prompt-07 target case (dense_k=5000, bm25_k=5000, final_k=20):

psql -X -v ON_ERROR_STOP=1 \
  -d contrib_regression \
  -f test/bench/turbohybrid_fusion_topn.sql

Score a TREC run:

python3 benchmarks/turbohybrid/suite.py score-run \
  --qrels benchmarks/turbohybrid/examples/qrels.trec \
  --run benchmarks/turbohybrid/examples/run.trec

Publishable Scorecard

Every published benchmark should include:

  • commit SHA and extension SQL version
  • CPU, RAM, storage, OS, compiler, Postgres version
  • Postgres settings: shared_buffers, work_mem, maintenance_work_mem, WAL/checkpoint settings, parallelism settings
  • dataset name, split, corpus size, query count, document text field, language
  • embedding model and dimensionality
  • method DDL and all GUCs
  • build time, index size, WAL bytes
  • p50/p95/p99, hot/cold methodology, QPS and concurrency
  • quality metrics with qrels
  • filter selectivity cases
  • insert/delete/vacuum measurements

Current Limitations

  • The synthetic systems runner is not a replacement for BEIR or TREC-DL relevance evaluation. It is a fast branch-level performance check.
  • The in-index BM25 storage spills build runs and splits large postings lists into chunks; publishable high-df runs should still include a dedicated common-term stress case because compression and WAND skipping are sensitive to corpus order.
  • Delta storage currently has an explicit page-size guard for oversized single-row delta tuples. The insert fails before the append, with a clear error, instead of chunking one huge inserted document across multiple delta tuples.
  • True cold-cache timing requires an external restart/drop-cache command.
  • External baselines are normalized by importing their run files or JSON summaries; this repository does not vendor Pyserini, Lucene, ParadeDB, Elastic, OpenSearch, Qdrant, or Weaviate.
  • RAG answer metrics need a project-specific evaluator because generation model, prompt, and citation policy affect the result.