This suite is the benchmark plan for the turbohybrid branch. It deliberately
uses three benchmark layers (retrieval quality, systems performance, and
baseline comparison), because a single latency benchmark cannot validate a
hybrid retrieval index.
Use these quality benchmarks to show that the hybrid ranker retrieves relevant results, not just that it is fast.
- BEIR/MTEB: FIQA, NFCorpus, SciFact, HotpotQA, and additional text retrieval tasks as coverage grows.
- MS MARCO and TREC-DL: web passage ranking, MRR@10, nDCG@10, Recall@100.
- MIRACL: multilingual retrieval and tokenizer/stemming sensitivity.
- LoTTE: long-tail entity and forum retrieval.
- BRIGHT: harder reasoning-oriented retrieval.
- Small RAG/end-to-end set: retrieval context recall plus answer metrics.
All quality runs should emit standard TREC run files:
qid Q0 docid rank score method
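As a minimal sketch of producing lines in that six-column layout, the following Python helper formats one query's results. The function name `format_trec_run` and the six-decimal score format are illustrative assumptions, not part of suite.py.

```python
from typing import Iterable, List, Tuple


def format_trec_run(qid: str, scored: Iterable[Tuple[str, float]], method: str) -> List[str]:
    """Format (docid, score) pairs for one query as TREC run lines.

    Hypothetical helper: sorts by descending score and assigns 1-based ranks.
    """
    ordered = sorted(scored, key=lambda pair: -pair[1])
    lines = []
    for rank, (docid, score) in enumerate(ordered, start=1):
        # Columns: qid, literal Q0, docid, rank, score, method tag.
        lines.append(f"{qid} Q0 {docid} {rank} {score:.6f} {method}")
    return lines


run_lines = format_trec_run("q1", [("d2", 0.55), ("d7", 0.91)], "turbohybrid")
print("\n".join(run_lines))
```

The `Q0` column is a fixed literal that standard TREC tooling expects, and the final column is a free-form tag that identifies the method.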
Score them with:
python3 benchmarks/turbohybrid/suite.py score-run \
--qrels path/to/qrels.trec \
--run path/to/run.trec \
--k 10,100
The scorer reports Recall, nDCG, MRR, and MAP at each requested cutoff. Dataset
definitions live in config/datasets.json.
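For orientation, the four metrics can be sketched for a single query as below. This is an illustrative reimplementation under common conventions (linear gain for nDCG, any grade above zero counted as relevant), not the scorer's actual code; all function names are assumptions.

```python
import math
from typing import Dict, List


def recall_at_k(ranked: List[str], rels: Dict[str, int], k: int) -> float:
    relevant = {d for d, grade in rels.items() if grade > 0}
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def mrr_at_k(ranked: List[str], rels: Dict[str, int], k: int) -> float:
    # Reciprocal rank of the first relevant document within the cutoff.
    for rank, d in enumerate(ranked[:k], start=1):
        if rels.get(d, 0) > 0:
            return 1.0 / rank
    return 0.0


def ap_at_k(ranked: List[str], rels: Dict[str, int], k: int) -> float:
    relevant = {d for d, grade in rels.items() if grade > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            hits += 1
            total += hits / rank  # precision at each relevant hit
    return total / len(relevant)


def ndcg_at_k(ranked: List[str], rels: Dict[str, int], k: int) -> float:
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    gains = [max(rels.get(d, 0), 0) for d in ranked[:k]]
    ideal = sorted((g for g in rels.values() if g > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


qrels = {"d1": 1, "d3": 1}
ranking = ["d2", "d1", "d3"]
print(recall_at_k(ranking, qrels, 2), mrr_at_k(ranking, qrels, 10),
      ap_at_k(ranking, qrels, 10), ndcg_at_k(ranking, qrels, 3))
```

Per-query values are averaged over all queries to give the reported numbers; MAP is the mean of the per-query average precision.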
Use these systems measurements to show that the implementation is fast and operationally sound:
- hot and cold latency
- p50, p95, p99
- QPS
- build time
- index size
- WAL generated by build, insert, delete, and vacuum
- insert/delete/vacuum cost
- filter selectivity behavior at unfiltered, 10%, and 1% selectivity
- Postgres memory settings recorded with each run
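The latency percentiles and QPS in the list above can be derived from raw per-query timings roughly as follows. This sketch assumes nearest-rank percentiles and sequential single-client queries; the runner may use a different percentile definition, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latencies given in milliseconds."""
    total_seconds = sum(samples_ms) / 1000.0
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        # Assumes queries ran back to back on one connection.
        "qps": len(samples_ms) / total_seconds,
    }


summary = summarize([float(ms) for ms in range(1, 101)])
print(summary)
```

With only 30 measured runs, p99 under nearest-rank is simply the slowest sample, so tail percentiles from short runs should be read with care.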
Branch-local smoke run:
PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH" \
python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
--database contrib_regression \
--rows 10000 \
--dimensions 64 \
--runs 30 \
--cold-runs 3 \
--methods turbohybrid,postgres_sql_rrf \
--output /tmp/turbohybrid_system.json
--cold-runs without --cold-command uses only fresh psql sessions. For real
cold-cache measurements, provide an explicit command that restarts Postgres or
flushes the target environment:
python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
--cold-command "brew services restart postgresql@16 && sleep 2"
Use a stable dedicated host for publishable numbers. The synthetic runner is useful for branch regression and method comparison; larger public results should pin dataset, hardware, Postgres settings, extension commit, and query files.
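The difference between the two cold modes can be sketched as follows. This is not the runner's implementation: `measure_cold_runs` and its parameters are hypothetical, and the query is represented by an arbitrary callable.

```python
import subprocess
import time
from typing import Callable, List, Optional


def measure_cold_runs(run_query: Callable[[], None],
                      cold_runs: int,
                      cold_command: Optional[str] = None) -> List[float]:
    """Return one latency in milliseconds per cold run.

    If cold_command is given, it is executed before every measurement
    (for example a restart or cache flush). Without it, the only thing
    that is "cold" is whatever run_query itself sets up, such as a new
    client session.
    """
    timings_ms = []
    for _ in range(cold_runs):
        if cold_command:
            # check=True makes a failed restart abort the benchmark.
            subprocess.run(cold_command, shell=True, check=True)
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


timings = measure_cold_runs(lambda: None, cold_runs=3)
print(timings)
```

Keeping the cold command outside the timed region means restart time is not counted as query latency.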
Compare TurboHybrid against these baselines:
- Postgres FTS + pgvector SQL-level RRF: in-process reference baseline.
- Pyserini/Anserini/Lucene BM25: sparse IR reference.
- Lucene BM25 plus dense run fusion: external hybrid reference.
- ParadeDB BM25: PostgreSQL BM25 systems baseline.
- Optional external hybrid systems: Elastic/OpenSearch, Qdrant, Weaviate.
ParadeDB's public benchmark suite is mostly a single-query latency benchmark:
it runs dataset query files, records hot/cold measurements, waits for stable
timing, uses pg_stat_statements, and defaults to a Stack Overflow dataset.
That makes it a good model for the systems harness, but it is not enough to
validate hybrid RAG quality. Treat ParadeDB as a performance baseline and keep
BEIR/MS MARCO/MIRACL/LoTTE/BRIGHT/RAG quality scores separate.
List configured datasets and methods:
python3 benchmarks/turbohybrid/suite.py list
Emit the benchmark matrix as JSON:
python3 benchmarks/turbohybrid/suite.py plan
Run the built-in systems benchmark:
python3 benchmarks/turbohybrid/suite.py run-system-synthetic \
--database contrib_regression \
--rows 1000 \
--runs 10 \
--methods turbohybrid,postgres_sql_rrf
Run the real FIQA RAG benchmark with cached text-embedding vectors:
PATH="/opt/homebrew/opt/postgresql@16/bin:$PATH" \
python3 benchmarks/turbohybrid/suite.py run-real-rag \
--database contrib_regression \
--dataset .cache/beir/fiqa \
--dataset-name fiqa-openai \
--methods postgres_sql_rrf,turbohybrid,turbohybrid_exact_storage_off \
--dense-k 400 \
--bm25-k 400 \
--final-k 10 \
--output-json benchmarks/turbohybrid/results/real_rag_fiqa.json \
--output-md benchmarks/turbohybrid/results/real_rag_fiqa.md
postgres_sql_rrf is the normal pgvector baseline: HNSW over the OpenAI
embedding column, PostgreSQL full-text search over the same documents, and
SQL-level reciprocal rank fusion. turbohybrid uses one hybrid index over the
same vector and tsvector columns with full-precision exact vectors retained
for rescoring. turbohybrid_exact_storage_off uses the same hybrid index shape
with tq_exact_storage = off, so the index stores quantized codes only for the
dense path. The runner copies benchmark queries into Postgres and measures
latency inside a server-side loop with clock_timestamp() around each retrieval
query. Use --max-docs and --max-queries for smoke runs, but publishable
numbers should use the full FIQA split.
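Reciprocal rank fusion, which the postgres_sql_rrf baseline performs in SQL, can be sketched in a few lines of Python. The constant `k=60` is the value commonly used in the literature and is an assumption here; the constant used by the benchmark SQL is not stated in this document.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60,
             final_k: int = 10) -> List[Tuple[str, float]]:
    """Fuse ranked docid lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by docid for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:final_k]


dense_top = ["d1", "d2", "d3"]   # e.g. the --dense-k candidates
bm25_top = ["d3", "d1", "d4"]    # e.g. the --bm25-k candidates
fused = rrf_fuse([dense_top, bm25_top], final_k=10)
print(fused)
```

Documents that appear in both candidate lists accumulate two terms, which is why `d1` and `d3` outrank documents found by only one retriever in this example.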
Run the SIMD hot-path profile:
python3 benchmarks/turbohybrid/suite.py run-simd-profile \
--database contrib_regression \
--rows 100000 \
--dimensions 768 \
--runs 50 \
--warmup-runs 10 \
--dense-k 100 \
--bm25-k 100 \
--final-k 20 \
--output /tmp/turbohybrid_simd_100k_768.json
The SIMD profile rebuilds the synthetic TurboHybrid index, runs dense-only,
BM25-only, hybrid, no-lexical-match, and delta-heavy query shapes, and records
hybrid_last_scan_stats(), tq_last_scan_stats(), tq_last_simd_stats(),
tq_simd_capabilities(), EXPLAIN (ANALYZE, BUFFERS), p50/p95/p99, build
time, WAL, and index size. Use --set to compare forced scalar versus auto
dispatch:
python3 benchmarks/turbohybrid/suite.py run-simd-profile \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--runs 30 \
--set "hnsw.tq_simd_force=scalar" \
--set "hybrid.bm25_simd_force=scalar" \
--output /tmp/turbohybrid_simd_scalar.json
For engine timing, prefer the in-backend profiler. It keeps one PostgreSQL
backend session open, runs each query shape in a PL/pgSQL loop, and reports
microsecond latency from clock_timestamp() separately from optional legacy
psql subprocess timing:
python3 benchmarks/turbohybrid/suite.py profile-inbackend \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--k 100 \
--final-k 10 \
--runs 1000 \
--warmup-runs 10 \
--cli-runs 10 \
--output /tmp/turbohybrid_profile_inbackend.json
On Linux hosts, --perf-command can wrap each measured psql query, for example:
python3 benchmarks/turbohybrid/suite.py run-simd-profile \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--runs 20 \
--perf-command "perf stat -e cycles,instructions,cache-misses,branches,branch-misses"
Focused SIMD microbenchmarks are available when the full profile is too broad:
python3 benchmarks/turbohybrid/suite.py bench-bm25-decode \
--database contrib_regression \
--rows 1000000 \
--dimensions 768 \
--encoding auto \
--runs 30 \
--output /tmp/bm25_decode_offset16.json
python3 benchmarks/turbohybrid/suite.py bench-bm25-score \
--database contrib_regression \
--rows 1000000 \
--dimensions 768 \
--query-shape common-term \
--bm25-k 100 \
--precompute-tf-norm on \
--set "hybrid.bm25_simd_force=auto" \
--output /tmp/bm25_score_precomputed.json
python3 benchmarks/turbohybrid/suite.py bench-exact-rescore \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--metric cosine \
--rescore-count 400 \
--set "hnsw.tq_exact_simd_force=auto" \
--output /tmp/exact_rescore_auto.json
python3 benchmarks/turbohybrid/suite.py bench-weighted-tq \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--tq-bits 4 \
--weighted on \
--set "hnsw.tq_graph_avx512_weighted=off" \
--output /tmp/weighted_tq_avx2.json
python3 benchmarks/turbohybrid/suite.py bench-weighted-tq \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--tq-bits 4 \
--weighted on \
--set "hnsw.tq_graph_avx512_weighted=on" \
--output /tmp/weighted_tq_avx512.json
python3 benchmarks/turbohybrid/suite.py bench-dense-batch \
--database contrib_regression \
--rows 100000 \
--dimensions 1536 \
--dense-k 400 \
--set "hnsw.tq_graph_batch_scoring=on" \
--output /tmp/dense_batch_on.json
bm25_precompute_tf_norm is off by default. When enabled, suitable BM25 chunks
use the offset16 layout plus Q16 precomputed term-frequency normalization, which
lets the AVX2 and NEON score kernels avoid per-posting document-length gathers.
Delta documents still use scalar BM25 scoring until they are compacted into the
base postings.
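To illustrate why precomputing the term-frequency normalization removes the document-length lookup from the scoring loop, here is a sketch using the textbook BM25 formula. Treating "Q16" as a 16-bit unsigned code scaled over the range [0, k1 + 1) is an assumption made for this example; the index's actual encoding, parameters, and function names may differ.

```python
K1 = 1.2          # assumed BM25 parameters, not taken from the extension
B = 0.75
Q16_MAX = 65535   # assumed: largest 16-bit unsigned code


def bm25_tf_norm(tf: int, doc_len: int, avg_doc_len: float) -> float:
    """Textbook BM25 saturation term; always below K1 + 1."""
    length_norm = 1.0 - B + B * (doc_len / avg_doc_len)
    return tf * (K1 + 1.0) / (tf + K1 * length_norm)


def quantize_q16(tf_norm: float) -> int:
    """Build time: map tf_norm from [0, K1 + 1) to a 16-bit code."""
    return int(round(tf_norm / (K1 + 1.0) * Q16_MAX))


def score_posting(idf: float, code: int) -> float:
    """Query time: needs only the stored code, not the document length."""
    return idf * (code / Q16_MAX * (K1 + 1.0))


exact = bm25_tf_norm(tf=3, doc_len=100, avg_doc_len=100.0)
code = quantize_q16(exact)
approx = score_posting(idf=1.0, code=code)
print(exact, code, approx)
```

In this scheme the rounding error per posting is at most half a quantization step, that is (K1 + 1) / 65535 / 2, which is small relative to typical score gaps.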
Run a scalar-vs-auto SIMD matrix and emit both machine-readable and Markdown reports:
python3 benchmarks/turbohybrid/suite.py run-simd-matrix \
--database contrib_regression \
--rows 10000,100000 \
--dimensions 768,1536 \
--candidate-budgets 50,100,400 \
--runs 30 \
--warmup-runs 10 \
--output-json /tmp/turbohybrid_simd_matrix.json \
--output-md /tmp/turbohybrid_simd_matrix.md
Build modes are controlled through SIMD_BUILD:
make clean && make SIMD_BUILD=portable
make clean && make SIMD_BUILD=native
make clean && make SIMD_BUILD=none
portable is the default and avoids global -march=native; architecture
kernels must use target attributes plus runtime dispatch. native is for local
benchmarking only. none defines TQ_DISABLE_SIMD=1 and keeps scalar fallback
paths available for parity testing.
Run the development acceptance gate and write JSON plus Markdown reports:
python3 benchmarks/turbohybrid/suite.py acceptance \
--database contrib_regression \
--output-dir benchmarks/turbohybrid/results
The default acceptance profile is dev: 10k rows, 16 dimensions, one hot run,
one concurrency, all in-process methods, all query/mutation shapes, and fast
validation. This is the intended branch-level loop after installcheck and the
targeted SQL stress scripts. fast validation keeps correctness probes small:
it checks deterministic hybrid top-k and one WAND-vs-DAAT rare/intersection
query instead of replaying the expensive high-df validation set.
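A WAND-vs-DAAT probe amounts to checking that the pruned traversal returns the same top-k as the exhaustive one. The comparison could look like the sketch below; the function name and the score tolerance are assumptions, and the real check may apply different tie handling.

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (docid, score)


def same_topk(wand_hits: List[Hit], daat_hits: List[Hit], tol: float = 1e-6) -> bool:
    """True if both result lists have the same docids in the same order,
    with scores equal up to a small floating-point tolerance."""
    if len(wand_hits) != len(daat_hits):
        return False
    for (wand_doc, wand_score), (daat_doc, daat_score) in zip(wand_hits, daat_hits):
        if wand_doc != daat_doc or abs(wand_score - daat_score) > tol:
            return False
    return True


daat = [("d4", 7.25), ("d9", 6.5)]
wand = [("d4", 7.2500001), ("d9", 6.5)]
print(same_topk(wand, daat))
```

A strict positional comparison like this one reports a mismatch when two documents with equal scores swap places, so a real probe needs a deterministic tie-break on both sides.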
Use smoke during tight C edit loops:
python3 benchmarks/turbohybrid/suite.py acceptance \
--profile smoke \
--database contrib_regression \
--output-dir /tmp/turbohybrid_smoke
smoke runs only TurboHybrid on 1k rows with the smallest shape set that still
checks dense, BM25, hybrid, delta insert, compaction behavior, and the fast
validation probes.
Use full only before publishing numbers or merging a performance-sensitive
change:
python3 benchmarks/turbohybrid/suite.py acceptance \
--profile full \
--database contrib_regression \
--output-dir benchmarks/turbohybrid/results
full runs 10k and 100k rows, 64 dimensions, 30 measured runs, warmup, and
concurrency 1,4,16. It intentionally takes much longer because it builds and
mutates TurboHybrid, TurboQuant dense, and Postgres SQL RRF baselines.
It also uses standard validation, which includes the high-df WAND-vs-DAAT
checks that are too expensive for the default edit loop.
Latency thresholds are enforced for full runs and for explicit full-equivalent
runs with at least 100k rows, 30 measured runs, and concurrency 1,4,16.
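The enforcement rule can be read as a simple predicate. The sketch below is one interpretation, assuming "at least" applies to both rows and runs and that the concurrency list must include 1, 4, and 16; the function name is hypothetical.

```python
from typing import Iterable

REQUIRED_CONCURRENCY = {1, 4, 16}


def latency_thresholds_enforced(profile: str, max_rows: int, runs: int,
                                concurrency: Iterable[int]) -> bool:
    """Return True when latency thresholds should gate the run."""
    if profile == "full":
        return True
    # Explicit full-equivalent settings under another profile.
    return (max_rows >= 100_000
            and runs >= 30
            and REQUIRED_CONCURRENCY.issubset(set(concurrency)))


print(latency_thresholds_enforced("dev", 10_000, 1, [1]))
print(latency_thresholds_enforced("dev", 100_000, 30, [1, 4, 16]))
```

Under this reading, smaller runs still report latency numbers but do not fail on them.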
Override validation independently when needed:
python3 benchmarks/turbohybrid/suite.py acceptance \
--database contrib_regression \
--validation none \
--output-dir /tmp/turbohybrid_timing_only
Use --validation none for timing-only experiments after a recent correctness
gate. Use --validation standard before relying on the results for review, and
reserve the full profile for publishable numbers.
Default acceptance thresholds live in
benchmarks/turbohybrid/config/acceptance_thresholds.json. Override them with
--thresholds path/to/thresholds.json when running on a different host class.
The generated JSON and Markdown include the commit SHA, PostgreSQL version,
host/RAM information, method DDL, measured GUCs, build/storage/WAL numbers,
latency percentiles, scan stats, mutation/compaction counters, and threshold
warnings.
To produce 10k-row and 100k-row baseline reports with custom run settings, use:
python3 benchmarks/turbohybrid/suite.py engine-speed \
--database contrib_regression \
--profile full \
--rows-list 10000,100000 \
--dimensions 1536 \
--k 100 \
--final-k 100 \
--runs 30 \
--output-dir benchmarks/turbohybrid/results
engine-speed writes engine_speed_<commit>_<timestamp>.json and .md
reports. Its primary latency numbers are measured inside one PostgreSQL backend
with clock_timestamp() loops, while optional --cli-runs records legacy
psql subprocess latency separately. The report includes BM25 strategy/impact
diagnostics, dense graph timer breakdowns, and build/WAL/storage tables.
Run the explicit top-N fusion smoke test for the prompt-07 target case
(dense_k=5000, bm25_k=5000, final_k=20):
psql -X -v ON_ERROR_STOP=1 \
-d contrib_regression \
-f test/bench/turbohybrid_fusion_topn.sql
Score a TREC run:
python3 benchmarks/turbohybrid/suite.py score-run \
--qrels benchmarks/turbohybrid/examples/qrels.trec \
--run benchmarks/turbohybrid/examples/run.trec
Every published benchmark should include:
- commit SHA and extension SQL version
- CPU, RAM, storage, OS, compiler, Postgres version
- Postgres settings: shared_buffers, work_mem, maintenance_work_mem, WAL/checkpoint settings, parallelism settings
- dataset name, split, corpus size, query count, document text field, language
- embedding model and dimensionality
- method DDL and all GUCs
- build time, index size, WAL bytes
- p50/p95/p99, hot/cold methodology, QPS and concurrency
- quality metrics with qrels
- filter selectivity cases
- insert/delete/vacuum measurements
- The synthetic systems runner is not a replacement for BEIR or TREC-DL relevance evaluation. It is a fast branch-level performance check.
- The in-index BM25 storage spills build runs and chunks large postings lists; publishable high-df runs should still include a dedicated common-term stress case because compression and WAND skipping are sensitive to corpus order.
- Delta storage currently has an explicit page-size guard for oversized single-row delta tuples. It fails before append with a clear error instead of chunking one huge inserted document across multiple delta tuples.
- True cold-cache timing requires an external restart/drop-cache command.
- External baselines are normalized by importing their run files or JSON summaries; this repository does not vendor Pyserini, Lucene, ParadeDB, Elastic, OpenSearch, Qdrant, or Weaviate.
- RAG answer metrics need a project-specific evaluator because generation model, prompt, and citation policy affect the result.