feat: update lumina lib to v0.2.2 (#287)

lxy-9602 · web-flow · commit 3e4888b5a92a · 2026-05-18T14:02:32.000+08:00
diff --git a/third_party/lumina/VERSION b/third_party/lumina/VERSION
@@ -1,2 +1,2 @@
-tag: v0.2.1
-c88ce90ed44b7037e3a307a36627cbd030e5eb60
+tag: v0.2.2
+89ea85ed8ef350455a15e1a2519271007ae15807
diff --git a/third_party/lumina/lib/liblumina.so b/third_party/lumina/lib/liblumina.so
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0ba850eecf0589f4defa8f34412d5a91253c622600a95ece970e5cc125ee0585
-size 70691192
+oid sha256:e3d705af0cbeeed52f61875a450649dbe651ded22c534636e1e15deb8d5cc97c
+size 70689096
diff --git a/third_party/lumina/reference/DiskANNParameters.md b/third_party/lumina/reference/DiskANNParameters.md
@@ -2,9 +2,9 @@
 
 ## Overview
 
-This document describes DiskANN configuration and tuning in Lumina. It focuses on builder parameters in
-`BuilderOptions`, searcher parameters in `SearcherOptions`, and per-query parameters in `SearchOptions`. The content
-is aligned with the current release's implementation , including parameter meanings and tuning guidance.
+This guide explains how to configure and tune DiskANN in Lumina. It covers builder parameters in `BuilderOptions`,
+searcher parameters in `SearcherOptions`, and per-query parameters in `SearchOptions`. The content matches the current
+release implementation, including parameter meanings and tuning guidance.
 
 This is an implementation-oriented usage guide, not a separate long-term compatibility contract.
 
@@ -102,9 +102,13 @@ These keys belong to `api::SearchOptions`.
 | `search.parallel_number` | `FieldType::kInt` | Query parallelism. Valid range is `1..1000`; values larger than the index node count are clipped to the node count. |
 | `search.thread_safe_filter` | `FieldType::kBool` | Only meaningful for filtered search. If `parallel_number > 1`, filtered search requires it to be `true`. |
 | `diskann.search.list_size` | `FieldType::kInt` | Required. The effective value is raised to at least `topk` and capped by the index node count. |
-| `diskann.search.io_limit` | `FieldType::kInt` | Upper bound on the number of IO operations allowed for a single query. In normal mode, it is roughly "how many nodes may be read"; with sector-aligned read, it is closer to "how many sectors may be read". The effective value is automatically aligned with `topk` and `list_size`: at least `topk`, at most `list_size`. |
+| `diskann.search.io_limit` | `FieldType::kInt` | Upper bound on the number of IO operations allowed for a single query. In normal mode, it is roughly "how many nodes may be read"; with sector-aligned read, it is closer to "how many sectors may be read". Defaults to the maximum of `topk` and `list_size`. When explicitly configured, the effective value is at least `topk` and never exceeds the index node count. |
 | `diskann.search.beam_width` | `FieldType::kInt` | Kept in schema for compatibility, but the current DiskANN backend does not read it. |
 
+Implementation note: the public validation path of `LuminaSearcher::Search()` only validates the generic `search.*`
+schema. DiskANN-specific query parameters are mainly checked inside the DiskANN backend. Therefore, if you build
+options from a string map, prefer normalized entry points such as [NormalizeSearchOptions](../api/Options.md).
+
 ## Tuning Notes
 
 - Graph-build parameters should be considered together: `diskann.build.ef_construction` controls the candidate pool
@@ -126,8 +130,10 @@ These keys belong to `api::SearchOptions`.
   `diskann.build.slack_pruning_factor` only when a more precise trade-off is needed.
 - For an already-built graph, if you want higher recall, increase `diskann.search.list_size` first. If you want
   lower query latency, increase `search.parallel_number` first; `2-4` is a common range.
-- If you want to limit per-query disk reads, tune `diskann.search.io_limit`. In practice, start from a value close to
-  `list_size`, then gradually lower it based on the latency/recall trade-off.
+- To limit per-query disk reads, tune `diskann.search.io_limit`. The default equals `max(topk, list_size)`, which
+  is usually a reasonable starting point. Lowering it below `list_size` reduces IO at the potential cost of recall;
+  raising it above `list_size` allows more IO and may improve recall at the cost of higher latency. Adjust
+  incrementally based on the latency/recall trade-off for your workload.
 - For workloads with many repeated queries, try `diskann.search.num_nodes_to_cache` first, and then decide whether to
   increase query parallelism further.
 - For disk-locality tuning, tune `diskann.build.reorder_layout` and `diskann.search.sector_aligned_read` together.
@@ -141,4 +147,4 @@ These keys belong to `api::SearchOptions`.
 
 ## Status
 
-v0.2.1 Release Tag (2026-04-07).
+v0.2.2 Release Tag (2026-05-14).
diff --git a/third_party/lumina/reference/Limitations.md b/third_party/lumina/reference/Limitations.md
@@ -0,0 +1,47 @@
+# Limitations
+
+Use this page to check the known limitations and constraints of the current Lumina release before choosing a backend or
+upgrading persisted indexes.
+
+## 1. Feature limitations
+
+### Backend support
+
+- **IVF distance metric**: the IVF backend currently supports `L2` only. `Cosine` and `InnerProduct` are under
+  development.
+- **DiskANN dynamic updates**: DiskANN currently supports offline build and static search only (no incremental
+  insert/delete).
+- **Bruteforce scale**: the Bruteforce backend is not optimized for extremely large datasets; use it mainly as a
+  baseline or for smaller scales.
+
+### Data model
+
+- **Fixed dimension**: vector dimension must stay consistent between build and search; dynamic dimension is not
+  supported.
+- **ID type**: `vector_id_t` is `uint64_t`.
+
+## 2. Performance & resources
+
+### Memory usage
+
+- **Sampling in `PretrainFrom`**: training may sample vectors via `index.pretrain_sample_ratio`; high ratios can
+  increase memory pressure.
+- **Streaming ingestion**: `InsertFrom(Dataset&)` reads batch-by-batch, but backends (Bruteforce/IVF) still allocate
+  memory based on algorithm needs.
+
+### Concurrency
+
+- **Builder is not thread-safe**: `LuminaBuilder` must be used from a single thread or externally synchronized.
+- **Global executor**: internal thread pool size is controlled globally via `LUMINA_EXECUTOR_THREAD_COUNT`.
+
+## 3. IO & persistence
+
+### File format compatibility
+
+- **Format versioning**: Stable Lumina persisted artifacts (`.lmi`) are versioned by the major version, that is, the
+  first segment of a semantic version (for example, the `1` in `1.x.y`). Major-version upgrades, such as `1.x.y` to
+  `2.x.y`, may break binary compatibility and require rebuilding indexes. For stable (non-experimental) index formats,
+  minor and patch upgrades within the same major version remain compatible and do not require a rebuild.
+  The IVF snapshot layout is experimental.
+- **CRC verification cost**: enabling section CRC verification (`io.verify_crc=true`) reduces performance by roughly
+  1–3%. The file header/footer CRC is always verified.
diff --git a/third_party/lumina/reference/Overview.md b/third_party/lumina/reference/Overview.md
@@ -0,0 +1,106 @@
+# Overview
+
+Lumina is a C++ library for high-performance vector search and persisted indexes. It provides production-oriented
+backends (DiskANN / IVF / Bruteforce), a narrow API surface, and extension points for advanced workflows.
+
+In addition to the core C++ API, Lumina also provides an experimental Python interface covering index building, search, and other basic workflows.
+
+## Why Lumina?
+
+Lumina is designed as a production-grade search infrastructure component. Every design decision —
+from the API surface to the index format — is made with long-term maintainability and operational reliability in mind.
+
+1. **Mature, deliberate API design**
+   A minimal interface with type-safe, exception-safe error handling. All backends share a unified configuration
+   system — switching backends is a configuration change, not a code rewrite.
+
+2. **Index format you can trust**
+   Persisted indexes follow a versioned format with built-in integrity checks. Upgrades within a compatible
+   version range do not require an index rebuild. The same format works across local storage, memory-mapped
+   files, and distributed file systems.
+
+3. **Keeps pace with research**
+   Core algorithms incorporate results from recent literature (RaBitQ quantization, graph pruning heuristics,
+   locality-aware disk reordering) and ship as production features, not perpetual experiments.
+4. **Deep C++ engineering foundation**
+   Resource ownership is explicit and predictable. Memory allocation is tiered and controllable — critical
+   for multi-tenant deployments. The codebase is built on modern C++ standards with strict engineering
+   governance: mandatory code review, pre-commit validation, versioned release trains, and a compatibility
+   policy that distinguishes stable from experimental surfaces. Every release is a deliberate, tested artifact.
+
+5. **Pluggable IO for any storage topology**
+   The IO layer accepts user-supplied readers and writers, decoupling index logic from storage. The same
+   index binary can be served from local SSD, object storage, or a distributed file system without changes
+   to the core library — enabling storage-compute separation and cloud-native deployments out of the box.
+6. **Typed extension framework**
+   Vector search in production demands capabilities beyond pure ANN — filtering, checkpointing, distributed
+   builds — yet bundling them all into the core API would bloat the interface and couple unrelated concerns.
+   Lumina addresses this with a typed extension layer: each capability attaches to a Builder or Searcher instance
+   through a contract that specifies lifecycle ownership, thread-safety semantics, and supported backends.
+   Incompatible combinations are rejected at attach time with a clear error, not discovered at query time.
+
+   | Extension | Status |
+   |-----------|--------|
+   | Attribute-based filtered search | stable |
+   | Build checkpointing | experimental |
+   | Range & discrete-label filtering | planned |
+   | Distributed build coordination | planned |
+
+## Backends at a glance
+
+### DiskANN
+
+**Scale**: billions of vectors. **Memory**: sub-linear — graph metadata, quantized codes, and a configurable hot-node cache reside in RAM; full-precision or higher-precision quantized vectors stay on disk.
+
+DiskANN builds a Vamana proximity graph offline, then serves queries through a coroutine-based parallel beam search that issues batched, sector-aligned disk reads without blocking threads on I/O. Key engineering choices:
+
+- **Layout optimization** — After graph construction, a locality-aware reordering pass (BNP/BNF) places neighboring nodes into the same disk sector, reducing random I/O during search.
+- **Two-tier caching** — A static cache (BFS-loaded entry-region nodes) absorbs the first hops; a dynamic LRU cache adapts to workload skew at runtime.
+- **Build-time checkpointing** — Long builds can resume from a saved checkpoint after interruption, avoiding full restarts on billion-scale datasets.
+- **Quantization** — Both in-memory and on-disk vectors support SQ8, PQ, and RaBitQ encoding. The disk encoding can differ from the in-memory one, trading a small recall margin for significantly smaller index files.
+- **Tag-aware graph construction** (in progress) — Filtered search with label dimensions is under active development.
+
+
+### IVF
+
+**Scale**: millions to tens of millions of vectors. **Memory**: moderate — centroids and quantized codes reside in RAM.
+
+IVF partitions the vector space into inverted lists via k-means clustering, then searches by probing the nearest lists. It supports SQ8, PQ, and RaBitQ quantization to control the memory-accuracy tradeoff. The IVF backend currently supports L2 distance only; Cosine and InnerProduct are under development. The on-disk snapshot layout is experimental and may change across versions.
+
+### Bruteforce
+
+**Scale**: thousands to low millions of vectors. **Memory**: full dataset in RAM.
+
+Bruteforce computes exact distances against every vector — no approximation, no index structure. Use it as a recall-rate baseline for benchmarking other backends, or in production when the dataset is small enough that linear scan meets latency requirements.
+
+## Use cases
+
+- **Vector database backend** — power billion-scale similarity search behind a database or retrieval service.
+- **Recommendation systems** — real-time recall of similar items or users from high-dimensional embeddings.
+- **Image and video search** — fast matching over visual feature vectors.
+- **RAG** — give an LLM a high-performance knowledge-base retrieval layer.
+
+## Core components
+
+| Component | What it does |
+|-----------|-------------|
+| **API layer** | `LuminaBuilder`, `LuminaSearcher`, `Options`, `Query` — your main integration surface |
+| **Python facade** | Experimental `lumina` package wrapping Builder/Searcher, plus a filtered-search wrapper |
+| **Backends** | DiskANN, IVF, Bruteforce — the concrete index algorithms |
+| **Quantizer** | Vector compression and distance estimation: SQ8, PQ, RaBitQ |
+| **IO system** | Binary container format with section management and CRC verification |
+| **Telemetry** | Production logging and metrics hooks |
+| **Extensions** | Typed build-time and search-time extension points: filtered search, checkpointing. Explicit lifecycle and thread-safety contracts |
+
+## Our Publications
+
+Research behind Lumina has been published at top-tier database and systems venues:
+
+- **[SIGMOD'26]** Zhiyuan Hua, Qiji Mo, Zebin Yao, Lixiao Cui, Xiaoguang Liu, Gang Wang, Zijing Wei, Xinyu Liu, Tianxiao Tang, Shaozhi Liu, Lin Qu. *Dynamically Detect and Fix Hardness for Efficient Approximate Nearest Neighbor Search.* ACM Conference on Management of Data, 2026. ([arXiv](https://arxiv.org/abs/2510.22316))
+- **[ICDE'26]** Qiji Mo, Zhiyuan Hua, Zebin Yao, Lixiao Cui, Xiaoguang Liu, Gang Wang, Zijing Wei, Xinyu Liu, Tianxiao Tang, Shaozhi Liu, Lin Qu. *Overcoming the Sync-Compute Dilemma in Parallel Graph-Based Vector Retrieval.* IEEE International Conference on Data Engineering, 2026.
+
+## Next steps
+
+- [Python quick start](../PythonQuickStart.md) — run the full build → dump → open → search flow in Python.
+- [DiskANN tuning guide](./DiskANNParameters.md) — graph build and search parameter tuning for DiskANN.
+- [Options reference](./OptionsReference.md) — complete list of configuration keys.
diff --git a/third_party/lumina/reference/QuantizationParameters.md b/third_party/lumina/reference/QuantizationParameters.md
@@ -112,4 +112,4 @@ Current behavior:
 
 ## Status
 
-v0.2.1 Release Tag (2026-04-07).
+v0.2.2 Release Tag (2026-05-14).
diff --git a/third_party/lumina/releases/v0.2.2.md b/third_party/lumina/releases/v0.2.2.md
@@ -0,0 +1,49 @@
+# v0.2.2
+
+## Overview
+
+This is a patch release focusing on **bug fixes** for the DiskANN backend and the RaBitQ quantizer. No new features are introduced.
+
+## Audience
+
+- Users integrating Lumina via the public C++ API or Python interface.
+- Users building indexes offline and serving online search.
+
+## Status
+
+Stable (2026-05-14).
+
+## Compatibility
+
+- **Public API**: stable within `include/lumina/api/**`, following the Lumina versioning policy.
+- **Source compatibility**: guaranteed within the compatible range (recompile required). ABI compatibility is not
+  promised.
+- **On-disk formats**:
+  - The `.lmi` container format is versioned (current version: `0`) and uses CRC32C for corruption detection.
+  - Backend-specific layouts are excluded from long-term compatibility promises unless explicitly declared stable.
+  - IVF snapshot layout is experimental.
+- **Extensions**:
+  - `SearchWithFilterExtension` is stable and supported by `diskann` and `bruteforce` searchers (not supported by `ivf`).
+  - Checkpoint extension is experimental, backend-specific, and excluded from long-term compatibility promises.
+  - `GetVectorExtension` is experimental and supported only by the `bruteforce` searcher with `rawf32` encoding.
+- **Python**: experimental interface; the API may change across versions and is not covered by stability promises.
+
+## Changes
+
+- Bug Fixes:
+  - RaBitQ: fix low query recall with 1-bit RaBitQ (#82050746).
+  - DiskANNBackend: fix graph build that could produce a disconnected graph (#82043000).
+  - DiskANNBackend: fix an integer division by zero (#81832916).
+  - DiskANNBackend: fix `io_limit` setup so that the effective value is no less than `topk` and no greater than the vector count (#81585231).
+
+## Migration Notes
+
+- No breaking changes. Drop-in replacement for v0.2.1.
+
+## Known Issues
+
+- IVF: only `L2` metric is currently supported.
+- DiskANN: dynamic updates (incremental insert/delete) are still not supported.
+- Builder: `LuminaBuilder` instances are still not thread-safe.
+- IO: `.lmi` and backend layouts are still evolving, so upgrades may require rebuilding indexes.
+- See [Limitations](../reference/Limitations.md) for the complete list.
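
Reviewer sketch (not part of the diff): the `diskann.search.list_size` and `diskann.search.io_limit` rules described in the `DiskANNParameters.md` hunk can be written out as plain arithmetic. This is a minimal illustration of the documented behavior only. The function names `effective_list_size` and `effective_io_limit` are hypothetical and are not Lumina APIs, and the sketch assumes the default `io_limit` is exactly `max(topk, list_size)` with no further capping, since the docs do not state otherwise.

```python
from typing import Optional


def effective_list_size(list_size: int, topk: int, node_count: int) -> int:
    """Hypothetical helper: raise list_size to at least topk, then cap it by the node count."""
    return min(max(list_size, topk), node_count)


def effective_io_limit(io_limit: Optional[int], list_size: int, topk: int, node_count: int) -> int:
    """Hypothetical helper mirroring the documented io_limit rules.

    io_limit=None stands for "not configured" (an assumption of this sketch).
    """
    if io_limit is None:
        # Documented default: the maximum of topk and list_size.
        return max(topk, list_size)
    # Explicit value: at least topk, never more than the index node count.
    return min(max(io_limit, topk), node_count)


if __name__ == "__main__":
    # Explicit io_limit below topk is raised to topk.
    print(effective_io_limit(3, list_size=100, topk=10, node_count=1000))
    # Explicit io_limit above list_size is allowed, up to the node count.
    print(effective_io_limit(200, list_size=100, topk=10, node_count=1000))
```

Under these assumptions, an explicit `io_limit` above `list_size` is no longer clipped to `list_size`, which matches the v0.2.2 documentation change and the #81585231 fix.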
