Commit 3e4888b

authored
feat: update lumina lib to v0.2.2 (#287)
1 parent 79917cc commit 3e4888b

7 files changed

Lines changed: 220 additions & 12 deletions

File tree

third_party/lumina/VERSION

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,2 @@
1-
tag: v0.2.1
2-
c88ce90ed44b7037e3a307a36627cbd030e5eb60
1+
tag: v0.2.2
2+
89ea85ed8ef350455a15e1a2519271007ae15807
Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
11
version https://git-lfs.github.com/spec/v1
2-
oid sha256:0ba850eecf0589f4defa8f34412d5a91253c622600a95ece970e5cc125ee0585
3-
size 70691192
2+
oid sha256:e3d705af0cbeeed52f61875a450649dbe651ded22c534636e1e15deb8d5cc97c
3+
size 70689096

third_party/lumina/reference/DiskANNParameters.md

Lines changed: 13 additions & 7 deletions
@@ -2,9 +2,9 @@
22

33
## Overview
44

5-
This document describes DiskANN configuration and tuning in Lumina. It focuses on builder parameters in
6-
`BuilderOptions`, searcher parameters in `SearcherOptions`, and per-query parameters in `SearchOptions`. The content
7-
is aligned with the current release's implementation , including parameter meanings and tuning guidance.
5+
This guide explains how to configure and tune DiskANN in Lumina. It covers builder parameters in `BuilderOptions`,
6+
searcher parameters in `SearcherOptions`, and per-query parameters in `SearchOptions`. The content matches the current
7+
release implementation, including parameter meanings and tuning guidance.
88

99
This is an implementation-oriented usage guide, not a separate long-term compatibility contract.
1010

@@ -102,9 +102,13 @@ These keys belong to `api::SearchOptions`.
102102
| `search.parallel_number` | `FieldType::kInt` | Query parallelism. Valid range is `1..1000`; values larger than the index node count are clipped to the node count. |
103103
| `search.thread_safe_filter` | `FieldType::kBool` | Only meaningful for filtered search. If `parallel_number > 1`, filtered search requires it to be `true`. |
104104
| `diskann.search.list_size` | `FieldType::kInt` | Required. The effective value is raised to at least `topk` and capped by the index node count. |
105-
| `diskann.search.io_limit` | `FieldType::kInt` | Upper bound on the number of IO operations allowed for a single query. In normal mode, it is roughly "how many nodes may be read"; with sector-aligned read, it is closer to "how many sectors may be read". The effective value is automatically aligned with `topk` and `list_size`: at least `topk`, at most `list_size`. |
105+
| `diskann.search.io_limit` | `FieldType::kInt` | Upper bound on the number of IO operations allowed for a single query. In normal mode, it is roughly "how many nodes may be read"; with sector-aligned read, it is closer to "how many sectors may be read". Defaults to the maximum of `topk` and `list_size`. When explicitly configured, the effective value is at least `topk` and never exceeds the index node count. |
106106
| `diskann.search.beam_width` | `FieldType::kInt` | Kept in schema for compatibility, but the current DiskANN backend does not read it. |
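
As a reading aid, the clamping rules described in the table above can be sketched in Python. This is not Lumina's implementation: the helper name `effective_search_params` and its arguments are hypothetical, and the sketch assumes that the default `io_limit` is derived from the already-clamped `list_size` and that an out-of-range `parallel_number` is rejected rather than clamped.

```python
from typing import Optional, Tuple


def effective_search_params(
    topk: int,
    list_size: int,
    node_count: int,
    io_limit: Optional[int] = None,
    parallel_number: int = 1,
) -> Tuple[int, int, int]:
    """Return (list_size, io_limit, parallel_number) after applying the documented rules.

    Hypothetical helper for illustration only; not part of the Lumina API.
    """
    if not 1 <= parallel_number <= 1000:
        raise ValueError("search.parallel_number must be in 1..1000")

    # list_size: raised to at least topk, capped by the index node count.
    eff_list_size = min(max(list_size, topk), node_count)

    if io_limit is None:
        # Default: the maximum of topk and list_size (assumption: the clamped list_size).
        eff_io_limit = max(topk, eff_list_size)
    else:
        # Explicit value: at least topk, never above the index node count.
        eff_io_limit = min(max(io_limit, topk), node_count)

    # parallel_number: values larger than the node count are clipped to the node count.
    eff_parallel = min(parallel_number, node_count)
    return eff_list_size, eff_io_limit, eff_parallel
```

For example, with `topk=10`, `list_size=100`, and an explicit `io_limit=3` on a 1000-node index, the sketch raises the effective `io_limit` to 10.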
107107

108+
Implementation note: the public validation path of `LuminaSearcher::Search()` only validates the generic `search.*`
109+
schema. DiskANN-specific query parameters are mainly checked inside the DiskANN backend. Therefore, if you build
110+
options from a string map, prefer normalized entry points such as [NormalizeSearchOptions](../api/Options.md).
111+
108112
## Tuning Notes
109113

110114
- Graph-build parameters should be considered together: `diskann.build.ef_construction` controls the candidate pool
@@ -126,8 +130,10 @@ These keys belong to `api::SearchOptions`.
126130
`diskann.build.slack_pruning_factor` only when a more precise trade-off is needed.
127131
- For an already-built graph, if you want higher recall, increase `diskann.search.list_size` first. If you want
128132
lower query latency, increase `search.parallel_number` first; `2-4` is a common range.
129-
- If you want to limit per-query disk reads, tune `diskann.search.io_limit`. In practice, start from a value close to
130-
`list_size`, then gradually lower it based on the latency/recall trade-off.
133+
- To limit per-query disk reads, tune `diskann.search.io_limit`. The default equals `max(topk, list_size)`, which
134+
is usually a reasonable starting point. Lowering it below `list_size` reduces IO at the potential cost of recall;
135+
raising it above `list_size` allows more IO and may improve recall at the cost of higher latency. Adjust
136+
incrementally based on the latency/recall trade-off for your workload.
131137
- For workloads with many repeated queries, try `diskann.search.num_nodes_to_cache` first, and then decide whether to
132138
increase query parallelism further.
133139
- For disk-locality tuning, tune `diskann.build.reorder_layout` and `diskann.search.sector_aligned_read` together.
@@ -141,4 +147,4 @@ These keys belong to `api::SearchOptions`.
141147

142148
## Status
143149

144-
v0.2.1 Release Tag (2026-04-07).
150+
v0.2.2 Release Tag (2026-05-14).
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
# Limitations
2+
3+
Use this page to check the known limitations and constraints of the current Lumina release before choosing a backend or
4+
upgrading persisted indexes.
5+
6+
## 1. Feature limitations
7+
8+
### Backend support
9+
10+
- **IVF distance metric**: the IVF backend currently supports `L2` only. `Cosine` and `InnerProduct` are under
11+
development.
12+
- **DiskANN dynamic updates**: DiskANN currently supports offline build and static search only (no incremental
13+
insert/delete).
14+
- **Bruteforce scale**: the Bruteforce backend is not optimized for extremely large datasets; use it mainly as a
15+
baseline or for smaller scales.
16+
17+
### Data model
18+
19+
- **Fixed dimension**: vector dimension must stay consistent between build and search; dynamic dimension is not
20+
supported.
21+
- **ID type**: `vector_id_t` is `uint64_t`.
22+
23+
## 2. Performance & resources
24+
25+
### Memory usage
26+
27+
- **Sampling in `PretrainFrom`**: training may sample vectors via `index.pretrain_sample_ratio`; high ratios can
28+
increase memory pressure.
29+
- **Streaming ingestion**: `InsertFrom(Dataset&)` reads batch-by-batch, but backends (Bruteforce/IVF) still allocate
30+
memory based on algorithm needs.
31+
32+
### Concurrency
33+
34+
- **Builder is not thread-safe**: `LuminaBuilder` must be used from a single thread or externally synchronized.
35+
- **Global executor**: internal thread pool size is controlled globally via `LUMINA_EXECUTOR_THREAD_COUNT`.
36+
37+
## 3. IO & persistence
38+
39+
### File format compatibility
40+
41+
- **Format versioning**: Stable Lumina persisted artifacts (`.lmi`) are versioned by major version, that is, the
42+
first segment of a semantic version (for example, moving from `1.x.y` to `2.x.y`). Major-version upgrades may break binary compatibility
43+
and require rebuilding indexes. For stable (non-experimental) index formats, minor and patch upgrades within the same
44+
major version remain compatible and do not require rebuilds solely because of the version upgrade.
45+
The IVF snapshot layout is experimental.
46+
- **CRC verification cost**: enabling section CRC verification (`io.verify_crc=true`) adds roughly 1–3% overhead (file
47+
header/footer CRC is always verified).
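
The major-version rule for stable formats can be expressed as a small check. The sketch below is illustrative only: `parse_major` and `may_require_rebuild` are hypothetical names, not Lumina functions, and the sketch assumes version strings follow the `MAJOR.MINOR.PATCH` shape with an optional leading `v`.

```python
def parse_major(version: str) -> int:
    """Return the major component of a semantic version such as 'v1.4.2' or '1.4.2'."""
    return int(version.lstrip("v").split(".")[0])


def may_require_rebuild(index_version: str, library_version: str, stable_format: bool = True) -> bool:
    """True if an index written by index_version may need rebuilding under library_version.

    Hypothetical helper for illustration only; not part of the Lumina API.
    """
    if not stable_format:
        # Experimental layouts (for example the IVF snapshot) carry no compatibility promise.
        return True
    # Stable formats: only a change of major version may break binary compatibility.
    return parse_major(index_version) != parse_major(library_version)
```

Under these assumptions, an index written by `1.2.0` and opened by `1.9.3` is not flagged, while the same index opened by `2.0.0` is.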
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
# Overview
2+
3+
Lumina is a C++ library for high-performance vector search and persisted indexes. It provides production-oriented
4+
backends (DiskANN / IVF / Bruteforce), a narrow API surface, and extension points for advanced workflows.
5+
6+
In addition to the core C++ API, Lumina also provides an experimental Python interface covering index building, search, and other basic workflows.
7+
8+
## Why Lumina?
9+
10+
Lumina is designed as a production-grade search infrastructure component. Every design decision —
11+
from the API surface to the index format — is made with long-term maintainability and operational reliability in mind.
12+
13+
1. **Mature, deliberate API design**
14+
A minimal interface with type-safe, exception-safe error handling. All backends share a unified configuration
15+
system — switching backends is a configuration change, not a code rewrite.
16+
17+
2. **Index format you can trust**
18+
Persisted indexes follow a versioned format with built-in integrity checks. Upgrades within a compatible
19+
version range do not require an index rebuild. The same format works across local storage, memory-mapped
20+
files, and distributed file systems.
21+
22+
3. **Keeps pace with research**
23+
Core algorithms incorporate results from recent literature (RaBitQ quantization, graph pruning
24+
heuristics, locality-aware disk reordering) and ship as production features, not perpetual experiments.
25+
4. **Deep C++ engineering foundation**
26+
Resource ownership is explicit and predictable. Memory allocation is tiered and controllable — critical
27+
for multi-tenant deployments. The codebase is built on modern C++ standards with strict engineering
28+
governance: mandatory code review, pre-commit validation, versioned release trains, and a compatibility
29+
policy that distinguishes stable from experimental surfaces. Every release is a deliberate, tested artifact.
30+
31+
5. **Pluggable IO for any storage topology**
32+
The IO layer accepts user-supplied readers and writers, decoupling index logic from storage. The same
33+
index binary can be served from local SSD, object storage, or a distributed file system without changes
34+
to the core library — enabling storage-compute separation and cloud-native deployments out of the box.
35+
6. **Typed extension framework**
36+
Vector search in production demands capabilities beyond pure ANN — filtering, checkpointing, distributed
37+
builds — yet bundling them all into the core API would bloat the interface and couple unrelated concerns.
38+
Lumina addresses this with a typed extension layer: each capability attaches to a Builder or Searcher instance
39+
through a contract that specifies lifecycle ownership, thread-safety semantics, and supported backends.
40+
Incompatible combinations are rejected at attach time with a clear error, not discovered at query time.
41+
42+
| Extension | Status |
43+
|-----------|--------|
44+
| Attribute-based filtered search | stable |
45+
| Build checkpointing | experimental |
46+
| Range & discrete-label filtering | planned |
47+
| Distributed build coordination | planned |
48+
49+
## Backends at a glance
50+
51+
### DiskANN
52+
53+
**Scale**: billions of vectors. **Memory**: sub-linear — graph metadata, quantized codes, and a configurable hot-node cache reside in RAM; full-precision or higher-precision quantized vectors stay on disk.
54+
55+
DiskANN builds a Vamana proximity graph offline, then serves queries through a coroutine-based parallel beam search that issues batched, sector-aligned disk reads without blocking threads on I/O. Key engineering choices:
56+
57+
- **Layout optimization** — After graph construction, a locality-aware reordering pass (BNP/BNF) places neighboring nodes into the same disk sector, reducing random I/O during search.
58+
- **Two-tier caching** — A static cache (BFS-loaded entry-region nodes) absorbs the first hops; a dynamic LRU cache adapts to workload skew at runtime.
59+
- **Build-time checkpointing** — Long builds can resume from a saved checkpoint after interruption, avoiding full restarts on billion-scale datasets.
60+
- **Quantization** — Both in-memory and on-disk vectors support SQ8, PQ, and RaBitQ encoding. The disk encoding can differ from the in-memory one, trading a small recall margin for significantly smaller index files.
61+
- **Tag-aware graph construction** (in progress) — Filtered search with label dimensions is under active development.
62+
63+
64+
### IVF
65+
66+
**Scale**: millions to tens of millions of vectors. **Memory**: moderate — centroids and quantized codes reside in RAM.
67+
68+
IVF partitions the vector space into inverted lists via k-means clustering, then searches by probing the lists whose centroids are nearest to the query. It supports SQ8, PQ, and RaBitQ quantization to control the memory-accuracy trade-off. It currently supports L2 distance only; Cosine and InnerProduct are under development. The on-disk snapshot layout is experimental and may change across versions.
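
The general IVF idea (assign vectors to their nearest centroid, then scan only a few lists per query) can be illustrated with a minimal pure-Python sketch. It is not Lumina's IVF backend: the function names and the `nprobe` parameter are assumptions made for this example, the centroids are taken as given instead of being trained with k-means, and no quantization is applied.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def build_inverted_lists(vectors: List[Vector], centroids: List[Vector]) -> Dict[int, List[int]]:
    """Assign each vector id to the inverted list of its nearest centroid (L2 distance)."""
    lists: Dict[int, List[int]] = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(
    query: Vector,
    vectors: List[Vector],
    centroids: List[Vector],
    lists: Dict[int, List[int]],
    topk: int,
    nprobe: int,
) -> List[Tuple[float, int]]:
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probed = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [(math.dist(query, vectors[vid]), vid) for c in probed for vid in lists[c]]
    # Sort by distance (ties broken by id) and keep the best topk candidates.
    return sorted(candidates)[:topk]
```

Probing more lists scans more candidates and generally raises recall at the cost of latency; probing every list degenerates into a full linear scan.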
69+
70+
### Bruteforce
71+
72+
**Scale**: thousands to low millions of vectors. **Memory**: full dataset in RAM.
73+
74+
Bruteforce computes exact distances against every vector — no approximation, no index structure. Use it as a recall-rate baseline for benchmarking other backends, or in production when the dataset is small enough that linear scan meets latency requirements.
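
To show what using an exact scan as a recall baseline means, here is a minimal pure-Python sketch. The names `exact_topk` and `recall_at_k` are hypothetical and unrelated to the Lumina API; the sketch assumes L2 distance and small in-memory data.

```python
import math
from typing import List, Sequence


def exact_topk(query: Sequence[float], vectors: List[Sequence[float]], topk: int) -> List[int]:
    """Exact nearest neighbours by linear scan over every vector (L2 distance)."""
    scored = sorted((math.dist(query, vec), vid) for vid, vec in enumerate(vectors))
    return [vid for _, vid in scored[:topk]]


def recall_at_k(approx_ids: List[int], exact_ids: List[int]) -> float:
    """Fraction of the exact top-k ids that the approximate result also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

The ids returned by an approximate backend for the same query would be passed as `approx_ids`, with the exact scan supplying `exact_ids`.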
75+
76+
## Use cases
77+
78+
- **Vector database backend** — power billion-scale similarity search behind a database or retrieval service.
79+
- **Recommendation systems** — real-time recall of similar items or users from high-dimensional embeddings.
80+
- **Image and video search** — fast matching over visual feature vectors.
81+
- **RAG** — give an LLM a high-performance knowledge-base retrieval layer.
82+
83+
## Core components
84+
85+
| Component | What it does |
86+
|-----------|-------------|
87+
| **API layer** | `LuminaBuilder`, `LuminaSearcher`, `Options`, `Query` — your main integration surface |
88+
| **Python facade** | Experimental `lumina` package wrapping Builder/Searcher, plus a filtered-search wrapper |
89+
| **Backends** | DiskANN, IVF, Bruteforce — the concrete index algorithms |
90+
| **Quantizer** | Vector compression and distance estimation: SQ8, PQ, RaBitQ |
91+
| **IO system** | Binary container format with section management and CRC verification |
92+
| **Telemetry** | Production logging and metrics hooks |
93+
| **Extensions** | Typed build-time and search-time extension points: filtered search, checkpointing. Explicit lifecycle and thread-safety contracts |
94+
95+
## Our Publications
96+
97+
Research behind Lumina has been published at top-tier database and systems venues:
98+
99+
- **[SIGMOD'26]** Zhiyuan Hua, Qiji Mo, Zebin Yao, Lixiao Cui, Xiaoguang Liu, Gang Wang, Zijing Wei, Xinyu Liu, Tianxiao Tang, Shaozhi Liu, Lin Qu. *Dynamically Detect and Fix Hardness for Efficient Approximate Nearest Neighbor Search.* ACM Conference on Management of Data, 2026. ([arXiv](https://arxiv.org/abs/2510.22316))
100+
- **[ICDE'26]** Qiji Mo, Zhiyuan Hua, Zebin Yao, Lixiao Cui, Xiaoguang Liu, Gang Wang, Zijing Wei, Xinyu Liu, Tianxiao Tang, Shaozhi Liu, Lin Qu. *Overcoming the Sync-Compute Dilemma in Parallel Graph-Based Vector Retrieval.* IEEE International Conference on Data Engineering, 2026.
101+
102+
## Next steps
103+
104+
- [Python quick start](../PythonQuickStart.md) — run the full build → dump → open → search flow in Python.
105+
- [DiskANN tuning guide](./DiskANNParameters.md) — graph build and search parameter tuning for DiskANN.
106+
- [Options reference](./OptionsReference.md) — complete list of configuration keys.

third_party/lumina/reference/QuantizationParameters.md

Lines changed: 1 addition & 1 deletion
@@ -112,4 +112,4 @@ Current behavior:
112112

113113
## Status
114114

115-
v0.2.1 Release Tag (2026-04-07).
115+
v0.2.2 Release Tag (2026-05-14).
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1+
# v0.2.2
2+
3+
## Overview
4+
5+
This is a patch release focused on **bug fixes** for the DiskANN backend and RaBitQ quantization. No new features are introduced.
6+
7+
## Audience
8+
9+
- Users integrating Lumina via the public C++ API or Python interface.
10+
- Users building indexes offline and serving online search.
11+
12+
## Status
13+
14+
Stable (2026-05-14).
15+
16+
## Compatibility
17+
18+
- **Public API**: stable within `include/lumina/api/**` following Lumina versioning policy.
19+
- **Source compatibility**: guaranteed within the compatible range (recompile required). ABI compatibility is not
20+
promised.
21+
- **On-disk formats**:
22+
- The `.lmi` container format is versioned (current version: `0`) and uses CRC32C for corruption detection.
23+
- Backend-specific layouts are excluded from long-term compatibility promises unless explicitly declared stable.
24+
- IVF snapshot layout is experimental.
25+
- **Extensions**:
26+
- `SearchWithFilterExtension` is stable and supported by `diskann` and `bruteforce` searchers (not supported by `ivf`).
27+
- Checkpoint extension is experimental, backend-specific, and excluded from long-term compatibility promises.
28+
- `GetVectorExtension` is experimental and supported only by the `bruteforce` searcher with `rawf32` encoding.
29+
- **Python**: experimental interface; the API may change across versions and is not covered by stability promises.
30+
31+
## Changes
32+
33+
- Bug Fixes:
34+
- RaBitQ: fix low query recall with 1-bit RaBitQ (#82050746).
35+
- DiskANNBackend: fix a case where the built graph may not be connected (#82043000).
36+
- DiskANNBackend: fix an integer division by zero (#81832916).
37+
- DiskANNBackend: fix `io_limit` setup; the effective `io_limit` is now no less than `topk` and no greater than the vector count (#81585231).
38+
39+
## Migration Notes
40+
41+
- No breaking changes. Drop-in replacement for v0.2.1.
42+
43+
## Known Issues
44+
45+
- IVF: only `L2` metric is currently supported.
46+
- DiskANN: dynamic updates (incremental insert/delete) are still not supported.
47+
- Builder: `LuminaBuilder` instances are still not thread-safe.
48+
- IO: `.lmi` and backend layouts are still evolving, so upgrades may require rebuilding indexes.
49+
- See [Limitations](../reference/Limitations.md) for the complete list.
