-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Description
Background
Lucene improvements introduced skiplist for DocValues, prominently improving the performance of queries that use SortedNumericDocValuesField.newSlowExactQuery, especially for low-selectivity (rare match) cases. Analysis on Lucene show that enabling the skiplist can make these queries up to 60x faster by skipping over large blocks of irrelevant documents.
How the Skiplist Works in Lucene
A skiplist is a sparse index that sits on top of DocValues to enable efficient document skipping. It enhances the existing DocValues infrastructure by adding block-level statistics that enable efficient skipping.
How It Works
Multi-Level Structure
Level 0: Every 4,096 documents, collect stats (min/max values, docID ranges)
Level 1: Every 8 Level 0 intervals, aggregate stats from underlying blocks
Level 2: Every 8 Level 1 intervals, aggregate stats from Level 1 blocks
Level 3: Every 8 Level 2 intervals, aggregate stats from Level 2 blocks
Skipping Mechanism
Start at highest level and check if target value could be in current block
If block range excludes target, skip entire block and all sub-blocks
If block might contain target, descend to next level and repeat
At lowest level, either process the block or skip based on final stats
Current State in OpenSearch
In OpenSearch, the following classes utilize SortedNumericDocValuesField.newSlowExactQuery:
BooleanFieldMapper.BooleanFieldType#termQueryNumberFieldMapper.NumberFieldType#termQuery(and related numeric types)
Integration in OpenSearch
In the OpenSearch field mappers (BooleanFieldMapper, NumberFieldMapper), the changes involve:
Before (No Skiplist)
// Old approach: basic DocValues field without skiplist
context.doc().add(new SortedNumericDocValuesField(fieldType().name(), value));After (With Skiplist)
// New approach: uses indexedField method for skiplist support
context.doc().add(SortedNumericDocValuesField.indexedField(fieldType().name(), value));The SortedNumericDocValuesField.indexedField() method is a Lucene API that creates a DocValues field with skiplist support enabled.
indexedField()ensures the field is created with skiplist metadata during indexingnewSlowExactQuery()can then leverage this metadata for efficient query execution- Together, they enable the full skiplist optimization pipeline
What Was Changed
I have added logic to these mappers to utilize the skiplist-enabled indexed field methods, so that when skiplist is available in Lucene, OpenSearch can take advantage of the performance boost.
Performance Chain
- Indexing:
indexedField()creates fields with skiplist metadata - Query Planning:
newSlowExactQuery()detects skiplist availability - Query Execution: Uses
DocValuesSkipperto efficiently skip irrelevant blocks - Result: Improved performance for selective queries
Why This Matters
- Performance:
- For low-selectivity queries (rare values), the skiplist allows skipping most blocks, reducing query time from linear to sub-linear.
- Analysis done on Lucene show up to 60x speedup for
newSlowExactQuerywith skiplist enabled.
References
- [Lucene PR: Add levels to DocValues skipper index](Add levels to DocValues skipper index apache/lucene#13563
- Use lucene sparse index in opensearch
Example Table (from 17710)
| Selectability | Cardinality | newSlowExactQuery (SkipList=Disabled) | newSlowExactQuery (SkipList=Enabled) | Speedup |
|---|---|---|---|---|
| 2 | 10,000,000 | 197.67 | 3.33 | 59x |
| 10 | 10,000,000 | 177.67 | 8.67 | 20x |
| ... | ... | ... | ... | ... |
Summary
Enabling skiplist for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch provides a major performance boost for DocValues-based queries, especially for rare values. This change leverages recent Lucene improvements.
Describe the solution you'd like
Review and merge the changes that enable skiplist usage in the relevant field mappers.
Related component
Indexing
Describe alternatives you've considered
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status