Skip to content

Enable Skiplist Optimization for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch #18751

@shreyah963

Description

@shreyah963

Background

Lucene improvements introduced skiplist for DocValues, prominently improving the performance of queries that use SortedNumericDocValuesField.newSlowExactQuery, especially for low-selectivity (rare match) cases. Analysis on Lucene show that enabling the skiplist can make these queries up to 60x faster by skipping over large blocks of irrelevant documents.

How the Skiplist Works in Lucene

A skiplist is a sparse index that sits on top of DocValues to enable efficient document skipping. It enhances the existing DocValues infrastructure by adding block-level statistics that enable efficient skipping.

How It Works

Multi-Level Structure
Level 0: Every 4,096 documents, collect stats (min/max values, docID ranges)
Level 1: Every 8 Level 0 intervals, aggregate stats from underlying blocks
Level 2: Every 8 Level 1 intervals, aggregate stats from Level 1 blocks
Level 3: Every 8 Level 2 intervals, aggregate stats from Level 2 blocks

Skipping Mechanism

Start at highest level and check if target value could be in current block
If block range excludes target, skip entire block and all sub-blocks
If block might contain target, descend to next level and repeat
At lowest level, either process the block or skip based on final stats

Current State in OpenSearch

In OpenSearch, the following classes utilize SortedNumericDocValuesField.newSlowExactQuery:

  • BooleanFieldMapper.BooleanFieldType#termQuery
  • NumberFieldMapper.NumberFieldType#termQuery (and related numeric types)

Integration in OpenSearch

In the OpenSearch field mappers (BooleanFieldMapper, NumberFieldMapper), the changes involve:

Before (No Skiplist)

// Old approach: basic DocValues field without skiplist
context.doc().add(new SortedNumericDocValuesField(fieldType().name(), value));

After (With Skiplist)

// New approach: uses indexedField method for skiplist support
context.doc().add(SortedNumericDocValuesField.indexedField(fieldType().name(), value));

The SortedNumericDocValuesField.indexedField() method is a Lucene API that creates a DocValues field with skiplist support enabled.

  • indexedField() ensures the field is created with skiplist metadata during indexing
  • newSlowExactQuery() can then leverage this metadata for efficient query execution
  • Together, they enable the full skiplist optimization pipeline

What Was Changed

I have added logic to these mappers to utilize the skiplist-enabled indexed field methods, so that when skiplist is available in Lucene, OpenSearch can take advantage of the performance boost.

Performance Chain

  1. Indexing: indexedField() creates fields with skiplist metadata
  2. Query Planning: newSlowExactQuery() detects skiplist availability
  3. Query Execution: Uses DocValuesSkipper to efficiently skip irrelevant blocks
  4. Result: Improved performance for selective queries

Why This Matters

  • Performance:
    • For low-selectivity queries (rare values), the skiplist allows skipping most blocks, reducing query time from linear to sub-linear.
    • Analysis done on Lucene show up to 60x speedup for newSlowExactQuery with skiplist enabled.

References


Example Table (from 17710)

Selectability Cardinality newSlowExactQuery (SkipList=Disabled) newSlowExactQuery (SkipList=Enabled) Speedup
2 10,000,000 197.67 3.33 59x
10 10,000,000 177.67 8.67 20x
... ... ... ... ...

Summary

Enabling skiplist for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch provides a major performance boost for DocValues-based queries, especially for rare values. This change leverages recent Lucene improvements.

Describe the solution you'd like

Review and merge the changes that enable skiplist usage in the relevant field mappers.

Related component

Indexing

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Labels

IndexingIndexing, Bulk Indexing and anything related to indexingIndexing:PerformancePerformanceThis is for any performance related enhancements or bugsenhancementEnhancement or improvement to existing feature or requestlucene

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions