
[RFC] Search for context aware segments #19093

@shatejas

Description


1. Overview

Context aware segments (CAS) group segments based on a grouping criteria function. This function can simply return a field, or apply a transform to a criteria field, and the result is stored as the grouping context in a segment info attribute. There is no overlap of data among these groups of segments. CAS allows users to optimize queries by reducing the search space to a specific set of segments based on the provided criteria function. This RFC discusses approaches that enable OpenSearch to search based on a user-provided context.
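For illustration only (the field names and functional shape below are hypothetical, not the CAS configuration API), a grouping criteria function might return a field value directly or derive a coarser bucket from it:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;
    import java.util.function.Function;

    // Hypothetical criteria functions (illustrative only, not the CAS API):
    // 1. Use the raw value of a tenant_id field as the grouping context.
    Function<String, String> tenantCriteria = tenantId -> tenantId;
    // 2. Transform a timestamp field into a monthly bucket that becomes the grouping context.
    Function<Long, String> monthlyCriteria = timestampMillis -> Instant.ofEpochMilli(timestampMillis)
        .atZone(ZoneOffset.UTC)
        .format(DateTimeFormatter.ofPattern("yyyy-MM"));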

2. Problem statement

The existing search implementation iterates through all the segments in an index and executes the search on each of them before combining the results. With CAS storing context information in each segment, executing the search on all segments is wasteful when the search space can be narrowed to the set of segments in a group. The goal of supporting context aware search is to return results to users more efficiently whenever possible.

3. Context based search

On a high level, the problem can be broken down into two parts:

  1. Extracting the criteria from the query: This will help identify the context that is stored in the segment info
  2. Identifying the segments based on the context obtained from the query

3.1 Extracting criteria from the query

OpenSearch supports various query types. CAS is designed to primarily leverage bool queries, allowing users to use a single clause or a combination of clauses to narrow a search to specific criteria.

For simplicity, the initial proposal limits support to term and range queries. Support for other query types can be added iteratively. Moreover, in the initial iteration, the criteria function can only be based on a single field in the mapping.

To reduce the search space, the criteria field is extracted from the query before the query phase begins. This extraction process uses QueryBuilders and recursion, evaluating each QueryBuilder to find the value of the specified field. A current limitation is that this process only handles a single field.
The logic for handling clauses is as follows:

FILTER: If any criteria value is found while evaluating a filter clause, the remaining clauses are ignored, and the search space is pruned based on the values from the filter clause.

MUST: If multiple must clauses return conflicting context values, the result is discarded due to ambiguity. If the must clauses yield a single consistent context value, that value is used. If a criteria value is found in a must clause, any should clauses are ignored.

SHOULD: All criteria values found in should clauses are returned.

Below is the pseudo code, which supports TermQueryBuilder; range support will be added in the actual implementation.

public static Set<String> extractCriteria(QueryBuilder queryBuilder, String criteria) {
        if (criteria == null) {
            return Collections.emptySet();
        }

        // Leaf case: a term query on the criteria field contributes its value.
        if (queryBuilder instanceof TermQueryBuilder) {
            final TermQueryBuilder termQueryBuilder = (TermQueryBuilder) queryBuilder;
            if (termQueryBuilder.fieldName().equals(criteria)) {
                return Set.of(termQueryBuilder.value().toString());
            }
            return Collections.emptySet();
        }

        if (queryBuilder instanceof BoolQueryBuilder) {
            final BoolQueryBuilder boolQueryBuilder = (BoolQueryBuilder) queryBuilder;

            // FILTER: values found in filter clauses take precedence over must/should.
            // Conflicting values across filter clauses are ambiguous, so pruning is skipped.
            final Set<String> filterCriteria = new HashSet<>();
            for (QueryBuilder query : boolQueryBuilder.filter()) {
                final Set<String> values = extractCriteria(query, criteria);
                if (values.isEmpty()) {
                    continue;
                }
                if (filterCriteria.isEmpty()) {
                    filterCriteria.addAll(values);
                } else if (!filterCriteria.equals(values)) {
                    return Collections.emptySet();
                }
            }
            if (!filterCriteria.isEmpty()) {
                return filterCriteria;
            }

            // MUST: a single consistent value across must clauses is used; conflicting
            // values are ambiguous and discard pruning.
            final Set<String> mustCriteria = new HashSet<>();
            for (QueryBuilder query : boolQueryBuilder.must()) {
                final Set<String> values = extractCriteria(query, criteria);
                if (mustCriteria.isEmpty()) {
                    mustCriteria.addAll(values);
                } else if (!(values.isEmpty() || mustCriteria.equals(values))) {
                    return Collections.emptySet();
                }
            }
            if (!mustCriteria.isEmpty()) {
                return mustCriteria;
            }

            // SHOULD: only considered when no filter/must values were found. Every should
            // clause must contribute a value, otherwise the search space cannot be pruned.
            final Set<String> shouldCriteria = new HashSet<>();
            final String minimumShouldMatchString = boolQueryBuilder.minimumShouldMatch();
            // Pseudo code: percentage forms of minimum_should_match are not handled here.
            final int minimumShouldMatch = minimumShouldMatchString == null ? 1 : Integer.parseInt(minimumShouldMatchString);
            if (minimumShouldMatch > 0) {
                for (QueryBuilder query : boolQueryBuilder.should()) {
                    final Set<String> values = extractCriteria(query, criteria);
                    if (values.isEmpty()) {
                        return Collections.emptySet();
                    }
                    shouldCriteria.addAll(values);
                }
            }
            if (!shouldCriteria.isEmpty()) {
                return shouldCriteria;
            }
        }

        return Collections.emptySet();
    }
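For example (field and values are hypothetical, assuming tenant_id is the configured criteria field), a bool query with a term filter on the criteria field would be pruned as follows:

    // Hypothetical usage of extractCriteria; only the filter clause matches the criteria field.
    Set<String> values = extractCriteria(
        QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("tenant_id", "tenant-a"))
            .must(QueryBuilders.matchQuery("title", "context aware segments")),
        "tenant_id"
    );
    // values == {"tenant-a"}: the search space can be narrowed to the segment group for tenant-a.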

3.2 Identifying segments for criteria

Multiple options were considered for this; option 3.2.2 is recommended.

3.2.1 Filtering the segments per search request

This approach aims to filter out segments based on specific criteria while iterating through them for a search. Two implementations must be considered: default search and concurrent search.

For both cases, there are limited ways to pass the request criteria, since the implementation is tied to the IndexSearcher interface.

  • Criteria as a member variable in ContextIndexSearcher (Recommended): During search context creation, each request creates a new ContextIndexSearcher object. The criteria field is set at this time and is subsequently used to filter segments when the search is executed.

  • Criteria as a Member Variable in TopDocsCollector: This collector is passed as a parameter during search execution. This approach would require introducing an additional member variable to the collector.

Default Search

[Diagram: default search flow]

Default search has an existing canMatch function that can be reused to implement logic that validates the criteria of a segment. This method still evaluates all segments but only executes the search on those that meet the extracted criteria.
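A minimal sketch of such a check, assuming the grouping context is stored in a segment attribute named "criteria" (as in the snippets later in this RFC) and that the criteria values extracted from the query are available at this point:

    import java.util.Set;

    import org.apache.lucene.index.LeafReaderContext;
    import org.opensearch.common.lucene.Lucene;

    // Hypothetical canMatch-style check: keep a segment only if its grouping context matches
    // one of the criteria values extracted from the query. Segments without a context
    // attribute are conservatively kept.
    static boolean canMatchCriteria(LeafReaderContext leaf, Set<String> extractedCriteria) {
        if (extractedCriteria.isEmpty()) {
            return true; // nothing was extracted from the query, search all segments
        }
        final String segmentCriteria = Lucene.segmentReader(leaf.reader())
            .getSegmentInfo().info.getAttribute("criteria");
        return segmentCriteria == null || extractedCriteria.contains(segmentCriteria);
    }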

Concurrent Search
Concurrent search has an additional step of slicing leaves into multiple LeafSlice objects. The slicing algorithm, which groups leaves (segments) from the IndexReader, is used to control concurrency. This balances the latency benefits of aggressive concurrency against its CPU utilization.

[Diagram: concurrent search flow]

Similar to default search, the canMatch execution occurs after the slicing algorithm runs. However, considering all leaves for slicing when a search will only be executed on a handful of segments can cause unexpected concurrency behavior. Therefore, for the context-aware segment use case, the criteria filtering must be included within the slicing algorithm. This ensures the slicing algorithm only considers the relevant subset of segments when grouping them into LeafSlice objects.
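A minimal sketch of that filtering step, reusing the hypothetical canMatchCriteria helper sketched above; the filtered list is what the existing slicing algorithm would then receive:

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    import org.apache.lucene.index.LeafReaderContext;

    // Hypothetical: prune the leaves by criteria before slice computation so that concurrency
    // is sized to the segments that will actually be searched. The returned list would be
    // handed to the existing slicing algorithm to build LeafSlice objects.
    static List<LeafReaderContext> criteriaAwareLeaves(List<LeafReaderContext> leaves, Set<String> extractedCriteria) {
        return leaves.stream()
            .filter(leaf -> canMatchCriteria(leaf, extractedCriteria))
            .collect(Collectors.toList());
    }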

Other Challenges
Some vector search use cases, such as rescoring, break the standard execution pattern of using IndexSearcher#createWeight and IndexSearcher#createScorer. Instead, they use IndexSearcher#rewrite to execute the query, which allows for result aggregation at the shard level.

With the proposed approach, filtering occurs after the rewrite and createWeight steps. As a side effect, queries that execute during the rewrite phase will need to implement their own criteria-based filtering logic.

Pros:

  • Uses the existing execution flow, making it simple and straightforward.

Cons:

  • Unnecessarily loops through all segments with each request, which could increase latency.
  • It is not a centralized solution, as some query types will not benefit from it out-of-the-box and will require separate implementations.

3.2.2. Creating criteria based IndexReaders which hold filtered segments per criteria [Recommended]

This approach provides the ContextIndexSearcher with an IndexReader implementation containing leaves that are pre-filtered by criteria. This eliminates the need for runtime filtering. The filtering of segments is moved to the refresh cycle, which creates a data structure to hold segments grouped by criteria. The sequence diagram below illustrates the relevant flow of a search request to better understand this approach.

[Sequence diagram: search request flow with criteria-based readers]

Currently, the Engine uses OpenSearchDirectoryReader as its IndexReader; this reader holds all the segments in the shard. This OpenSearchDirectoryReader is used to create an Engine.Searcher, which in turn is used to create a ContextIndexSearcher.

CriteriaBasedDirectoryReader
This approach introduces a new DirectoryReader implementation: CriteriaBasedDirectoryReader, a logical construct that holds the leaves for a single criterion. The CriteriaBasedDirectoryReader needs to be wrapped by an OpenSearchDirectoryReader to support caching and existing instanceof checks. The following is a snippet of the class (abbreviated as CriteriaBasedReader in the code below):

public class CriteriaBasedReader extends DirectoryReader {

        private final String criteria;
        private final CacheHelper cacheHelper;

        protected CriteriaBasedReader(Directory directory,   // same as parent
                    LeafReader[] subReaders,                  // filtered leaves for this criterion
                    CacheHelper cacheHelper,                  // same as parent
                    String criteria) throws IOException {     // e.g. tenant id

            super(directory, subReaders, null);
            this.criteria = criteria;
            this.cacheHelper = cacheHelper;
        }

        public String getCriteria() {
            return this.criteria;
        }

        // Remaining DirectoryReader overrides (doOpenIfChanged, getReaderCacheHelper, doClose, etc.) omitted.
}

Instantiation of CriteriaBasedDirectoryReader during refresh

This reader is created and wrapped by OpenSearchDirectoryReader during the refresh cycle. The creation of the reader was not chosen to occur at search request runtime for the following reasons:

  • To eliminate the latency concerns of the ‘Filtering the segments per search request’ approach.
  • To support request caching (more on this later).

When required, the refresh process creates a new OpenSearchDirectoryReader object with each execution. During the constructor's execution, all leaves are grouped into a map data structure and a corresponding CriteriaBasedDirectoryReader is created for each criterion. The lifecycle of this map is tied to the lifecycle of the OpenSearchDirectoryReader object. A new method is needed in OpenSearchDirectoryReader to access the reader for a given criterion from the map.

Note that the leaves stored in the map are just references and will be closed when the parent directory reader is closed.

Relevant code snippet

    // Map of criteria -> wrapped reader, populated on refresh. Its lifecycle is tied to this reader.
    private final Map<String, OpenSearchDirectoryReader> criteriaBasedReaders = new HashMap<>();

    private OpenSearchDirectoryReader(DirectoryReader in, FilterDirectoryReader.SubReaderWrapper wrapper, ShardId shardId)
        throws IOException {
        super(in, wrapper);
        this.wrapper = wrapper;
        this.shardId = shardId;
        this.delegatingCacheHelper = new DelegatingCacheHelper(in.getReaderCacheHelper());
        // Avoid warming up recursively when wrapping a CriteriaBasedReader itself.
        if (!(in instanceof CriteriaBasedReader)) {
            warmUpCriteriaBasedReader(in.leaves());
        }
    }

    private void warmUpCriteriaBasedReader(final List<LeafReaderContext> leaves) throws IOException {
        // Group the leaves (segments) by the criteria attribute stored in their segment info.
        final Map<String, List<LeafReader>> criteriaBasedLeafReaders = new HashMap<>();
        for (LeafReaderContext leafReaderContext : leaves) {
            final String criteria = Lucene.segmentReader(leafReaderContext.reader())
                .getSegmentInfo().info.getAttribute("criteria");
            if (criteria != null) {
                criteriaBasedLeafReaders.computeIfAbsent(criteria, k -> new ArrayList<>())
                    .add(leafReaderContext.reader());
            }
        }

        // Important: create the readers here and not at search time so the cache key does not change.
        for (Map.Entry<String, List<LeafReader>> entry : criteriaBasedLeafReaders.entrySet()) {
            final String criteria = entry.getKey();
            final CriteriaBasedReader criteriaBasedReader =
                new CriteriaBasedReader(directory, entry.getValue().toArray(new LeafReader[0]), in.getReaderCacheHelper(), criteria);
            criteriaBasedReaders.put(
                criteria,
                new OpenSearchDirectoryReader(criteriaBasedReader, new FilterDirectoryReader.SubReaderWrapper() {
                    @Override
                    public LeafReader wrap(LeafReader reader) {
                        return reader;
                    }
                }, shardId())
            );
        }
    }

    public OpenSearchDirectoryReader getCriteriaBasedReader(String criteria) throws IOException {
        final OpenSearchDirectoryReader criteriaBasedReader = criteriaBasedReaders.get(criteria);
        if (criteriaBasedReader == null) {
            // No segments exist for this criteria: return an empty reader.
            // TODO: needs a constant cache key here.
            return new OpenSearchDirectoryReader(
                new CriteriaBasedReader(directory, new LeafReader[0], in.getReaderCacheHelper(), criteria),
                new FilterDirectoryReader.SubReaderWrapper() {
                    @Override
                    public LeafReader wrap(LeafReader reader) {
                        return reader;
                    }
                },
                shardId()
            );
        }
        return criteriaBasedReader;
    }

Request caching
Cache key for request caching contains:

  • ShardId
  • Query source bytes: Query string converted to bytes
  • Reader cache key: A uuid generated in OpenSearchDirectoryReader
  • Hashcode of index shard

To cache the request for a given criteria, the reader cache key must not change between requests. This requires a constant cache key for each CriteriaBasedDirectoryReader. This is achieved by wrapping the reader with OpenSearchDirectoryReader and storing it in a data structure to ensure the same reader instance is used for each request. The refresh cycle is the most logical place to manage this, ensuring readers are updated only when necessary.
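As a small illustration (assuming the getCriteriaBasedReader method sketched above), caching works only because repeated requests for the same criteria receive the same wrapped reader instance between refreshes, and therefore the same reader cache key:

    // Hypothetical: two requests for the same criteria between refreshes must observe the
    // same reader cache key for the request cache to produce hits.
    OpenSearchDirectoryReader first = topLevelReader.getCriteriaBasedReader("tenant-a");
    OpenSearchDirectoryReader second = topLevelReader.getCriteriaBasedReader("tenant-a");
    assert first.getReaderCacheHelper().getKey() == second.getReaderCacheHelper().getKey();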

POC code: https://github.com/shatejas/OpenSearch/commits/multi-iw-2/?since=2025-06-17&until=2025-07-03

Pros:

  • Eliminates the latency impact, since filtering is done on refresh
  • Out-of-the-box experience for all Query implementations

Cons:

  • Introduces a new logical directory reader implementation

Score differences

With this approach, there are differences in BM25 score calculations because BM25 depends on inverse document frequency (IDF) and term frequency (TF). Both are computed over the search space considered, so pruning the search space reduces the overall scores of hits (see the formulas below).

  • IDF considers the number of documents containing the term (n) as well as the total number of documents (N).
  • TF considers the frequency of the term in the document (freq), the document length (dl), and the average document length (avgdl).
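For reference, Lucene's BM25 computes the per-term contribution as follows (k_1 and b are the usual BM25 parameters); shrinking the search space changes N, n, and avgdl, and therefore the score:

$$
\mathrm{idf}(t) = \ln\left(1 + \frac{N - n + 0.5}{n + 0.5}\right),
\qquad
\mathrm{score}(t, d) = \mathrm{idf}(t) \cdot \frac{freq}{freq + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}
$$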

#18807
