
[RFC] HLD for Dynamic Weight Optimization for Hybrid Search #223

@martin-gaievski

Description


Dynamic Weight Optimization for Hybrid Search

Introduction

Hybrid search combines multiple query types, such as keyword and neural search, to improve search relevance. In 2.11, the team released the hybrid query as part of the neural-search plugin. The main responsibility of the hybrid query is to return the scores of multiple sub-queries, normalized and combined into a single score.

The main way of improving the relevance of hybrid search results is through sub-query weights. By assigning a greater or smaller coefficient to the lexical and semantic sub-queries, we can increase or decrease their respective contribution to the final combined document score. Initially, identifying weights is the user's responsibility. The Search Relevance Workbench introduced a straightforward approach for identifying suitable hybrid search parameters by trying out a hard-coded set of alternatives against a predefined query set and hybrid search configuration.

Problem Statement

With the hybrid experiment, users can identify optimal weights that are best on average for the whole set of documents and queries. However, this approach produces exactly one parameter combination as “the best” one for all queries. With this parameter combination, some queries will benefit (their search quality metrics improve) and other queries will not (their search quality metrics decrease), as described in the “Hybrid Search Optimization” blog post.

This RFC proposes a framework for dynamic weight optimization that can predict query-specific weights. The initial implementation establishes the foundation for this capability, with the expectation that future iterations will improve prediction accuracy through enhanced features and models.

Requirements

Dynamic hybrid search optimization is a search relevance tuning operation for advanced users that requires machine learning knowledge, specifically around feature selection and engineering processes and model training.

Functional Requirements

  • the system should predict weights for the lexical and semantic parts of a hybrid query at a per-query level
  • weight prediction is based on the query text; we add the limitation that this text must be identical across all sub-queries
  • fall back to pre-configured global optimization weights when dynamic weights are not available
  • the framework shall support extensible feature engineering for future enhancements

Non functional requirements

  • minimize added latency: the target is <10ms of additional latency for weight prediction, measured as the increase in 95th percentile query latency
  • provide clear extension points for improved models in future versions

Out of Scope

  • details of how scores are normalized and combined for the hybrid query; we use existing OpenSearch techniques
  • details of training the ML models involved in weight prediction; we care about them as building blocks for our solution and treat them as black boxes

Current State

Currently, weights for score combination are predicted globally for the whole dataset, based on the average relevance metrics from the test query set.

Image

Solution Overview

To achieve the best results in terms of relevance we propose to use ML techniques for weight prediction. Such prediction has the same requirements as any ML-powered application:

  • Feature selection and feature engineering
  • Model training and serving
  • Model inference

This proposal introduces a foundational framework for ML-based weight prediction in hybrid search. The framework supports:

  • extensible feature engineering - starting with basic query features
  • pluggable model architecture - beginning with linear regression
  • scalable inference pipeline - designed for future enhancements

Version 1 Focus: Establish the core framework and processing pipeline, providing a baseline for future improvements rather than optimizing for immediate performance gains.

The following diagram shows the high-level flow needed to train the weight prediction model and use it in hybrid queries.

Image

The following main components are part of the proposed framework.

Feature engineering pipeline
The framework provides a standardized feature extraction system for hybrid queries (a sketch of a possible extension interface follows the list):

  • query-level features: Length, token count, presence of numbers/special characters
  • result-level features: BM25 scores, semantic similarity scores, result counts
  • extensibility: Clear interfaces for adding domain-specific features in future versions
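
Below is a minimal sketch of what such an extension point could look like. The interface and method names are illustrative assumptions, not an existing neural-search API.

import java.util.Map;

/**
 * Hypothetical extension point for feature engineering (names are assumptions,
 * not an existing neural-search API). Each extractor contributes a set of
 * named numeric features computed from the query text.
 */
public interface QueryFeatureExtractor {

    /** Unique name of the feature group, e.g. "basic_query_features". */
    String name();

    /** Computes feature name -> value pairs for the given query text. */
    Map<String, Double> extract(String queryText);
}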

Model integration architecture
Version 1 supports embedded linear regression models with:

  • co-located processing: Models execute within the search pipeline for minimal latency
  • cluster state storage: Model parameters stored as lightweight cluster metadata
  • fallback mechanism: Automatic reversion to static weights when prediction fails

Weight application system
Predicted weights integrate with existing hybrid search components:

  • normalization processor enhancement: Accepts dynamic weights alongside static configuration
  • score combination: Applies predicted weights during the arithmetic mean calculation (see the formula sketch after this list)
  • query DSL compatibility: Works with existing hybrid query syntax
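
For reference, a sketch of the weighted arithmetic mean combination that the predicted weights feed into, shown for two sub-queries and assuming the weights are normalized to sum to one:

\text{score}(d) = w_{\text{lex}} \cdot \tilde{s}_{\text{lex}}(d) + w_{\text{sem}} \cdot \tilde{s}_{\text{sem}}(d), \qquad w_{\text{lex}} + w_{\text{sem}} = 1

where \tilde{s} denotes the normalized sub-query scores (e.g., min-max normalized).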

Option 1: Embedded scorer [Recommended]

The model is trained inside OpenSearch. Trained model parameters are stored separately and the model logic is implemented as Java code. The combination of this logic and the model parameters is what we call the embedded scorer.

Other components of this solution:

  • the normalization processor accepts weights for combination as dynamic parameters
  • a categorization mechanism is needed for query type: lexical/semantic
  • a predefined query template ensures the query text is the same between sub-queries

The following query features are used for model training (a sketch of extracting the basic features follows the list):

  • Basic features
    • query length
    • token count
    • has numbers (boolean)
    • has special characters (boolean)
  • Lexical search result features
    • number of results for the lexical query.
    • maximum title score: maximum score of the titles of the retrieved top 10 documents. The scores are BM25 scores calculated individually per result set. That means that the BM25 score is not calculated on the whole index but only on the retrieved subset for the query, making the scores more comparable to each other and less prone to outliers that could result from high IDF values for very rare query terms.
    • sum of the title scores of the top 10 documents, again calculated per result set. We use the sum of the scores (and no average value) as an aggregate to measure how relevant all retrieved top 10 titles are. BM25 scores are not normalized, so using the sum instead of the average seemed reasonable.
  • Neural search result features
    • maximum semantic score of the retrieved top 10 documents. This is the score we receive for a neural query based on the query’s similarity to the title.
    • average semantic score: In contrast to BM25 scores, the semantic scores are normalized and in the range of 0 to 1. Using the average score seems more reasonable than attempting to calculate the sum.
  • Other, less common domain-specific features; evaluation is needed to determine whether these features are effective and can be collected from the dataset: currency, size, SKU, is question, is medical acronym, has citation, is stock ticker, has price
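
As an illustration, the basic query features could be computed along these lines (a minimal sketch; the class name and feature keys are assumptions):

import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal sketch of extracting the basic query features listed above. */
public final class BasicQueryFeatures {

    public static Map<String, Double> extract(String queryText) {
        Map<String, Double> features = new LinkedHashMap<>();
        String trimmed = queryText == null ? "" : queryText.trim();
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");

        features.put("query_length", (double) trimmed.length());
        features.put("token_count", (double) tokens.length);
        features.put("has_numbers", trimmed.matches(".*\\d.*") ? 1.0 : 0.0);
        // anything that is not a letter, digit or whitespace counts as "special"
        features.put("has_special_characters", trimmed.matches(".*[^\\p{L}\\p{N}\\s].*") ? 1.0 : 0.0);
        return features;
    }
}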

High level workflow

  • Upload data set [Core OpenSearch]
  • Add user queries, OpenSearch DSL query and judgments to search relevance workbench. They are stored as query set, search configuration and judgment ratings. [Search Relevance Workbench]
  • Train the hybrid query weights prediction model, passing the query set, search configuration and judgment ratings ids. Model metadata is stored as part of the search relevance internal index (or possibly kept in memory). This should be exportable as a JSON-like document [Search Relevance Workbench]
  • Import the model parameters and store them as part of the cluster state. Create the embedded scorer using those stored parameters. [Core OpenSearch]
  • At query time, identify whether the query needs dynamic optimization, call the embedded scorer (reading model parameters) and apply the weights to the hybrid query [Neural search]
Image

Pros:

  • fast due to co-location with neural-search plugin code; no transport calls or de/serialization
  • no model deployment and connector needed, which is important for managed cloud environments with limited extensibility
  • depending on how model parameters are stored, they can be manually editable with no need to retrain the model

Cons:

  • limited set of query features is supported (only features defined as part of model training)
  • limited model types are supported due to the complexity of implementing model logic internally (essentially linear regression)
  • a separate persistence mechanism is needed to store the data extracted from the model
  • more error prone compared to using pre-trained models, because a new component is needed that receives the model and performs the calculations
  • the categorization mechanism for query type (lexical/semantic) is limited
  • limitations on query text variability (text must be the same between sub-queries)

Option 2: External simple model

Similar to Option 1, except the weight prediction model is accessed via ml-commons. The model can be simple, like linear regression deployed locally, or a larger LLM hosted remotely.

Image

Pros:

  • flexibility: virtually any model type is supported
  • less error prone: no extra steps of converting the model to Java code (no embedded scorer) or storing model parameters in OpenSearch
  • simpler implementation, greater reuse of existing components

Cons:

  • extra latency due to remote predict calls to the model
  • limited set of query features is supported (only features defined as part of model training)
  • the categorization mechanism for query type (lexical/semantic) is limited
  • limitations on query text variability (text must be the same between sub-queries)
  • extra setup is needed for the model connector
  • may not work in restricted deployment environments due to external model hosting requirements

Option 3: External LLM

This option takes the next step compared to Option 2: instead of a simple model trained on an exact dataset using query features, we can use an LLM and send it the whole query text.

Image

Other components of this solution:

  • the normalization processor accepts weights for combination as dynamic parameters
  • a predefined query template ensures the query text is the same between sub-queries
  • a prompt for the LLM

Pros:

  • flexibility: virtually any model type is supported
  • simplest option: no need to train a model, no extra steps of converting the model to Java code (no embedded scorer), and no need to store model details in OpenSearch
  • less dependent on query text features
  • potentially we can predict which techniques provide the best relevance

Cons:

  • extra latency due to remote predict calls to the model, presumably higher than in Option 2 (100+ ms)
  • potentially limited throughput; the model can throttle requests due to high resource utilization
  • limitations on query text variability (text must be the same between sub-queries)
  • extra setup is needed for the model connector
  • may not work in restricted deployment environments due to external model hosting requirements

Solution Comparison

The solutions offer a tradeoff between flexibility and performance.

Solutions for dynamic optimizer - comparison table

| Criteria | Option 1: Embedded Scorer (Recommended) | Option 2: External Simple Model | Option 3: External LLM |
|---|---|---|---|
| Performance characteristics | | | |
| Latency | Low - Co-located with neural-search plugin | Medium - Network calls required | High - LLM inference time plus network overhead |
| Throughput | High | Medium - Limited by external service | Low - Potential throttling from LLM service |
| Resource utilization | Low - Minimal overhead | Medium | High - LLMs require significant resources |
| Implementation | | | |
| Complexity | Medium - Need to convert models to Java code | Low - Uses standard model interfaces | Low - Uses standard LLM APIs |
| Model types supported | Limited - Primarily linear regression | High - Any supported model type | High - LLMs with prompt engineering |
| Feature engineering effort | High - Careful feature selection needed | High - Same as Option 1 | Low - LLM can process raw queries |
| Operational considerations | | | |
| Managed environment compatibility | Yes | Limited - Depends on connector | Limited - Depends on connector |
| External dependencies | None | Required - Model hosting service | Required - LLM API service |
| Model management | Complex - Need persistence mechanism | Simple - Managed externally | Simple - Managed externally |
| Infrastructure requirements | Minimal | Moderate - Model hosting | High - LLM infrastructure |
| Capabilities | | | |
| Model sophistication | Basic | Moderate | Advanced |
| Adaptability to query variations | Limited | Limited | High - LLMs handle text variations well |
| Contextualization | Low | Low | High - Can understand query intent |
| Feature utilization | Limited to engineered features | Limited to engineered features | Can extract features from raw text |
| Constraints | | | |
| Query text consistency requirements | High - Text must be same between sub-queries | High - Text must be same between sub-queries | Medium - More tolerant of variations |
| Setup complexity | Low | Medium - Requires connector setup | High - LLM integration and prompt engineering |
| Maintainability | Medium - Need to update embedded code | High - External model updates are seamless | High - LLM updates managed by provider |
| Error handling complexity | High - Internal errors harder to debug | Medium | Medium |

Based on how well each solution fits the criteria categories, we arrive at the following recommendations:

Option 1 (Embedded Scorer) is recommended for most use cases due to:

  • best performance characteristics with minimal latency
  • no external dependencies making it compatible with all deployment scenarios including managed cloud environments
  • simplest operational deployment

Option 2 can be considered when:

  • more sophisticated models beyond linear regression are required
  • external model management infrastructure already exists
  • performance is not the primary concern

Option 3 can be considered when:

  • query variations are significant
  • deep understanding of query semantics is required
  • performance can be traded for higher accuracy
  • external LLM infrastructure is already in place

Key Design Decisions

All of the following decisions apply to the recommended solution option (Option 1).

  1. How model data is stored

We can use the cluster state. Model metadata is relatively small (a few KB; the linear regression model for the ESCI dataset was 880 bytes). This storage survives node crashes and cluster restarts, and the data can be retrieved and tweaked by the user if needed. A sketch of what the stored parameters could look like is shown below.
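
The following is a sketch of the model parameters that could be kept in cluster state; the structure and field names are illustrative assumptions.

import java.util.Map;

/**
 * Sketch of model parameters stored in cluster state (structure and field
 * names are assumptions for illustration). For a linear regression model this
 * is just an intercept plus one coefficient per feature, which keeps the
 * payload well under a few KB.
 */
public record WeightPredictionModel(
        String modelId,
        String modelType,                 // e.g. "linear_regression"
        double intercept,
        Map<String, Double> coefficients  // feature name -> coefficient
) {}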

  2. How to identify the sub-query class

We can create a registry of queries and their corresponding types, e.g. match → lexical, neural/knn → semantic, etc. In the case of a compound or complex query we skip dynamic optimization and fall back to static weights. Another option to explore is registering a type for each query class and resolving it with the visitor pattern. A minimal registry sketch is shown below.
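
A minimal sketch of such a registry, assuming a simple name-to-type map (class, enum and query names beyond match/neural/knn are illustrative assumptions):

import java.util.Map;

/** Minimal sketch of a query-type registry (names are assumptions). */
public final class SubQueryTypeRegistry {

    public enum SubQueryType { LEXICAL, SEMANTIC, UNKNOWN }

    // query DSL name -> sub-query class; compound/unknown types fall back to static weights
    private static final Map<String, SubQueryType> REGISTRY = Map.of(
            "match", SubQueryType.LEXICAL,
            "match_phrase", SubQueryType.LEXICAL,
            "multi_match", SubQueryType.LEXICAL,
            "neural", SubQueryType.SEMANTIC,
            "knn", SubQueryType.SEMANTIC
    );

    public static SubQueryType typeOf(String queryName) {
        return REGISTRY.getOrDefault(queryName, SubQueryType.UNKNOWN);
    }
}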

  3. How to extract the query text from the OpenSearch query DSL

A registry of query types with the keys from which the query text can be extracted. We fall back to static weights, or fail, if there are multiple different query texts or an unknown query type.

  4. How to compare relevance metrics during model training

We rely on user-provided judgments for the dataset and queries. Any document-query pair that does not have a judgment rating is considered irrelevant (effectively a judgment rating of 0.0). If judgments are missing, the user can generate them using the Search Relevance Workbench and its LLM judgment generation functionality.

Open Questions

Which simple ML model is most effective?

For the initial version we need to pick one model type that:

  • is relatively simple to convert into Java code
  • provides the most relevant results for random/general datasets

During the POC the following models were tested using the ESCI dataset:

  • linear regression
  • logistic regression
  • gradient boosting
  • random forest (tested for comparison, will be hard to convert to Java)

The following table summarizes the data collected from that POC (a training sketch follows the table):

| Model Type | Accuracy (NDCG@10) | Training Time | Inference Latency | Interpretability | Implementation Complexity | POC Suitability |
|---|---|---|---|---|---|---|
| Linear Regression | 0.82 | <1 sec | <5ms | High | Simple | Excellent |
| Random Forest | 0.87 | 5-10 sec | 15-20ms | Medium | Moderate | Good |
| Neural Network | 0.89 | 30-60 sec | 25-30ms | Low | Complex | Poor |
| XGBoost | 0.88 | 10-15 sec | 10-15ms | Low-Medium | Moderate | Fair |
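
For the linear regression case, training could look roughly like the following sketch. It assumes Apache Commons Math as the fitting library and a training set of per-query feature vectors paired with the lexical weight that scored best for that query; both are assumptions, since the RFC treats model training as a black box.

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

/**
 * Sketch of training the weight-prediction model (assumes Apache Commons Math;
 * the RFC treats training as a black box, so this is illustrative only).
 */
public final class WeightModelTrainer {

    /**
     * @param features one row of query features per training query
     * @param bestLexicalWeight per query, the lexical weight that scored best
     *                          during the hybrid optimizer experiment
     * @return model parameters: index 0 is the intercept, the rest are coefficients
     */
    public static double[] train(double[][] features, double[] bestLexicalWeight) {
        OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
        regression.newSampleData(bestLexicalWeight, features);
        return regression.estimateRegressionParameters();
    }
}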

Short Term/Mid Term/Long Term implementation

In the short term we can start with the Option 1 implementation, where the model is stored locally in the cluster. A few other scoping decisions make sense for the short-term implementation:

  • Use only basic query features (only those that can be extracted from the query text itself)
  • The model type for the embedded scorer is fixed; the exact type will be identified based on benchmark data
  • Judgment ratings (aka ground truth) are provided by the user; we rely on the quality of those judgments

In the mid/long term we will add Option 2 as an additional mode of dynamic optimization. This should increase the variety of supported models for advanced users. Such a change should be backward compatible, but will have limited support in Serverless. More features planned for later phases:

  • complex query features for embedded scorer model

Potential Issues

Known limitations and Future extensions

With the recommended solution option, the following limitations can be assumed:

  • support for limited model types trained on query features:
    • Linear Regression
    • Logistic Regression
    • Polynomial Regression
    • Ridge/Lasso Regression
    • Simple Decision Trees

Solution LLD

Frontend

We need a new screen in the Search Relevance Workbench to start model training. The user needs to input the following information:

  • index (existing OpenSearch index with ingested data)
  • ids for the following entities, which need to be imported beforehand
    • user queries
    • search configuration
    • judgments
  • any model-related information (may not be needed if we go with the simplest form of using a single model type)

The optimal way is to re-use the existing Hybrid Search Optimizer Experiment screen. We can add an “Optimization mode” section with two mutually exclusive options: “Global”, which is what we have today and will be selected by default, and a new “Dynamic” mode.

Following are mocks for the new UI.

This is the Hybrid Search Optimizer Experiment initial screen; the “Global” optimization mode is pre-selected.

Image

This is how screen changes when user selects Dynamic mode for optimization:

Image

Backend

In the Search Relevance Workbench backend we need to add the following components:

  • modify the Experiment API in the Search Relevance Workbench for training the model. This is already an async API, which is a perfect fit because model training can be long running (~10 mins for the linear regression model used in the POC) and a synchronous call would most likely time out. Model parameters are stored in the cluster metadata at the end of training. We keep a minimal record in the Experiment index to allow the user to monitor training progress.
  • a new search processor that identifies whether an incoming query is a hybrid query with the dynamic optimization flag; in that case it extracts query features and calls the embedded scorer to predict weights. Those weights are set in the pipeline context
  • modifications to the existing normalization processor: it needs to read the predicted weights and apply them during score normalization and combination. If for some reason that cannot be done, the system falls back to the static weights provided as part of the pipeline.

Following are details for each of these initial-version items.

Model training

In the Search Relevance Workbench backend we use the existing experiments API.

For a simple case in the initial version we can use a simplified format, omitting parameters that have only one possible value:

PUT /_plugins/search_relevance/experiments
{
    "querySetId": "{{query_set_id}}",
    "searchConfigurationList": ["{{hybrid_search_config_id}}"],
    "size": 10,
    "judgmentList": ["{{judgment_list_id_1}}"],
    "type": "HYBRID_OPTIMIZER", 
    "optimizationMode": "dynamic"
}
| Parameter name | Type | Description | Default value |
|---|---|---|---|
| optimizationMode | keyword | defines the experiment type, allowed values: global, dynamic | global |

Sample response

{
    "experimentId": "{{experimentId}}",
    "modelId": "{{generatedModelId}}",
    "status": "CREATED"
}

To effectively run model training we need to do the following steps:

  • split the training workload into reasonably small tasks
  • run a few tasks in parallel and schedule the rest using a task queue
  • keep draining that task queue until all tasks are executed
  • finalize training results
  • reduce the model training results into a form that can be saved into the cluster state

We use the existing scheduling framework in the Search Relevance Workbench to schedule the smaller training tasks and keep an in-memory queue of pending tasks; a minimal sketch of this pattern is shown below.
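
The sketch below only illustrates the queue-draining pattern with bounded parallelism; the real implementation would go through the Search Relevance Workbench scheduling framework, and the class name is an assumption.

import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Minimal sketch of draining an in-memory queue of training sub-tasks with
 * bounded parallelism (illustrative only).
 */
public final class TrainingTaskRunner {

    public static void runAll(Queue<Runnable> pendingTasks, int parallelism) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(parallelism);
        Runnable task;
        while ((task = pendingTasks.poll()) != null) {
            executor.submit(task); // failed sub-tasks would be retried or counted (see open questions below)
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS); // wait for all sub-tasks, then finalize/reduce results
    }
}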

Based on the existing mapping, the needed extension is minimal. The model id can be stored as part of the experiment “results” structure.

{
  "properties": {
    "id": { "type": "keyword" },
    "timestamp": { "type": "date", "format": "strict_date_time" },
    "type": { "type": "keyword" },
    "status": { "type": "keyword" },
    "querySetId": { "type": "keyword" },
    "searchConfigurationList": { "type": "keyword" },
    "judgmentList": { "type": "keyword" },
    "size": {"type": "keyword"},
    "results": { "type": "object", "dynamic": false },
    "optimizationMode": { "type": "keyword" }
  }
}

Questions for later versions

  • effective retry strategies for failed training sub-tasks (exponential backoff with limited retries)
  • keep a count of failed training sub-tasks; if the number crosses a critical threshold, cancel training and mark the whole process as failed

Embedded Scorer

This component is responsible for loading model parameters and spinning up a Java representation of the model. It can be implemented as part of a phase results processor with the following responsibilities (a minimal inference sketch follows the list):

  • identify if incoming query is a hybrid query
  • read model parameters from cluster state
  • extract features from the incoming hybrid query
  • predict weights based on extracted features and model parameters
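
A minimal sketch of the inference step for a linear model, reusing the illustrative WeightPredictionModel type from the earlier cluster-state sketch (all names are assumptions); it clamps the predicted lexical weight to [0, 1] and falls back to the static weights if anything goes wrong:

import java.util.Map;

/** Illustrative embedded scorer for a linear model (names are assumptions). */
public final class EmbeddedScorer {

    /**
     * @param model model parameters read from cluster state
     * @param features extracted query features
     * @param staticWeights fallback weights from the pipeline configuration
     * @return [lexicalWeight, semanticWeight] summing to 1.0
     */
    public static double[] predictWeights(WeightPredictionModel model,
                                          Map<String, Double> features,
                                          double[] staticWeights) {
        try {
            double lexical = model.intercept();
            for (Map.Entry<String, Double> coefficient : model.coefficients().entrySet()) {
                lexical += coefficient.getValue() * features.getOrDefault(coefficient.getKey(), 0.0);
            }
            lexical = Math.max(0.0, Math.min(1.0, lexical)); // clamp to a valid weight
            return new double[] { lexical, 1.0 - lexical };
        } catch (RuntimeException e) {
            return staticWeights; // fallback mechanism: revert to static weights on any failure
        }
    }
}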

Dynamic query weights in normalization processor

The existing normalization processor needs the following changes:

  • if predicted weights are present, identify the type of each sub-query (lexical vs semantic vs generic)
  • pass predicted weights to scores combiner, where they are applied to normalized scores and final document score is calculated

For both components we can utilize the existing normalization processor. The only interface change needed is adding a model id for weight prediction. The following request example shows a hybrid query with an inline search pipeline definition:

{
    "query": {
        "hybrid": {
            "queries": [
                { "match": { ... } },
                { "neural": { ... } }
            ]
        }
    },
    "search_pipeline": {
        "description": "Hybrid search with ML-based weight optimization",
        "phase_results_processors": [
            {
                "normalization-processor": {
                    "normalization": {
                        "technique": "min_max"
                    },
                    "combination": {
                        "technique": "arithmetic_mean"
                    },
                    "weight_prediction": {
                        "model_id": "{{model_id}}"
                    }
                }
            }
        ]
    }
}

To identify the query class (lexical/semantic) we can prepare a map of query types. A weaker alternative is to request this information from the user (not preferred, as it relies on the user's expertise and good intentions).

Backward Compatibility

This is a new feature, so there are no major BWC concerns. The only potential point of concern is optimizationMode in the experiments API: if this field is not provided, the experiment is treated as a global optimization.

We assume that for this feature the following areas in the Search Relevance Workbench and Neural Search remain stable:

  • query set
  • search configuration
  • judgment ratings
  • normalization processor

Security

The main area of concern is the APIs, since that is where we accept user input. The initial scope is limited in terms of the information we accept with a request; it is mainly ids of existing system entities and text information such as a model id or model description. The impact of malicious input for those parameters can be minimized by following best practices and adding strict validation, such as limiting string length and checking that system entities with the provided ids exist.
Access control for the new API will be the same as for other existing APIs in the Search Relevance Workbench.

Benchmarking

The quality of predictions can be evaluated using existing tools for checking relevance metrics; they are based on the BEIR datasets and the corresponding evaluation tools in their repository. The team can use a customized version of those tools: https://github.com/martin-gaievski/info-retrieval-test/tree/dynamic_optimizer_feature_eng_esci_dataset. As a dataset for evaluation we recommend the ESCI dataset (Amazon product search): https://github.com/amazon-science/esci-data.

At a high level, we run the search workload using globally predicted weights and compare the results with those based on dynamically predicted weights. We use the main relevance metrics to compare model effectiveness: NDCG, Recall, Precision, MAP. The standard NDCG@10 definition is sketched below for reference.
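
For reference, NDCG@10 (the headline metric in the POC comparison above) follows the standard definition, shown here with the linear-gain variant of DCG (the exponential-gain variant 2^{rel_i} - 1 is also common):

\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{rel_i}{\log_2(i + 1)}, \qquad \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}

where rel_i is the judgment rating of the document at rank i and IDCG@10 is the DCG@10 of the ideal (judgment-sorted) ranking.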

References

Feedback Required

Feature engineering priorities: what query and result features would be most valuable for your use cases?

We've identified basic query features (length, token count, special characters) and search result features (BM25 scores, semantic similarities) for the initial framework. However, different domains likely benefit from different feature sets.

  • What domain-specific features have you found effective for search relevance?
  • Are there query characteristics (e.g., intent classification, entity recognition) that significantly impact optimal weight selection in your applications?
  • How do you balance feature richness against inference latency requirements?

Query text consistency requirements: is the requirement for identical query text across sub-queries too restrictive for your hybrid search implementations?

Our current design requires that all sub-queries (lexical, semantic, etc.) use identical query text to enable consistent feature extraction. This simplifies the initial framework but may limit real-world applicability.

  • Do your hybrid queries typically use the same text across sub-queries, or do you often modify text for different query types?
  • Would support for query text variations (with more complex feature extraction) be worth the added implementation complexity?
  • Are there alternative approaches to feature extraction that could handle query text differences while maintaining prediction accuracy?
