Dynamic Weight Optimization for Hybrid Search
Introduction
Hybrid search combines multiple query types, like keyword and neural search, to improve search relevance. In 2.11, the team released the hybrid query as part of the neural-search plugin. The main responsibility of the hybrid query is to return the scores of multiple sub-queries, normalized and combined.
The main way to improve the relevance of hybrid search results is through sub-query weights. By assigning a larger or smaller coefficient to the lexical and semantic sub-queries, we can increase or decrease their respective contributions to the final combined document score (a minimal sketch of this weighted combination follows). Initially, identifying weights is the user's responsibility. The Search Relevance Workbench introduced a straightforward approach for identifying suitable hybrid search parameters by trying out a hard-coded set of alternatives with a predefined query set and hybrid search configuration.
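To make the effect of the weights concrete, here is a minimal sketch of a weighted arithmetic mean combination over already-normalized sub-query scores; the class and method names are illustrative, not the actual neural-search implementation.

```java
/** Illustrative weighted arithmetic mean over normalized sub-query scores. */
public final class WeightedCombinationExample {

    /** Combines normalized sub-query scores using the supplied per-sub-query weights. */
    static double combine(double[] normalizedScores, double[] weights) {
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < normalizedScores.length; i++) {
            weightedSum += weights[i] * normalizedScores[i];
            weightTotal += weights[i];
        }
        return weightTotal == 0.0 ? 0.0 : weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        double[] scores = { 0.8, 0.4 }; // lexical, semantic (already normalized)
        System.out.println(combine(scores, new double[] { 0.5, 0.5 })); // 0.6 - equal weights
        System.out.println(combine(scores, new double[] { 0.3, 0.7 })); // 0.52 - semantic favored
    }
}
```

Shifting weight toward the semantic sub-query lowers the combined score here because the lexical sub-query scored higher; for other queries the opposite holds, which is exactly why per-query weights matter.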
Problem Statement
With the hybrid optimizer experiment, users can identify the weights that are best on average for the whole set of documents and queries. However, this approach produces exactly one parameter combination as "the best" one for all queries. With this parameter combination, some queries benefit (their search quality metrics improve) while others do not (their search quality metrics decrease), as described in the "Hybrid Search Optimization" blog post.
This RFC proposes a framework for dynamic weight optimization that can predict query-specific weights. The initial implementation establishes the foundation for this capability, with the expectation that future iterations will improve prediction accuracy through enhanced features and models.
Requirements
Dynamic hybrid search optimization is a search relevance tuning operation for advanced users that requires machine learning knowledge, specifically around feature selection and engineering processes and model training.
Functional Requirements
- the system should predict weights for the lexical and semantic parts of a hybrid query at the "per query" level
- weight prediction is based on the query text; we add the limitation that this text must be identical across all sub-queries
- fall back to pre-configured global optimization weights when dynamic weights are not available
- the framework shall support extensible feature engineering for future enhancements
Non-functional requirements
- minimize added latency: the target is <10 ms of additional latency for weight prediction, measured as the 95th-percentile query latency increase
- provide clear extension points for improved models in future versions
Out of Scope
- details of how scores are normalized and combined for the hybrid query; we use existing OpenSearch techniques
- details of training the ML models involved in weight prediction; we care about them as building blocks for our solution and treat them as black boxes
Current State
Currently, weights for score combination are determined globally for the whole dataset, based on the average relevance metrics from the test query set.

Solution Overview
To achieve the best results in terms of relevance, we propose using ML techniques for weight prediction. Such prediction has the same requirements as any ML-powered application:
- Feature selection and feature engineering
- Model training and serving
- Model inference
This proposal introduces a foundational framework for ML-based weight prediction in hybrid search. The framework supports:
- extensible feature engineering - starting with basic query features
- pluggable model architecture - beginning with linear regression
- scalable inference pipeline - designed for future enhancements
Version 1 Focus: Establish the core framework and processing pipeline, providing a baseline for future improvements rather than optimizing for immediate performance gains.
The following diagram shows the high-level flow needed to train the weight prediction model and use it in hybrid queries.

The following main components are part of the proposed framework.
Feature engineering pipeline
The framework provides a standardized feature extraction system for hybrid queries (a minimal interface sketch follows this list):
- query-level features: Length, token count, presence of numbers/special characters
- result-level features: BM25 scores, semantic similarity scores, result counts
- extensibility: Clear interfaces for adding domain-specific features in future versions
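A minimal sketch of what such an extension point could look like; the interface and class below are hypothetical, not an existing neural-search API.

```java
import java.util.Map;

/** Hypothetical extension point for query-level and result-level feature extractors. */
interface QueryFeatureExtractor {

    /** Unique name of the produced feature, e.g. "token_count". */
    String featureName();

    /**
     * Extracts a numeric feature value from the query text and, optionally,
     * per-sub-query result statistics (e.g. top-10 BM25 or semantic scores keyed by sub-query class).
     */
    double extract(String queryText, Map<String, double[]> topScoresBySubQuery);
}

/** Example implementation: basic query-level feature that counts whitespace-separated tokens. */
final class TokenCountFeature implements QueryFeatureExtractor {

    @Override
    public String featureName() {
        return "token_count";
    }

    @Override
    public double extract(String queryText, Map<String, double[]> topScoresBySubQuery) {
        return queryText == null || queryText.isBlank() ? 0.0 : queryText.trim().split("\\s+").length;
    }
}
```

Domain-specific extractors in later versions would implement the same interface, so the inference pipeline does not need to change when features are added.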
Model integration architecture
Version 1 supports embedded linear regression models with:
- co-located processing: Models execute within the search pipeline for minimal latency
- cluster state storage: Model parameters stored as lightweight cluster metadata
- fallback mechanism: Automatic reversion to static weights when prediction fails
Weight application system
Predicted weights integrate with existing hybrid search components:
- normalization processor enhancement: Accepts dynamic weights alongside static configuration
- score combination: Applies predicted weights during arithmetic mean calculation
- query DSL compatibility: Works with existing hybrid query syntax
Option 1: Embedded scorer [Recommended]
The model is trained inside OpenSearch. Trained model parameters are stored separately, and the model logic is implemented as Java code. The combination of this logic and the model parameters is what we call the embedded scorer.
Other components of this solution:
- the normalization processor accepts weights for combination as dynamic parameters
- a categorization mechanism for query type (lexical/semantic) is needed
- a predefined query template to make sure the query text is the same across sub-queries
The following query features are used for model training (a sketch of the result-level aggregates follows this list):
- Basic features
- query length
- token count
- has numbers (boolean)
- has special characters (boolean)
- Lexical search result features
- number of results for the lexical query.
- maximum title score: maximum score of the titles of the retrieved top 10 documents. The scores are BM25 scores calculated individually per result set. That means that the BM25 score is not calculated on the whole index but only on the retrieved subset for the query, making the scores more comparable to each other and less prone to outliers that could result from high IDF values for very rare query terms.
- sum of the title scores of the top 10 documents, again calculated per result set. We use the sum of the scores (and no average value) as an aggregate to measure how relevant all retrieved top 10 titles are. BM25 scores are not normalized, so using the sum instead of the average seemed reasonable.
- Neural search result features
- maximum semantic score of the retrieved top 10 documents. This is the score we receive for a neural query based on the query’s similarity to the title.
- average semantic score: In contrast to BM25 scores, the semantic scores are normalized and in the range of 0 to 1. Using the average score seems more reasonable than attempting to calculate the sum.
- Other, less common, domain-specific features; evaluation is needed to check whether these features are effective and can be collected from the dataset: currency, size, SKU, is question, is medical acronym, has citation, is stock ticker, has price
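As an illustration of the result-level aggregates described above, here is a minimal sketch; the method names and the assumption that the caller has already collected the top-10 scores per sub-query are hypothetical.

```java
import java.util.Arrays;

/** Illustrative computation of the lexical and neural result-level features. */
final class ResultLevelFeatures {

    /** Maximum BM25 title score among the retrieved top-10 documents of the lexical sub-query. */
    static double maxTitleScore(double[] top10Bm25TitleScores) {
        return Arrays.stream(top10Bm25TitleScores).max().orElse(0.0);
    }

    /** Sum of the BM25 title scores of the top-10 documents; BM25 is unbounded, so a sum is used instead of an average. */
    static double sumTitleScores(double[] top10Bm25TitleScores) {
        return Arrays.stream(top10Bm25TitleScores).sum();
    }

    /** Maximum semantic score of the top-10 documents returned by the neural sub-query. */
    static double maxSemanticScore(double[] top10SemanticScores) {
        return Arrays.stream(top10SemanticScores).max().orElse(0.0);
    }

    /** Average semantic score; semantic scores are normalized to [0, 1], so an average is meaningful. */
    static double avgSemanticScore(double[] top10SemanticScores) {
        return Arrays.stream(top10SemanticScores).average().orElse(0.0);
    }
}
```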
High level workflow
- Upload data set [Core OpenSearch]
- Add user queries, an OpenSearch DSL query, and judgments to Search Relevance Workbench. They are stored as a query set, a search configuration, and judgment ratings. [Search Relevance Workbench]
- Train the hybrid query weight prediction model, passing the query set, search configuration, and judgment rating IDs. Model metadata is stored as part of the search relevance internal index (or possibly just kept in memory). It should be an exportable JSON-like document. [Search Relevance Workbench]
- Import the model parameters and store them as part of the cluster state. Create the embedded scorer using those stored parameters. [Core OpenSearch]
- At query time, identify whether the query needs dynamic optimization, call the embedded scorer (reading the model parameters), and apply the weights to the hybrid query. [Neural search]

Pros:
- fast due to co-location with the neural-search plugin code; no transport calls or de/serialization
- no model deployment or connector needed, which is important for managed cloud environments with limited extensibility
- depending on how model parameters are stored, they can be manually edited without retraining the model
Cons:
- a limited set of query features is supported (only features defined as part of model training)
- limited model types are supported due to the complexity of implementing model logic internally (initially linear regression)
- a separate persistence mechanism is needed to store data extracted from the model
- more error-prone compared to pre-trained models, because we need a new component that receives the model and does the calculations
- the categorization mechanism for query type (lexical/semantic) is limited
- limitations on query text variability (text must be the same across sub-queries)
Option 2: External simple model
Similar to Option 1, except that the weight prediction model is accessed via ml-commons. The model can be simple, like a locally deployed linear regression, or a larger LLM hosted remotely.

Pros:
- flexibility, virtually any model type is supported
- less error-prone: no extra steps of converting the model to Java code (no embedded scorer) or storing model parameters in OpenSearch
- simpler implementation, greater reuse of existing components
Cons:
- extra latency due to remote predict calls to model
- a limited set of query features is supported (only features defined as part of model training)
- the categorization mechanism for query type (lexical/semantic) is limited
- limitations on query text variability (text must be the same across sub-queries)
- need extra setup for model connector
- may not work in restricted deployment environments due to external model hosting requirements
Option 3: External LLM
This option takes the next step compared to Option 2: instead of a simple model trained on a specific dataset using query features, we use an LLM and pass it the whole query text.

Other components of this solution:
- the normalization processor accepts weights for combination as dynamic parameters
- a predefined query template to make sure the query text is the same across sub-queries
- prompt for LLM
Pros:
- flexibility, virtually any model type is supported
- simplest option: no need to train a model, no extra steps of converting a model to Java code (no embedded scorer), no need to store model details in OpenSearch
- less dependent on query text features
- potentially we can predict which techniques provide the best relevance
Cons:
- extra latency due to remote predict calls to the model, presumably higher than in Option 2 (100+ ms)
- potentially limited throughput; the model can throttle requests due to high resource utilization
- limitations on query text variability (text must be same between sub-queries)
- need extra setup for model connector
- may not work in restricted deployment environments due to external model hosting requirements
Solution Comparison
The solutions offer a tradeoff between flexibility and performance.
Solutions for dynamic optimizer - comparison table
Criteria | Option 1: Embedded Scorer (Recommended) | Option 2: External Simple Model | Option 3: External LLM |
---|---|---|---|
Performance characteristics | | | |
Latency | Low - Co-located with neural-search plugin | Medium - Network calls required | High - LLM inference time plus network overhead |
Throughput | High | Medium - Limited by external service | Low - Potential throttling from LLM service |
Resource utilization | Low - Minimal overhead | Medium | High - LLMs require significant resources |
Implementation | | | |
Complexity | Medium - Need to convert models to Java code | Low - Uses standard model interfaces | Low - Uses standard LLM APIs |
Model types supported | Limited - Primarily linear regression | High - Any supported model type | High - LLMs with prompt engineering |
Feature engineering effort | High - Careful feature selection needed | High - Same as Option 1 | Low - LLM can process raw queries |
Operational considerations | | | |
Managed environment compatibility | Yes | Limited - Depends on connector | Limited - Depends on connector |
External dependencies | None | Required - Model hosting service | Required - LLM API service |
Model management | Complex - Need persistence mechanism | Simple - Managed externally | Simple - Managed externally |
Infrastructure requirements | Minimal | Moderate - Model hosting | High - LLM infrastructure |
Capabilities | | | |
Model sophistication | Basic | Moderate | Advanced |
Adaptability to query variations | Limited | Limited | High - LLMs handle text variations well |
Contextualization | Low | Low | High - Can understand query intent |
Feature utilization | Limited to engineered features | Limited to engineered features | Can extract features from raw text |
Constraints | | | |
Query text consistency requirements | High - Text must be same between sub-queries | High - Text must be same between sub-queries | Medium - More tolerant of variations |
Setup complexity | Low | Medium - Requires connector setup | High - LLM integration and prompt engineering |
Maintainability | Medium - Need to update embedded code | High - External model updates are seamless | High - LLM updates managed by provider |
Error handling complexity | High - Internal errors harder to debug | Medium | Medium |
Based on how well each solution fits the criteria categories, we make the following recommendations:
Option 1 (Embedded Scorer) is recommended for most use cases due to:
- best performance characteristics with minimal latency
- no external dependencies making it compatible with all deployment scenarios including managed cloud environments
- simplest operational deployment
Option 2 can be reconsidered when:
- more sophisticated models beyond linear regression are required
- external model management infrastructure already exists
- performance is not the primary concern
Option 3 can be reconsidered when:
- query variations are significant
- deep understanding of query semantics is required
- performance can be traded for higher accuracy
- external LLM infrastructure is already in place
Key Design Decisions
All following decisions are for recommended solution option.
- How model data is stored
We can use the cluster state. Model metadata is relatively small (a few KB; the linear regression model for the ESCI dataset was 880 bytes). This storage survives node crashes and cluster restarts, and the metadata can be retrieved and tweaked by the user if needed.
- How to identify the sub-query class
We can create a registry of query types and their corresponding classes, e.g. match → lexical, neural/knn → semantic, etc. For a compound or complex query, we skip dynamic optimization and fall back to static weights. Another option to explore is registering a class for each query type and traversing the query with a visitor pattern. A sketch of such a registry follows this list.
- How to extract the query text from the OpenSearch query DSL
A registry of query types together with the keys under which the query text can be extracted. We fall back to static weights (or fail) if there are multiple different query texts or an unknown query type.
- How to compare relevance metrics during model training
We rely on user-provided judgments for the dataset and queries. Any document-query pair without a judgment rating is considered irrelevant (effectively a judgment rating of 0.0). If judgments are missing, the user can generate them using the Search Relevance Workbench LLM judgment generation functionality.
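As an illustration of the registry idea above, here is a minimal sketch; the enum, the map contents, and the field keys are illustrative assumptions, not an existing neural-search component.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative registry mapping query DSL types to a sub-query class and the key holding the query text. */
final class SubQueryTypeRegistry {

    enum SubQueryClass { LEXICAL, SEMANTIC }

    // Query type -> sub-query class (contents are illustrative, not exhaustive).
    private static final Map<String, SubQueryClass> QUERY_CLASSES = Map.of(
        "match", SubQueryClass.LEXICAL,
        "match_phrase", SubQueryClass.LEXICAL,
        "multi_match", SubQueryClass.LEXICAL,
        "neural", SubQueryClass.SEMANTIC,
        "knn", SubQueryClass.SEMANTIC
    );

    // Query type -> key under which the query text can be found (illustrative).
    private static final Map<String, String> TEXT_KEYS = Map.of(
        "match", "query",
        "multi_match", "query",
        "neural", "query_text"
    );

    /** Returns the sub-query class, or empty for unknown/compound queries so the caller can fall back to static weights. */
    static Optional<SubQueryClass> classify(String queryType) {
        return Optional.ofNullable(QUERY_CLASSES.get(queryType));
    }

    /** Returns the key under which the query text lives for the given query type, if known. */
    static Optional<String> queryTextKey(String queryType) {
        return Optional.ofNullable(TEXT_KEYS.get(queryType));
    }
}
```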
Open Questions
Which simple ML model is most effective?
For the initial version we need to pick one model type that:
- is relatively simple to convert into Java code
- provides the most relevant results for general datasets
While doing a POC, we tested the following models using the ESCI dataset:
- linear regression
- logistic regression
- gradient boosting
- random forest (tested for comparison; will be hard to convert to Java)
The following table shows the data collected from that POC:
Model Type | Accuracy (NDCG@10) | Training Time | Inference Latency | Interpretability | Implementation Complexity | POC Suitability |
---|---|---|---|---|---|---|
Linear Regression | 0.82 | <1 sec | <5ms | High | Simple | Excellent |
Random Forest | 0.87 | 5-10 sec | 15-20ms | Medium | Moderate | Good |
Neural Network | 0.89 | 30-60 sec | 25-30ms | Low | Complex | Poor |
XGBoost | 0.88 | 10-15 sec | 10-15ms | Low-Medium | Moderate | Fair |
Short Term/Mid Term/Long Term implementation
In the short term, we can start with the Option 1 implementation, where the model is stored locally in the cluster. A few other scope limitations make sense for the short-term implementation:
- Use only basic query features (only those that can be extracted from the query text itself)
- The model type for the embedded scorer is fixed; the exact type will be identified based on benchmark data
- Judgment ratings (i.e., ground truth) are provided by the user; we rely on the quality of those judgments
In the mid/long term, we will add Option 2 as an additional mode of dynamic optimization. This should increase the variety of supported models for advanced users. Such a change should be backward compatible but will have limited support in Serverless. More features planned for later phases:
- complex query features for embedded scorer model
Potential Issues
Known limitations and Future extensions
With the recommended solution option, the following limitations can be assumed:
- support is limited to model types that can be trained on query features:
- Linear Regression
- Logistic Regression
- Polynomial Regression
- Ridge/Lasso Regression
- Simple Decision Trees
Solution LLD
Frontend
We need a new screen in Search Relevance Workbench to start model training. The user needs to input the following information:
- index (an existing OpenSearch index with ingested data)
- IDs for the following entities, which need to be imported beforehand:
- user queries
- search configuration
- judgments
- any model-related information (may not be needed if we go with the simplest form of using a single model type)
The optimal way is to reuse the existing Hybrid Search Optimizer Experiment screen. We can add an "Optimization mode" section with two mutually exclusive options: "Global", which is what we have today and will be selected by default, and a new "Dynamic" mode.
The following are mocks for the new UI.
This is the Hybrid Search Optimizer Experiment initial screen, with the "Global" optimization mode pre-selected.

This is how the screen changes when the user selects the Dynamic optimization mode:

Backend
In Search Relevance Workbench backend we need to add following components:
- modify the Experiment API in Search Relevance Workbench to support training the model. This API is already asynchronous, which is a good fit because model training can be long running (~10 minutes for the linear regression model used in the POC) and would most likely time out otherwise. Model parameters are stored in the cluster metadata at the end of training. We keep a minimal record in the experiment index to allow the user to monitor training progress.
- a new search processor that identifies whether an incoming query is a hybrid query with the dynamic optimization flag and, in that case, extracts query features and calls the embedded scorer to predict weights. Those weights are set in the pipeline context.
- modifications to the existing normalization processor: it needs to read the predicted weights and apply them during score normalization and combination. If for some reason that cannot be done, the system falls back to the static weights provided as part of the pipeline.
The following are details for each of these initial-version items.
Model training
In the Search Relevance Workbench backend we use the existing experiments API.
For the simple case in the initial version, we can use a simplified format, omitting parameters that have only one possible value:
PUT /_plugins/search_relevance/experiments
{
  "querySetId": "{{query_set_id}}",
  "searchConfigurationList": ["{{hybrid_search_config_id}}"],
  "size": 10,
  "judgmentList": ["{{judgment_list_id_1}}"],
  "type": "HYBRID_OPTIMIZER",
  "optimizationMode": "dynamic"
}
Parameter name | Type | Description | Default value |
---|---|---|---|
optimizationMode | keyword | Defines the optimization mode. Allowed values: `global`, `dynamic` | `global` |
Sample response
{
  "experimentId": "{{experimentId}}",
  "modelId": "{{generatedModelId}}",
  "status": "CREATED"
}
To run model training effectively, we need to do the following:
- split the training workload into reasonably small tasks
- run a few tasks in parallel and schedule the rest using a task queue
- keep draining that task queue until all tasks are executed
- finalize training results
- reduce the model training results into a form that can be saved into the cluster state
We use the existing scheduling framework in Search Relevance Workbench to schedule the smaller training tasks and keep an in-memory queue of pending tasks; a sketch of this draining loop follows.
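A minimal sketch of the split-and-drain idea, assuming hypothetical task and callback types; the actual Search Relevance Workbench scheduling framework will differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of splitting the training workload and draining an in-memory task queue. */
final class TrainingTaskRunner {

    private static final int MAX_PARALLEL_TASKS = 4; // illustrative parallelism limit

    static void runTraining(List<Runnable> trainingTasks, Runnable finalizeAndPersist) throws InterruptedException {
        Deque<Runnable> pending = new ArrayDeque<>(trainingTasks);
        ExecutorService executor = Executors.newFixedThreadPool(MAX_PARALLEL_TASKS);
        Semaphore permits = new Semaphore(MAX_PARALLEL_TASKS);

        // Keep draining the queue: at most MAX_PARALLEL_TASKS tasks run at any time.
        while (!pending.isEmpty()) {
            permits.acquire();
            Runnable task = pending.poll();
            executor.submit(() -> {
                try {
                    task.run();
                } finally {
                    permits.release();
                }
            });
        }

        // Wait for in-flight tasks, then reduce/persist the results (e.g. into cluster state).
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        finalizeAndPersist.run();
    }
}
```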
Based on the existing mapping, the needed extension is minimal. The model ID can be stored as part of the experiment "results" structure.
{
  "properties": {
    "id": { "type": "keyword" },
    "timestamp": { "type": "date", "format": "strict_date_time" },
    "type": { "type": "keyword" },
    "status": { "type": "keyword" },
    "querySetId": { "type": "keyword" },
    "searchConfigurationList": { "type": "keyword" },
    "judgmentList": { "type": "keyword" },
    "size": { "type": "keyword" },
    "results": { "type": "object", "dynamic": false },
    "optimizationMode": { "type": "keyword" }
  }
}
Questions for later versions
- effective retry strategies for failed training sub-tasks (exponential backoff with limited retries)
- keep a count of failed training sub-tasks; if the number crosses a critical threshold, cancel training and mark the whole process as failed
Embedded Scorer
This component is responsible for loading model parameters and spinning up a Java representation of the model. It can be implemented as part of a phase results processor with the following responsibilities (a minimal sketch follows this list):
- identify if incoming query is a hybrid query
- read model parameters from cluster state
- extract features from the incoming hybrid query
- predict weights based on extracted features and model parameters
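A minimal sketch of the embedded linear regression scorer, assuming the coefficients and intercept have already been read from the cluster state; the class and method names are illustrative.

```java
/** Illustrative embedded linear regression scorer: parameters come from cluster state, logic lives in Java. */
final class EmbeddedLinearScorer {

    private final double[] coefficients; // one coefficient per extracted query feature
    private final double intercept;

    EmbeddedLinearScorer(double[] coefficients, double intercept) {
        this.coefficients = coefficients;
        this.intercept = intercept;
    }

    /**
     * Predicts the lexical sub-query weight from the feature vector; the semantic weight
     * is its complement so that the two weights sum to 1.0.
     */
    double[] predictWeights(double[] features) {
        double raw = intercept;
        for (int i = 0; i < coefficients.length; i++) {
            raw += coefficients[i] * features[i];
        }
        double lexicalWeight = Math.min(1.0, Math.max(0.0, raw)); // clamp to [0, 1]
        return new double[] { lexicalWeight, 1.0 - lexicalWeight };
    }
}
```

If feature extraction or parameter loading fails, the processor would skip prediction entirely and let the normalization processor use the static weights, matching the fallback requirement above.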
Dynamic query weights in normalization processor
The existing normalization processor needs the following changes:
- if predicted weights are present, identify the type of each sub-query (lexical vs. semantic vs. generic)
- pass the predicted weights to the score combiner, where they are applied to the normalized scores and the final document score is calculated
For both components we can utilize the existing normalization processor. The only interface change needed is adding a model ID for weight prediction. The following request example shows a hybrid query with an inline definition of the search pipeline:
{
  "query": {
    "hybrid": {
      "queries": [
        { "match": { ... } },
        { "neural": { ... } }
      ]
    }
  },
  "search_pipeline": {
    "description": "Hybrid search with ML-based weight optimization",
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "min_max"
          },
          "combination": {
            "technique": "arithmetic_mean"
          },
          "weight_prediction": {
            "model_id": "{{model_id}}"
          }
        }
      }
    ]
  }
}
To identify the query class (lexical/semantic), we can prepare a map of query types (see the registry sketch under Key Design Decisions). A weaker alternative is to request this information from the user (not preferred, as it relies on the user's expertise and good intentions).
Backward Compatibility
This is a new feature, so there are no major concerns regarding backward compatibility. The only potential point of concern is optimizationMode in the experiments API: if this field is not provided, we treat the experiment as a global optimization.
We assume that for this feature the following areas in Search Relevance Workbench and Neural Search remain stable:
- query set
- search configuration
- judgment ratings
- normalization processor
Security
The main area of concern is the APIs, since that is where we accept user input. The initial scope is limited in terms of the information accepted with a request: mainly IDs of existing system entities and text information such as a model ID or model description. The impact of malicious input in those parameters can be minimized by following best practices and adding strict parameter validation, e.g., string length limits and checks that system entities with the provided IDs exist.
Access control for the new API will be the same as for the other existing APIs in Search Relevance Workbench.
Benchmarking
The quality of predictions can be evaluated using existing tools for checking relevance metrics; they are based on BEIR datasets and the corresponding evaluation tools in their repository. The team can use a customized version of those tools: https://github.com/martin-gaievski/info-retrieval-test/tree/dynamic_optimizer_feature_eng_esci_dataset. As the evaluation dataset we recommend the ESCI dataset (Amazon product search): https://github.com/amazon-science/esci-data.
At a high level, we run the search workload using globally predicted weights and compare the results with those based on dynamically predicted weights. We use the main relevance metrics to compare model effectiveness: NDCG, Recall, Precision, and MAP.
References
- Blog “Optimizing hybrid search in OpenSearch” https://opensearch.org/blog/hybrid-search-optimization/
- Hybrid Optimizer in Search Relevance Workbench https://docs.opensearch.org/latest/search-plugins/search-relevance/optimize-hybrid-search/
- RFC for Dynamic Hybrid Search Optimization in Search Relevance [RFC] Dynamic Hybrid Search Optimization #206
- RFC for Hybrid Optimizer in neural search [RFC] Hybrid Search Optimizer neural-search#934
Feedback Required
Feature engineering priorities: what query and result features would be most valuable for your use cases?
We've identified basic query features (length, token count, special characters) and search result features (BM25 scores, semantic similarities) for the initial framework. However, different domains likely benefit from different feature sets.
- What domain-specific features have you found effective for search relevance?
- Are there query characteristics (e.g., intent classification, entity recognition) that significantly impact optimal weight selection in your applications?
- How do you balance feature richness against inference latency requirements?
Query text consistency requirements: is the requirement for identical query text across sub-queries too restrictive for your hybrid search implementations?
Our current design requires that all sub-queries (lexical, semantic, etc.) use identical query text to enable consistent feature extraction. This simplifies the initial framework but may limit real-world applicability.
- Do your hybrid queries typically use the same text across sub-queries, or do you often modify text for different query types?
- Would support for query text variations (with more complex feature extraction) be worth the added implementation complexity?
- Are there alternative approaches to feature extraction that could handle query text differences while maintaining prediction accuracy?