Background:
Currently, in OpenSearch Search Relevance Workbench, users can run an experiment to evaluate search quality. However, this process is manual, and there is no built-in way to schedule these search evaluations. With dynamically changing data, search quality can change, and it is important for users to be notified if it drops below a certain threshold. The first step before enabling such notifications is to automate search evaluations. We propose enabling search evaluations to run regularly on a schedule, so that users have evaluation results published at defined intervals.
Issue:
Currently, users cannot automatically schedule their search experiments. When a search experiment is run, the results and configurations are stored in the workbench and can be accessed later; however, the results are limited to that point in time. While users could rerun the search evaluation by resubmitting an experiment with the original parameters, that process is manual and impractical to repeat on a schedule.
There is also no alerting system to let users know when their search quality drops, and while the details of that system are out of scope for this RFC, automating the evaluation schedule is the first step to monitoring search quality.
Proposed Solution:
Our proposed solution is to use the job-scheduler plugin and apply its framework to the search relevance plugin.
Job Scheduler Architecture:
flowchart LR
    JobIndex[Job Index] -->|A job is indexed| Sweeper(Job Sweeper)
    Sweeper -->|Index changes are detected and the runner is invoked| Runner(Job Runner)
    Runner -->|A lock is acquired before the job is submitted to the thread pool| LockingService[Locking Service]
    LockingService -->|The job is submitted to the thread pool| Threadpool[Thread Pool]
The main components of the job scheduler are the core scheduler, job index, job parameters, job runner, the schedule, the thread pool, and the locking service.
- The core scheduler is the main component that runs the jobs. It is responsible for both submitting and deleting jobs. It holds a mapping to the jobs in the system and has its own thread pool to submit tasks at a certain time.
- The job index stores all the jobs currently in the cluster. Whenever a job document is indexed, it is scheduled and will run at its next scheduled time. This works because the job sweeper listens for changes on the job index, so once a job document is indexed it is automatically submitted to the scheduler.
- The job parameters hold the information needed to run a job. This is an extensible interface, so we can add our own parameters to describe the specific job we want to run.
- The job runner specifies the action that should be run periodically. When runJob is called, it submits the job to the thread pool, where it is executed. We can extend the runJob method to customize how the job is run and how the incoming job parameters are handled.
- The schedule specifies when the job should run. It takes our scheduling input and converts it to either an interval or a cron schedule. For our purposes, we would use a cron schedule so that jobs run at a standardized time, rather than at times determined by when the job was submitted, as would happen with an interval schedule (see the sketch after this list).
- The thread pool holds the executors of the tasks within the system and is initialized when the plugin is initialized. Its two main purposes are to execute the job itself and to schedule the job for future execution. By default, the current job scheduler uses 200 processors; however, there are concerns about how much CPU the thread pool should be allowed to consume. Additionally, when multiple jobs happen to be scheduled at the same time, CPU bursts could occur because the CPU is over-utilized at a single point in time.
- The locking service ensures that duplicates of the same job do not run at the same time. Before a job runs, it acquires a lock based on the job index and the job id (the submitted job has a fixed id). Jobs that cannot acquire a lock are simply dropped. Without a locking service, the same job could be rescheduled before the previous run has finished; the scheduler would keep adding copies of the same job, CPU resources would be consumed, running jobs would slow down, and eventually the thread pool could be overwhelmed and the process could crash.
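As a minimal sketch of the schedule piece mentioned above, assuming the job-scheduler SPI class org.opensearch.jobscheduler.spi.schedule.CronSchedule and its expression/time-zone constructor (the helper class and method names here are illustrative only):

```java
import java.time.ZoneId;

import org.opensearch.jobscheduler.spi.schedule.CronSchedule;
import org.opensearch.jobscheduler.spi.schedule.Schedule;

public final class ScheduleFactory {
    /**
     * Converts a user-supplied cron expression into a job-scheduler Schedule.
     * A cron schedule fires at fixed wall-clock times, instead of being anchored
     * to the moment the job was submitted, as an interval schedule would be.
     */
    public static Schedule fromCronExpression(String cronExpression, String timezone) {
        // An invalid expression is expected to be rejected here by the underlying
        // cron parser, which also supports the early-validation point raised below.
        return new CronSchedule(cronExpression, ZoneId.of(timezone));
    }
}
```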
Changes to be made to interfaces
Some of the extensible objects we must consider implementing are the job parameters, the plugin object (especially for the initialization methods), and the job runner.
- For the job parameters, we want to make sure that, on top of the base parameters provided, we also have a specific experiment id to reference and rerun.
import java.time.Instant;
import org.opensearch.jobscheduler.spi.ScheduledJobParameter;
import org.opensearch.jobscheduler.spi.schedule.Schedule;

public class SearchRelevanceJobParameters implements ScheduledJobParameter {
    // Base parameters required by the job-scheduler framework
    private String jobName;
    private Instant lastUpdateTime;
    private Instant enabledTime;
    private boolean isEnabled;
    private Schedule schedule;
    private String indexToWatch;
    private Long lockDurationSeconds;
    private Double jitter;
    // Search-relevance-specific parameter: the experiment to reference and rerun
    private String experimentId;

    // Getters and the XContent serialization required by ScheduledJobParameter are omitted for brevity
}
Here, we made sure that a specific experiment can be referenced for scheduling after the job is submitted.
- We want the SearchRelevancePlugin to implement the JobSchedulerExtension interface and initialize the job runner with the necessary parameters. Additionally, we need to implement the getJobParser method so that job documents stored in the job index can be parsed back into job parameters by the scheduler running in the core of the job-scheduler plugin (a sketch of this hookup follows this list).
- We want to implement the job runner. On initialization of the plugin, the job runner is initialized with the required components. Since we simply want to rerun an experiment given its id, we first need access to the experiment index so that the parameters the experiment used to run, such as the query and search configurations, are available. To do this, we should add the experimentDao as an attribute of the job runner. Next, we simply want to rerun the experiment. One idea is to reuse the logic from the executeExperimentEvaluation method in PutExperimentTransportAction. To do this, we would have to include the processors required by the different experiment types, such as MetricsHelper, HybridOptimizedExperimentProcessor, and PointwiseExperimentProcessor. However, simply copying that logic does not help the extensibility of the code if we want to add more experiment types; therefore, we should move that logic into a helper class (see the sketches below).
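To make the extension points above more concrete, below is a rough sketch of the JobSchedulerExtension hookup. The job type string, the SCHEDULED_JOB_INDEX constant, and SearchRelevanceJobParameters.parse are illustrative placeholders, not existing plugin code, and the exact SPI signatures should be verified against the job-scheduler version being targeted.

```java
import org.opensearch.jobscheduler.spi.JobSchedulerExtension;
import org.opensearch.jobscheduler.spi.ScheduledJobParser;
import org.opensearch.jobscheduler.spi.ScheduledJobRunner;
import org.opensearch.plugins.Plugin;

public class SearchRelevancePlugin extends Plugin implements JobSchedulerExtension {

    // Index that the job sweeper watches for scheduled experiment jobs (placeholder name)
    public static final String SCHEDULED_JOB_INDEX = ".scheduled-jobs";

    @Override
    public String getJobType() {
        return "search_relevance_scheduled_experiment";
    }

    @Override
    public String getJobIndex() {
        return SCHEDULED_JOB_INDEX;
    }

    @Override
    public ScheduledJobRunner getJobRunner() {
        // Singleton runner, initialized in createComponents with the experiment DAO and helpers
        return SearchRelevanceJobRunner.getInstance();
    }

    @Override
    public ScheduledJobParser getJobParser() {
        // Parses a job document from the job index back into our job parameters
        // so the core scheduler can schedule and run it.
        return (parser, id, jobDocVersion) -> SearchRelevanceJobParameters.parse(parser);
    }
}
```

And a similarly hedged sketch of the runner itself, where ExperimentDao and ExperimentExecutionHelper stand in for the helper class proposed above; the lock handling follows the acquire/release pattern of the job-scheduler LockService, though package locations may differ by version:

```java
import org.opensearch.core.action.ActionListener;
import org.opensearch.jobscheduler.spi.JobExecutionContext;
import org.opensearch.jobscheduler.spi.ScheduledJobParameter;
import org.opensearch.jobscheduler.spi.ScheduledJobRunner;
import org.opensearch.threadpool.ThreadPool;

public class SearchRelevanceJobRunner implements ScheduledJobRunner {

    private static final SearchRelevanceJobRunner INSTANCE = new SearchRelevanceJobRunner();

    private ExperimentDao experimentDao;               // access to the stored experiment configurations
    private ExperimentExecutionHelper executionHelper; // extracted evaluation logic (proposed helper class)
    private ThreadPool threadPool;

    public static SearchRelevanceJobRunner getInstance() {
        return INSTANCE;
    }

    // Called from the plugin's initialization once the DAO and helpers exist.
    public void initialize(ExperimentDao experimentDao, ExperimentExecutionHelper executionHelper, ThreadPool threadPool) {
        this.experimentDao = experimentDao;
        this.executionHelper = executionHelper;
        this.threadPool = threadPool;
    }

    @Override
    public void runJob(ScheduledJobParameter jobParameter, JobExecutionContext context) {
        SearchRelevanceJobParameters parameters = (SearchRelevanceJobParameters) jobParameter;
        // Acquire a lock keyed on the job index and job id so the same experiment
        // is never evaluated twice concurrently; runs that cannot get the lock are dropped.
        context.getLockService().acquireLock(jobParameter, context, ActionListener.wrap(lock -> {
            if (lock == null) {
                return; // a previous run still holds the lock
            }
            threadPool.generic().submit(() -> {
                try {
                    // Look up the stored experiment by id and rerun its evaluation.
                    executionHelper.rerunExperiment(parameters.getExperimentId());
                } finally {
                    context.getLockService().release(lock, ActionListener.wrap(released -> {}, e -> {}));
                }
            });
        }, e -> { /* lock acquisition failed; skip this run */ }));
    }
}
```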
Experiments that can be rerun
Currently there are 3 types of experiments in SearchRelevance: the pointwise experiment, the pairwise experiment, and the hybrid optimization experiment. All of them are evaluated through the same path in PutExperimentTransportAction, so we could schedule all three types of experiments to run regularly.
It is important to note that there will be at most one schedule per experiment. To enforce this, the id of the job will be identical to the id of the experiment it runs (as sketched below).
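A minimal sketch of that constraint, assuming the schedule is written to the job index with an IndexRequest (the index name and the helper class are placeholders, and XContent package locations may differ by OpenSearch version):

```java
import java.io.IOException;

import org.opensearch.action.index.IndexRequest;
import org.opensearch.common.xcontent.XContentFactory;
import org.opensearch.core.xcontent.ToXContent;

public final class ScheduleRequests {
    // Using the experiment id as the job document id means a second schedule request
    // for the same experiment overwrites the existing job document instead of creating
    // a duplicate, which enforces "at most one schedule per experiment".
    public static IndexRequest buildScheduleRequest(String experimentId, SearchRelevanceJobParameters jobParameters) throws IOException {
        return new IndexRequest(".scheduled-jobs")
            .id(experimentId) // job id == experiment id
            .source(jobParameters.toXContent(XContentFactory.jsonBuilder(), ToXContent.EMPTY_PARAMS));
    }
}
```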
New API details for scheduling, getting, and deleting jobs.
The endpoint /_plugins/_search_relevance/experiment/schedule should support three methods: POST, GET, and DELETE.
POST:
URL: /_plugins/_search_relevance/experiment/schedule
Sample request body scheduling at 1 AM daily:
{
  "experiment_id": "70dd4e5d-fcf0-424t-a215-2fetfgfler67",
  "cron_expression": "0 1 * * *"
}
Sample response:
{
  "job_id": "70dd4e5d-fcf0-424t-a215-2fetfgfler67",
  "job_result": "CREATED"
}
GET:
URL: /_plugins/_search_relevance/experiment/70dd4e5d-fcf0-424t-a215-2fetfgfler67/schedule
Sample response:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": ".scheduled-jobs",
        "_id": "70dd4e5d-fcf0-424t-a215-2fetfgfler67",
        "_score": 1.0,
        "_source": {
          "id": "70dd4e5d-fcf0-424t-a215-2fetfgfler67",
          "name": "experiment-parameters",
          "enabled": true,
          "schedule": {
            "cron": {
              "expression": "* * * * *",
              "timezone": "America/Los_Angeles"
            }
          },
          "indexNameToWatch": "index",
          "enabledTime": 1756690489946,
          "lastUpdateTime": 1756690489946,
          "lockDurationSeconds": 20,
          "experimentId": "70dd4e5d-fcf0-424t-a215-2fetfgfler67"
        }
      }
    ]
  }
}
DELETE:
URL: /_plugins/_search_relevance/experiment/{{job_id}}/schedule
Sample response:
{
  "successful": true
}
We should also have a feature flag like "plugins.search_relevance.job_scheduler_enabled" to control whether the API can be used.
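For example, the flag could be registered as a dynamic cluster setting; the setting below is the one proposed in this RFC, not something that exists yet:

```java
import org.opensearch.common.settings.Setting;

public final class SearchRelevanceSettings {
    // Proposed dynamic cluster setting gating the scheduling APIs; disabled by default.
    public static final Setting<Boolean> JOB_SCHEDULER_ENABLED = Setting.boolSetting(
        "plugins.search_relevance.job_scheduler_enabled",
        false,
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );
}
```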
Performance Concerns:
- One major concern is resource management. Since many schedules could be running at the same time, we want to make sure users can monitor their resources. One idea for managing this is sandboxing: allow users to place limits on how many resources the scheduled queries can use. Another idea is workload group management: create groups of tasks for job-scheduling requests and ensure the groups stay below resource constraints.
- Another concern is the potential for CPU bursts: all the scheduled jobs could fire at the same point in time and use up resources at once. One way to manage this is the built-in jitter job parameter, which introduces randomness into when the next run is scheduled. For example, a jitter value of 0.6 causes a job with an interval of 10 minutes to be delayed by somewhere between 0 and 6 minutes. This helps spread the jobs out, but the trade-off is that precision is lost. However, if a user simply needs a job to run approximately once a day and is not worried about precision, this delay should not be a serious concern.
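To illustrate the bound on that delay (this is just the arithmetic described above, not the job-scheduler's actual implementation):

```java
import java.util.Random;

public final class JitterExample {
    // A 10-minute interval with jitter 0.6 yields a delay somewhere between 0 and 6 minutes.
    public static long jitteredDelayMillis(long intervalMillis, double jitter, Random random) {
        long maxDelayMillis = (long) (intervalMillis * jitter);
        return maxDelayMillis <= 0 ? 0 : (long) (random.nextDouble() * maxDelayMillis);
    }
}
```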
Security Concerns:
- We must validate the submitted cron expression early.
- We must also include rate limiting so that users cannot submit too many scheduled evaluations.
- There should be a separate allocation of threads for the scheduler, because the thread pool will be shared with the rest of the search relevance plugin.
Potential extensions:
With regularly running evaluations enabled, we can automatically monitor search quality over time. There should also be a mechanism to enable alerting if a given metric falls below a threshold. We could use the opensearch-notifications or alerting plugins to help implement this.
Related issues & work:
Pull request: #220 (technical design document is included with the pull request)