Scheduler for regular search evaluation runs #220

ajleong623 · 2025-08-12T23:22:08Z

Description

To request a regularly scheduled search evaluation, the user can add a cron parameter that specifies the cron schedule for running the search evaluation.
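As a rough illustration of what such a cron parameter might contain: the description does not say which cron dialect is accepted, so the sketch below assumes the conventional five-field Unix format (minute, hour, day of month, month, day of week). `CronSketch` and `splitFields` are hypothetical names used only for this example and are not part of the plugin.

```java
import java.util.Arrays;
import java.util.List;

final class CronSketch {
    // Assumed field order of a conventional five-field Unix cron expression.
    static final List<String> FIELD_NAMES =
        List.of("minute", "hour", "day-of-month", "month", "day-of-week");

    /** Splits an expression on whitespace and checks the assumed five-field shape. */
    static List<String> splitFields(String cron) {
        List<String> fields = Arrays.asList(cron.trim().split("\\s+"));
        if (fields.size() != FIELD_NAMES.size()) {
            throw new IllegalArgumentException("expected 5 cron fields but got " + fields.size());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Under the assumed format, "0 2 * * *" means 02:00 every day.
        List<String> fields = splitFields("0 2 * * *");
        for (int i = 0; i < fields.size(); i++) {
            System.out.println(FIELD_NAMES.get(i) + " = " + fields.get(i));
        }
    }
}
```

This only checks the shape of the expression; it does not validate ranges or special characters, which a real cron parser would handle.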

This pull request adds 3 new APIs for interacting with scheduled experiments. The endpoints are experiment/<job_id>/schedule, which supports the GET and DELETE methods, and experiment/schedule, which supports the GET and POST methods.

There are 2 new indices, .scheduled-jobs and search-relevance-scheduled-experiment-history. The .scheduled-jobs index stores the currently running experiment schedules. The search-relevance-scheduled-experiment-history index stores the historical, timestamped experiment results produced by the scheduled job runner.

Unit and integration tests are provided. However, additions such as workload management, integration with alerting, and resource monitoring are not included in this pull request; I would like to add them in a future pull request.

Please let me know if there are any questions or concerns.

Issues Resolved

#213 #226

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Anthony Leong <[email protected]>

epugh · 2025-08-13T15:29:04Z

Post discussion with @wrigleyDan and @epugh, we are going to change direction a bit and make the API take in an ALREADY EXISTING Experiment ID, and use that (and its associated settings) to run the experiment every iteration.

Let's move to a cron pattern versus an interval.

We need to think about if we need a limit to how many experiments can be run...

epugh

Progress! We are now on the cron pattern. Now to think about nesting the API under the /experiment/{experiment_id}/schedule name space.

src/main/java/org/opensearch/searchrelevance/common/PluginConstants.java

src/main/java/org/opensearch/searchrelevance/dao/ScheduledJobsDao.java

src/main/java/org/opensearch/searchrelevance/rest/RestPostScheduledExperimentAction.java

src/main/java/org/opensearch/searchrelevance/rest/RestDeleteScheduledExperimentAction.java

src/main/java/org/opensearch/searchrelevance/rest/RestPutExperimentAction.java

src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobRunner.java

ajleong623 · 2025-09-01T06:49:55Z

I believe I have addressed the comments. For one of them, I added a TODO comment so that it can be addressed in the future, because refactoring the logic for running experiments is a bit involved right now.

epugh · 2025-09-02T18:11:34Z

You now just need to add something to highlight this new feature in the changelog!

https://github.com/opensearch-project/search-relevance/blob/main/CHANGELOG.md#features

martin-gaievski

Thank you for addressing the comments. Some areas still need improvement:

- empty finally blocks: no resource cleanup implemented
- nested async operations lack individual timeouts
- no mechanism to cancel in-progress experiments during timeout

martin-gaievski · 2025-09-08T16:02:28Z

src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobRunner.java

+            FutureUtils.cancel(searchEvaluationTask); // Attempt to interrupt the running task
+        } catch (InterruptedException | ExecutionException e) {
+            log.error("Interrupt for scheduled experiment has occured!");
+        } finally {}


We're missing resource cleanup in the finally block. You need to do something like this:

if (currentExperimentTask != null && !currentExperimentTask.isDone()) {
    currentExperimentTask.cancel(true);
}
manager.cleanupResources(parameter.getExperimentId());

I think I fixed this issue. For handling cleanups, I update the experiment result that was registered to the new async status of TIMEOUT.

martin-gaievski · 2025-09-08T16:05:39Z

src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobRunner.java

+        // TODO: A lot of the logic here is reused from PutTransportExperiment.
+        // Eventually we have to abstract it in another class to reduce complexity.
+
+        Runnable runnable = () -> {


you can implement timeout with a simple future wrapper task, something like:

public <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long timeoutSeconds) {
    CompletableFuture<T> timeoutFuture = new CompletableFuture<>();
    ScheduledFuture<?> timeout = scheduler.schedule(() -> {
        if (timeoutFuture.cancel(false)) {
            future.cancel(true);
        }
    }, timeoutSeconds, TimeUnit.SECONDS);
    // complete when original completes
    future.whenComplete((result, throwable) -> {
        timeout.cancel(false); // Cancel timeout
        if (throwable == null) {
            timeoutFuture.complete(result);
        } else {
            timeoutFuture.completeExceptionally(throwable);
        }
    });
    return timeoutFuture;
}

martin-gaievski · 2025-09-08T16:11:10Z

src/main/java/org/opensearch/searchrelevance/executors/ExperimentRunningManager.java

+        // If any of the futures fails, the exception would be handled
+        // in the logic of that future. Therefore, no action for failure
+        // is necessary here.
+        CompletableFuture.allOf(configFutures.toArray(new CompletableFuture[0])).join();


looks like here you don't have timeout control for individual operations, this could lead to resource leaks in high-load scenarios: main task will be interrupted but nested task will keep running. You can solve it by adding timeout wrapper to all async operations.

I added the timeout wrapper. The only issue is when I looked at FutureUtils, the API does not allow threads to be interrupted. However, not interrupting the threads defeats the purpose of the timeout.
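The concern about cancellation without interruption can be reproduced with plain `java.util.concurrent`, independent of FutureUtils. The sketch below relies on the documented behavior of `CompletableFuture.cancel`, whose `mayInterruptIfRunning` argument has no effect: the future is marked cancelled, but the task body keeps running. `CancelWithoutInterruptDemo` is a hypothetical name, and the latch stands in for long-running work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancelWithoutInterruptDemo {
    /** Returns true if the task body ran to completion even though its future was cancelled. */
    static boolean taskOutlivesCancel() {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        AtomicBoolean finished = new AtomicBoolean(false);

        CompletableFuture<Void> future = CompletableFuture.runAsync(() -> {
            started.countDown();
            try {
                release.await();          // stands in for long-running work
                finished.set(true);       // reached only if the thread was not interrupted
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done.countDown();
            }
        });

        try {
            started.await();              // make sure the task is actually running
            future.cancel(true);          // marks the future cancelled; the worker is not interrupted
            release.countDown();          // let the simulated work continue
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return future.isCancelled() && finished.get();
    }

    public static void main(String[] args) {
        System.out.println("task finished after cancel: " + taskOutlivesCancel());
    }
}
```

This is why a timeout that only cancels the future does not by itself stop the underlying work; the task has to notice the cancellation on its own.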

ajleong623 · 2025-09-09T07:19:28Z

@martin-gaievski one last issue is that cancelling on timeout with thread interruption is not allowed. Does this mean that my solution would have to involve adding an atomic boolean that denotes whether the task was cancelled and then checking it at key points? This might not be sufficient for cancelling long-running calculations such as the hybrid optimizer one. Additionally, I think I would have to make sure all background tasks, such as the asynchronous pointwise experiment processing, are cancelled manually. Please let me know of any feedback on the ideas I mentioned.

ajleong623 · 2025-09-09T07:24:12Z

Just a personal reminder to add documentation comments throughout code changes before submitting.

martin-gaievski · 2025-09-11T16:42:02Z

> @martin-gaievski one last issue is that cancelling on timeout with thread interruption is not allowed. Does this mean that my solution would have to involve adding an atomic boolean that denotes whether the task was cancelled and then checking it at key points? This might not be sufficient for cancelling long-running calculations such as the hybrid optimizer one. Additionally, I think I would have to make sure all background tasks, such as the asynchronous pointwise experiment processing, are cancelled manually. Please let me know of any feedback on the ideas I mentioned.

@ajleong623, excellent question about the thread interruption limitations. You're absolutely right to be concerned about this, especially for long-running operations like the hybrid optimizer. Let me provide some guidance on the best approach here.
Your intuition about using an AtomicBoolean for cancellation is spot-on. This is indeed the recommended pattern in OpenSearch for handling timeouts without thread interruption. Here's my suggested implementation strategy:

start from introducing the cancellation token approach/pattern:

public class ExperimentCancellationToken {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);
    private final List<Runnable> cancellationCallbacks = new CopyOnWriteArrayList<>();
    
    public boolean isCancelled() {
        return cancelled.get();
    }
    
    public void cancel() {
        if (cancelled.compareAndSet(false, true)) {
            cancellationCallbacks.forEach(Runnable::run);
        }
    }
    
    public void onCancel(Runnable callback) {
        cancellationCallbacks.add(callback);
        if (isCancelled()) {
            callback.run();
        }
    }
}
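To make the token's behavior concrete, here is a small self-contained sketch. It nests a copy of the class above so it compiles on its own; `CancellationTokenDemo` and `run` are hypothetical names. It shows that callbacks registered before cancellation run once, that a second `cancel()` is a no-op, and that a callback registered after cancellation runs immediately.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class CancellationTokenDemo {
    // Copy of the token sketched above, nested so this example is self-contained.
    static final class ExperimentCancellationToken {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        private final List<Runnable> cancellationCallbacks = new CopyOnWriteArrayList<>();

        boolean isCancelled() {
            return cancelled.get();
        }

        void cancel() {
            // compareAndSet guarantees only the first cancel() runs the callbacks
            if (cancelled.compareAndSet(false, true)) {
                cancellationCallbacks.forEach(Runnable::run);
            }
        }

        void onCancel(Runnable callback) {
            cancellationCallbacks.add(callback);
            if (isCancelled()) {
                callback.run();
            }
        }
    }

    /** Returns {runs of the early callback, runs of the late callback}. */
    static int[] run() {
        ExperimentCancellationToken token = new ExperimentCancellationToken();
        AtomicInteger early = new AtomicInteger();
        AtomicInteger late = new AtomicInteger();

        token.onCancel(early::incrementAndGet); // registered before cancellation
        token.cancel();                         // runs the early callback
        token.cancel();                         // no-op: already cancelled
        token.onCancel(late::incrementAndGet);  // registered after cancellation, runs immediately

        return new int[] { early.get(), late.get() };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("early=" + counts[0] + ", late=" + counts[1]);
    }
}
```

One caveat worth keeping in mind: if `onCancel` and `cancel` race on different threads, a callback can be added just before `cancel()` iterates the list and then be run a second time by `onCancel`'s own `isCancelled()` check, so callbacks should be idempotent.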

identify check points in long-running operations

For the hybrid optimizer and other long-running calculations, you'll need to add cancellation check points at strategic locations:

// In HybridOptimizerExperimentProcessor
public void processHybridOptimizerExperiment(..., ExperimentCancellationToken cancellationToken) {
    for (String queryText : queryTexts) {
        // check before each query processing
        if (cancellationToken.isCancelled()) {
            handleCancellation();
            return;
        }
        
        for (SearchConfiguration config : configurations) {
            if (cancellationToken.isCancelled()) {
                handleCancellation();
                return;
            }
            // process configuration...
        }
    }
}
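The effect of check point granularity can be seen in a stripped-down loop. This is a minimal sketch, assuming a plain `AtomicBoolean` in place of the token and a hypothetical `cancelAfter` parameter that simulates the timeout firing partway through; none of these names come from the plugin.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

final class CheckpointLoopDemo {
    /**
     * Processes queries until the cancellation flag is observed at a check point.
     * Returns how many queries were processed.
     */
    static int processUntilCancelled(List<String> queryTexts, AtomicBoolean cancelled, int cancelAfter) {
        int processed = 0;
        for (String queryText : queryTexts) {
            // check point: cancellation is only noticed here, between units of work
            if (cancelled.get()) {
                break;
            }
            processed++; // stands in for evaluating queryText
            if (processed == cancelAfter) {
                cancelled.set(true); // simulate the timeout firing during this unit of work
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        List<String> queries = List.of("q1", "q2", "q3", "q4", "q5");
        int processed = processUntilCancelled(queries, new AtomicBoolean(false), 2);
        System.out.println("processed " + processed + " of " + queries.size());
    }
}
```

The unit of work that is in flight when the flag flips still runs to completion, so the worst-case delay before the work stops is the duration of the longest stretch between two check points.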

- special manual handling for async operations
async operations, like pointwise experiments, may need manual cancellation. I suggest:
```java
// part of ExperimentRunningManager
private final Map<String, List<CompletableFuture<?>>> runningFutures = new ConcurrentHashMap<>();

public void startExperimentRun(String experimentId, PutExperimentRequest request, ExperimentCancellationToken token) {
    List<CompletableFuture<?>> futures = new ArrayList<>();

    // register cancellation callback
    token.onCancel(() -> {
        futures.forEach(f -> f.cancel(false));
        runningFutures.remove(experimentId);
    });

    // track all async operations
    CompletableFuture<QuerySet> querySetFuture = fetchQuerySetAsync(...);
    futures.add(querySetFuture);
    runningFutures.put(experimentId, futures);

    // continue experiment...
}
```

integrate cancellation token into ConcurrencyUtil. For instance, ConcurrencyUtil.withTimeout is a good place for such integration:

public static <T> CompletableFuture<T> withTimeout(
    CompletableFuture<T> future, 
    long timeoutSeconds, 
    ThreadPool threadPool,
    ExperimentCancellationToken cancellationToken) {
    
    CompletableFuture<T> timeoutFuture = new CompletableFuture<>();
    
    ScheduledFuture<?> timeout = threadPool.scheduler().schedule(() -> {
        cancellationToken.cancel();  // Signal cancellation instead of interrupting
        timeoutFuture.completeExceptionally(new TimeoutException());
    }, timeoutSeconds, TimeUnit.SECONDS);
    
    ...
}
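The elided part of the snippet above is left as written. For readers who want something runnable, here is one possible standalone variant. It is a sketch under stated assumptions, not the plugin's ConcurrencyUtil: it uses a plain `ScheduledExecutorService` instead of the OpenSearch `ThreadPool`, an `AtomicBoolean` instead of `ExperimentCancellationToken`, and milliseconds instead of seconds. `TimeoutWithTokenSketch` is a hypothetical name.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

final class TimeoutWithTokenSketch {
    /** Wraps a future so that a timeout sets the cancellation flag instead of interrupting a thread. */
    static <T> CompletableFuture<T> withTimeout(
        CompletableFuture<T> future,
        long timeoutMillis,
        ScheduledExecutorService scheduler,
        AtomicBoolean cancelled) {

        CompletableFuture<T> timeoutFuture = new CompletableFuture<>();

        ScheduledFuture<?> timeout = scheduler.schedule(() -> {
            cancelled.set(true); // signal cancellation; running code must poll this flag
            timeoutFuture.completeExceptionally(new TimeoutException("timed out"));
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        // Propagate the original outcome and drop the pending timeout if it has not fired.
        // Plain Future#cancel(false) is used here; a build with forbidden-API checks may require a helper instead.
        future.whenComplete((result, throwable) -> {
            timeout.cancel(false);
            if (throwable == null) {
                timeoutFuture.complete(result);
            } else {
                timeoutFuture.completeExceptionally(throwable);
            }
        });
        return timeoutFuture;
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean cancelled = new AtomicBoolean(false);
        CompletableFuture<String> never = new CompletableFuture<>(); // never completes on its own
        try {
            withTimeout(never, 50, scheduler, cancelled).join();
        } catch (Exception e) {
            System.out.println("failed with " + e.getCause() + ", cancelled=" + cancelled.get());
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

Setting the flag before completing the wrapper means any caller that observes the timeout also observes the cancellation signal; the trade-off is that a future finishing at the same instant as the timeout may still see the flag set.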

Some other important considerations:

- granularity of checks: for the hybrid optimizer, consider adding checks:
  - before/after each query evaluation
  - inside any loops that process multiple configurations
  - before expensive operations (network calls, large computations)
- ensure all resources (connections, temporary data) are properly cleaned up when cancellation occurs.
- consider whether to save partial results when an experiment is cancelled due to timeout.
- add unit tests that specifically test the cancellation behavior at various points in the execution.

One alternative to the cancellation token approach is chunking. For a very long-running operation, we may want to break it into smaller chunks. To me this is more complex than tokens, mainly due to uncertainty about how exactly to break up the task and how to handle partial results.

Overall, your proposed solution with AtomicBoolean is the right direction. Great work on identifying this important consideration.

fen-qin · 2025-09-15T18:40:10Z

src/main/java/org/opensearch/searchrelevance/utils/ConcurrencyUtil.java

+
+        ScheduledFuture<?> timeout = threadPool.scheduler().schedule(() -> {
+            if (timeoutFuture.cancel(false)) {
+                future.cancel(true);


please resolve the build failure

Forbidden method invocation: java.util.concurrent.Future#cancel(boolean) [Don't interrupt threads use FutureUtils#cancel(Future<T>) instead]

sample code:

// Before:
// future.cancel(true);
// After:
FutureUtils.cancel(future);

Hi, I looked into the FutureUtils implementation and noticed that the thread cannot be interrupted on cancel, which means the long-running task might still be running after the cancel. Martin and I have been working on a workaround for that issue, and I will use the proper API.

ajleong623 · 2025-09-17T00:41:16Z

@martin-gaievski SearchRelevanceJobRunner is where the scheduler starts. In line 95, when creating the future with timeout, I attached a countdown latch to indicate when all the asynchronous results finish. This either happens during updateFinalExperiment or handleAsyncFailure in ExperimentRunningManager. That way, we can wait on that latch and only clean up after all the asynchronous operations have finished. I noticed the code flow of completing the task itself is actually much faster because once a future is scheduled, the code moves on. Therefore, the asynchronous tasks could still be running even after the task completes or times out. Additionally, I took your suggestion of grouping the futures in a map called runningFutures so that they can cancel right when the cancellation token is cancelled.
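The latch idea in the paragraph above can be sketched in isolation. This is not the code from SearchRelevanceJobRunner or ExperimentRunningManager; it is a minimal illustration, with hypothetical names, of counting down a `CountDownLatch` when each asynchronous task ends (successfully or not) and only cleaning up once the latch opens or a bounded wait expires.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class LatchCleanupSketch {
    /** Starts async tasks, waits (bounded) for all of them to end, then records a cleanup event. */
    static List<String> runAndCleanUp(int tasks, long waitMillis) {
        List<String> events = new CopyOnWriteArrayList<>();
        CountDownLatch allDone = new CountDownLatch(tasks);

        for (int i = 0; i < tasks; i++) {
            CompletableFuture
                .runAsync(() -> events.add("task"))                  // stands in for async experiment work
                .whenComplete((result, error) -> allDone.countDown()); // counts down on success or failure
        }

        // Scheduling returns immediately; the latch is what tells us the work has actually ended.
        boolean finished;
        try {
            finished = allDone.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            finished = false;
        }
        events.add(finished ? "cleanup" : "cleanup-after-timeout");
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runAndCleanUp(3, 5000));
    }
}
```

Counting down in `whenComplete` rather than inside the task body keeps the latch accurate even when a task throws, which mirrors the two completion paths (final update or async failure) described above.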

In ScheduledExperimentRunnerManager, the placeholder for the ScheduledExperimentResult is created and placed into the index. The checks are at lines 89 and 95 (before and after the put).

In ExperimentRunningManager, the query set is first fetched, then search configurations are fetched, and finally for all the queries in the query set, one of the experiments is run. The checks are in lines 145 (adding async futures to be cancelled), 209 (for each search configuration fetch), 345 (each experiment evaluation loop around query text), and 443 (right before results are processed for each evaluation).

In HybridOptimizerExperimentProcessor the only check is in line 243 which is where the loop for scheduling a variant set for each search configuration is processed.

In ExperimentTaskManager, line 257 (for each time an experiment is scheduled), 291/299 (before and after submitting into the threadpool), and 326 (scheduling each variant asynchronously) are where the checks are available. The most important checks are around 291 and 299 because the tasks submitted to the threadpool in hybrid optimization are the longest running.

For cleanup, I handled each case similarly to how failure detection is handled. However, in the final cleanup in SearchRelevanceJobRunner line 113, the scheduled experiment result is simply updated. I do not know about temporary data being created, and index connections are not interrupted.

I also have not handled partial results; the result will be null if a timeout occurs.

Let me know if you have any questions or comments. I need another look because I have been working on this for a while, and my brain is currently fried.

src/main/java/org/opensearch/searchrelevance/plugin/SearchRelevancePlugin.java

src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobParameters.java

epugh · 2025-10-09T19:00:26Z

src/main/java/org/opensearch/searchrelevance/dao/ScheduledJobsDao.java

+    /**
+     * List scheduled jobs by source builder
+     * @param sourceBuilder - source builder to be searched
+     * @param listener - action lister for async operation


Suggested change

* @param listener - action lister for async operation

* @param listener - action listener for async operation

epugh · 2025-10-09T19:02:18Z

src/main/java/org/opensearch/searchrelevance/executors/ExperimentRunningManager.java

+import lombok.extern.log4j.Log4j2;
+
+/**
+ * ExperimentRunningManager helps isolate the logic for running the logic in


slight awk phrasing.

ajleong623 · 2025-10-09T19:33:03Z

@martin-gaievski @fen-qin I think I am ready for the next round of code reviews as I believe I addressed the comments mentioned prior. Please let me know about any other suggestions or concerns.

add job scheduler

61b6a6a

Signed-off-by: Anthony Leong <[email protected]>

ajleong623 marked this pull request as draft August 12, 2025 23:22

ajleong623 added 2 commits August 12, 2025 16:36

add job scheduler plugin

ade1192

Signed-off-by: Anthony Leong <[email protected]>

fixed pairwise error

50319e4

Signed-off-by: Anthony Leong <[email protected]>

epugh linked an issue Aug 13, 2025 that may be closed by this pull request

[FEATURE] Scheduling for running evaluations regularly #213

Open

epugh added the v3.3.0 label Aug 13, 2025

added actions for scheduling and deleting jobs, validations, and new …

b554ea6

…jobs index Signed-off-by: Anthony Leong <[email protected]>

epugh previously requested changes Aug 20, 2025

ajleong623 mentioned this pull request Aug 20, 2025

[RFC] Running Regularly Scheduled Search Evaluations Design #226

Open

ajleong623 and others added 5 commits August 20, 2025 16:43

added initial draft of technical design

8309d93

Signed-off-by: Anthony Leong <[email protected]>

made changes based on small suggestions

4b5340a

Signed-off-by: Anthony Leong <[email protected]>

Apply suggestions from code review

7f6352d

Co-authored-by: Eric Pugh <[email protected]> Signed-off-by: Anthony Leong <[email protected]>

made changes based on small suggestions

b73746a

Signed-off-by: Anthony Leong <[email protected]>

Revert "Apply suggestions from code review"

fd62b14

This reverts commit 7f6352d. Signed-off-by: Anthony Leong <[email protected]>

ajleong623 force-pushed the job-scheduler branch from 96e8016 to fd62b14 Compare August 21, 2025 07:03

ajleong623 added 6 commits August 21, 2025 00:07

reapply changes from suggestion

d3d053a

Signed-off-by: Anthony Leong <[email protected]>

add new persistent index and modified request url

d338bf4

Signed-off-by: Anthony Leong <[email protected]>

still need to add integration tests

bb3426d

Signed-off-by: Anthony Leong <[email protected]>

finished all integration tests

03725fa

Signed-off-by: Anthony Leong <[email protected]>

update gradle build file

efdc0c6

Signed-off-by: Anthony Leong <[email protected]>

update design document

95581cb

Signed-off-by: Anthony Leong <[email protected]>

ajleong623 marked this pull request as ready for review September 1, 2025 06:49

ajleong623 added 3 commits September 1, 2025 00:00

update build file

a547b68

Signed-off-by: Anthony Leong <[email protected]>

update build.gradle

fde4c04

Signed-off-by: Anthony Leong <[email protected]>

yamlRestTest dependencies installed

8a1efcb

Signed-off-by: Anthony Leong <[email protected]>

ajleong623 added 2 commits September 2, 2025 11:46

Merge branch 'opensearch-project:main' into job-scheduler

827acdd

add changelog line

51de07a

Signed-off-by: Anthony Leong <[email protected]>

martin-gaievski reviewed Sep 8, 2025

scheduled experiment concurrency with timeout and cleanup is now ready

df91219

Signed-off-by: Anthony Leong <[email protected]>

fen-qin reviewed Sep 15, 2025

added tests for concurrency, timeout mechanism, and also async timeou…

66b5489

…t for async tasks Signed-off-by: Anthony Leong <[email protected]>

epugh reviewed Sep 17, 2025

src/main/java/org/opensearch/searchrelevance/plugin/SearchRelevancePlugin.java (outdated)

epugh reviewed Sep 17, 2025

src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobParameters.java (outdated)

ajleong623 added 2 commits September 17, 2025 11:54

add more comments and documentations

ed511ac

Signed-off-by: Anthony Leong <[email protected]>

cleaned up deleted job scheduled

031b97b

Signed-off-by: Anthony Leong <[email protected]>

epugh added v3.4.0 and removed v3.3.0 labels Sep 19, 2025

ajleong623 and others added 13 commits September 20, 2025 12:07

added scheduled parameter to experiment and cleanup scheduled experim…

f19c375

…ents on underlying experiment deletion Signed-off-by: Anthony Leong <[email protected]>

help fix forbidden apis

9941bd7

Signed-off-by: Anthony Leong <[email protected]>

retry tests

7aa1c32

Signed-off-by: Anthony Leong <[email protected]>

Merge branch 'opensearch-project:main' into job-scheduler

e2cc6c1

Couple of text tweaks...

1a404f0

increase timeout to one hour for production

fc9ab7f

Signed-off-by: Anthony Leong <[email protected]>

Merge branch 'job-scheduler' of https://github.com/ajleong623/search-…

7ef0e60

…relevance into job-scheduler

fix timeout test

eaabfd1

Signed-off-by: Anthony Leong <[email protected]>

update action names

3e514e4

Signed-off-by: Anthony Leong <[email protected]>

reenable neural search

24852eb

Signed-off-by: Anthony Leong <[email protected]>

update scheduled run id value

dbf5f73

Signed-off-by: Anthony Leong <[email protected]>

reenable ml plugin

bfc13cc

Signed-off-by: Anthony Leong <[email protected]>

cleanup unused constants

1a27b3d

Signed-off-by: Anthony Leong <[email protected]>

epugh reviewed Oct 9, 2025
