-
Notifications
You must be signed in to change notification settings - Fork 21
Scheduler for regular search evaluation runs #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Post discussion with @wrigleyDan and @epugh we are going to change direction a bit and make the API take in an ALREADY EXISTING Experiment ID, and use that (and it's associated settings) to run the experiment every iteration. Let's move to a cron pattern versus a interval. We need to think about if we need a limit to how many experiments can be run... |
…jobs index Signed-off-by: Anthony Leong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Progress! We are now on the cron pattern. Now to think about nesting the API under the /experiment/{experiment_id}/schedule name space.
src/main/java/org/opensearch/searchrelevance/common/PluginConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/PluginConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/PluginConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/common/PluginConstants.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPostScheduledExperimentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestDeleteScheduledExperimentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/rest/RestPutExperimentAction.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobRunner.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobRunner.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Co-authored-by: Eric Pugh <[email protected]> Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
This reverts commit 7f6352d. Signed-off-by: Anthony Leong <[email protected]>
96e8016
to
fd62b14
Compare
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
I believe I have addressed the comments. One of them, I did add a TODO comment so that it can be addressed in the future. Right now, the solution to refactoring the logic of running experiments is a bit involved. |
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
You now just need to add soemthing to highlight this new Feature in the change log! https://github.com/opensearch-project/search-relevance/blob/main/CHANGELOG.md#features |
Signed-off-by: Anthony Leong <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing comments. Some area still need improvements:
- empty finally blocks - no resource cleanup implemented
- nested async operations lack individual timeouts
- no mechanism to cancel in-progress experiments during timeout
FutureUtils.cancel(searchEvaluationTask); // Attempt to interrupt the running task | ||
} catch (InterruptedException | ExecutionException e) { | ||
log.error("Interrupt for scheduled experiment has occured!"); | ||
} finally {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're missing resource cleanup in finally block. yopu need to do something like this:
if (currentExperimentTask != null && !currentExperimentTask.isDone()) {
currentExperimentTask.cancel(true);
}
manager.cleanupResources(parameter.getExperimentId());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I fixed this issue. For handling cleanups, I update the experiment result that was registered to the new async status of TIMEOUT.
// TODO: A lot of the logic here is reused from PutTransportExperiment. | ||
// Eventually we have to abstract it in another class to reduce complexity. | ||
|
||
Runnable runnable = () -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can implement timeout with a simple future wrapper task, something like:
public <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long timeoutSeconds) {
CompletableFuture<T> timeoutFuture = new CompletableFuture<>();
ScheduledFuture<?> timeout = scheduler.schedule(() -> {
if (timeoutFuture.cancel(false)) {
future.cancel(true);
}
}, timeoutSeconds, TimeUnit.SECONDS);
// complete when original completes
future.whenComplete((result, throwable) -> {
timeout.cancel(false); // Cancel timeout
if (throwable == null) {
timeoutFuture.complete(result);
} else {
timeoutFuture.completeExceptionally(throwable);
}
});
return timeoutFuture;
}
// If any of the futures fails, the exception would be handled | ||
// in the logic of that future. Therefore, no action for failure | ||
// is necessary here. | ||
CompletableFuture.allOf(configFutures.toArray(new CompletableFuture[0])).join(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like here you don't have timeout control for individual operations, this could lead to resource leaks in high-load scenarios: main task will be interrupted but nested task will keep running. You can solve it by adding timeout wrapper to all async operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the timeout wrapper. The only issue is when I looked at FutureUtils, the API does not allow threads to be interrupted. However, not interrupting the threads defeats the purpose of the timeout.
Signed-off-by: Anthony Leong <[email protected]>
@martin-gaievski one last issue is that cancelled timeouts with thread interruption is not allowed. Does this mean that my solution would have to involve adding an atomic boolean that denotes whether the task was cancelled and then checking it at key points? This might not be sufficient in cancelling long running calculations such as the hybrid optimizer one. Additionally, I think would have to make sure all background tasks such as the asynchronous point wise experiment processing are cancelled manually. Please let me know of any feedback on the ideas I mentioned. |
Just a personal reminder to add documentation comments throughout code changes before submitting. |
@ajleong623, excellent question about the thread interruption limitations. You're absolutely right to be concerned about this, especially for long-running operations like the hybrid optimizer. Let me provide some guidance on the best approach here. start from introducing the cancellation token approach/pattern: public class ExperimentCancellationToken {
private final AtomicBoolean cancelled = new AtomicBoolean(false);
private final List<Runnable> cancellationCallbacks = new CopyOnWriteArrayList<>();
public boolean isCancelled() {
return cancelled.get();
}
public void cancel() {
if (cancelled.compareAndSet(false, true)) {
cancellationCallbacks.forEach(Runnable::run);
}
}
public void onCancel(Runnable callback) {
cancellationCallbacks.add(callback);
if (isCancelled()) {
callback.run();
}
}
}
For the hybrid optimizer and other long-running calculations, you'll need to add cancellation check points at strategic locations: // In HybridOptimizerExperimentProcessor
public void processHybridOptimizerExperiment(..., ExperimentCancellationToken cancellationToken) {
for (String queryText : queryTexts) {
// check before each query processing
if (cancellationToken.isCancelled()) {
handleCancellation();
return;
}
for (SearchConfiguration config : configurations) {
if (cancellationToken.isCancelled()) {
handleCancellation();
return;
}
// process configuration...
}
}
}
- special manual handling for asyn operations
async operations like pointwise experiments, may need manual cancellation. I suggest:
```java
// part ExperimentRunningManager
private final Map<String, List<CompletableFuture<?>>> runningFutures = new ConcurrentHashMap<>();
public void startExperimentRun(String experimentId, PutExperimentRequest request, ExperimentCancellationToken token) {
List<CompletableFuture<?>> futures = new ArrayList<>();
// register cancellation callback
token.onCancel(() -> {
futures.forEach(f -> f.cancel(false));
runningFutures.remove(experimentId);
});
// track all async operations
CompletableFuture<QuerySet> querySetFuture = fetchQuerySetAsync(...);
futures.add(querySetFuture);
runningFutures.put(experimentId, futures);
// continue experiment...
}
public static <T> CompletableFuture<T> withTimeout(
CompletableFuture<T> future,
long timeoutSeconds,
ThreadPool threadPool,
ExperimentCancellationToken cancellationToken) {
CompletableFuture<T> timeoutFuture = new CompletableFuture<>();
ScheduledFuture<?> timeout = threadPool.scheduler().schedule(() -> {
cancellationToken.cancel(); // Signal cancellation instead of interrupting
timeoutFuture.completeExceptionally(new TimeoutException());
}, timeoutSeconds, TimeUnit.SECONDS);
...
} Some other important considerations:
One alternative to cancellation token approach is chucking. For very long-running operation we may want to break it into smaller chunks. To me this is more complex then tokens, mainly due to uncertainty of how exactly break the task and how to handle partial results. Overall, your proposed solution with |
|
||
ScheduledFuture<?> timeout = threadPool.scheduler().schedule(() -> { | ||
if (timeoutFuture.cancel(false)) { | ||
future.cancel(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please resolve the build failure
Forbidden method invocation: java.util.concurrent.Future#cancel(boolean) [Don't interrupt threads use FutureUtils#cancel(Future<T>) instead]
sample code:
// Before:
// future.cancel(true);
// After:
FutureUtils.cancel(future);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I looked into the FutureUtils
implementation and noticed that the thread cannot be interrupted on cancel
which means the long-running task might still be running after cancel. Martin and I have been working on a workaround for that issue, and I will use the proper api.
…t for async tasks Signed-off-by: Anthony Leong <[email protected]>
@martin-gaievski In In In In For cleanup, I handled each case similarly to how failures detections are handled. However, in the final cleanup in I also have not handled partial results, but it will be null if timeout occurs Let me know if you have any questions or comments. I need another look because I have been working on this for a while, and my brain is currently fried. |
src/main/java/org/opensearch/searchrelevance/plugin/SearchRelevancePlugin.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/searchrelevance/scheduler/SearchRelevanceJobParameters.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
…ents on underlying experiment deletion Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
…relevance into job-scheduler
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
Signed-off-by: Anthony Leong <[email protected]>
/** | ||
* List scheduled jobs by source builder | ||
* @param sourceBuilder - source builder to be searched | ||
* @param listener - action lister for async operation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param listener - action lister for async operation | |
* @param listener - action listener for async operation |
import lombok.extern.log4j.Log4j2; | ||
|
||
/** | ||
* ExperimentRunningManager helps isolate the logic for running the logic in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slight awk phrasing.
@martin-gaievski @fen-qin I think I am ready for the next round of code reviews as I believe I addressed the comments mentioned prior. Please let me know about any other suggestions or concerns. |
Description
For requesting a regularly scheduled search evaluation, the user could add an
cron
parameter to denote the cron job schedule for running search evaluation.Some changes that are made are that there are now 3 new APIs for interacting with scheduling experiments. The endpoints are
experiment/<job_id>/schedule
which is applied to the GET and DELETE methods andexperiment/schedule
which is applied to the GET and POST methods.There are 2 new indices,
.scheduled-jobs
andsearch-relevance-scheduled-experiment-history
. The purpose of the.scheduled-jobs
index is to store the currently running experiment schedules. Thesearch-relevance-scheduled-experiment-history
index stores the historical experiment results with timestamps which were resulted from the scheduled job runner.Unit and integration tests are provided, however, additions such as workload management, integration with alerting and resource monitoring are not available in this pull request, but I would like to add those into a future pull request.
Please let me know if there are any questions or concerns.
Issues Resolved
#213 #226
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.