Add Import Capability for externally run Search Evaluation #120
base: main

Conversation
Force-pushed from 38ce2ce to 0ce542a
I think this implementation unnecessarily overcomplicates things. It follows closely the sequence from the PUT action, but there we actually run the experiment. For the POST action the logic should be much simpler.
At a high level, these are the over-engineered steps/areas:
public class PostExperimentTransportAction {
// Creates experiment with PROCESSING status
Experiment initialExperiment = new Experiment(
id, timestamp, type, AsyncStatus.PROCESSING, // WHY?
querySetId, searchConfigurationList, judgmentList, size, new ArrayList<>()
);
// Background async processing that's not needed
private void triggerAsyncProcessing(String experimentId, PostExperimentRequest request) {
// This should be synchronous!
}
// Complex atomic counters for tracking "pending" operations
AtomicInteger pendingQueries = new AtomicInteger(request.getEvaluationResultList().size());
AtomicBoolean hasFailure = new AtomicBoolean(false);
// Multiple status updates PROCESSING → COMPLETED
private void updateFinalExperiment(...) {
// Just create it as COMPLETED from the start!
}
}
You can improve the implementation by following this sequence:
protected void doExecute() {
try {
// Validate input (synchronously)
validateRequest(request);
// Create experiment as COMPLETED (data already processed)
String experimentId = UUID.randomUUID().toString();
List<Map<String, Object>> results = new ArrayList<>();
// Persist evaluation results (synchronously)
for (Map<String, Object> evalData : request.getEvaluationResultList()) {
EvaluationResult result = createEvaluationResult(evalData, request);
evaluationResultDao.putEvaluationResultSync(result); // Sync call
results.add(createResultSummary(result));
}
// Create final experiment with COMPLETED status
Experiment experiment = new Experiment(
experimentId, TimeUtils.getTimestamp(), request.getType(),
AsyncStatus.COMPLETED, // Already complete!
request.getQuerySetId(), request.getSearchConfigurationList(),
request.getJudgmentList(), request.getSize(), results
);
// Persist and return
experimentDao.putExperiment(experiment, listener);
} catch (Exception e) {
listener.onFailure(new SearchRelevanceException("Import failed", e, RestStatus.BAD_REQUEST));
}
}
try {
    String id = UUID.randomUUID().toString();
    LOGGER.warn("Experiment ID: " + id);
warn is a bit too much; info or debug suits better here.
makes sense, doing it now.
XContentParser parser = request.contentParser();
Map<String, Object> source = parser.map();

String querySetId = (String) source.get("querySetId");
please add validation for missing or blank values
done...
@sstults can you tackle the feedback from @martin-gaievski on this one when you are back in the office?
@sstults I went through again, and yes, I think @martin-gaievski flagged some other items separate from your previous PR. Also, I added you as a collaborator to
Still working on the first comment, but addressed the second two.
@martin-gaievski I believe I've addressed the issues found in the review. Could you re-review?
In addition to a few comments below, one missing thing is a proper integration test. It can be a simple happy path, but for a new API we need to have it.
Because it's a new API, we need to have an app sec review, and we cannot merge to the main branch before it is signed off by the security team. We can only merge to a feature branch. Do you have a proper design doc and threat model in order to start the app sec process?
UPD: As per later discussion, let's focus on addressing code-level comments (adding validations, tests, etc.). Once this is completed, the PR should be good to merge, and design and other technical details can be addressed in parallel. I've created a PR with a template of the technical document we'll be looking for (#201); it's still in review, but I don't expect any fundamental-level requests as part of the PR review.
^ @epugh @sstults
@Override
public ActionRequestValidationException validate() {
    return null;
I think you need to have validations for all required fields, something like:
ActionRequestValidationException validationException = null;
if (type == null) {
validationException = addValidationError("type is required", validationException);
}
if (querySetId == null || querySetId.isEmpty()) {
validationException = addValidationError("querySetId is required", validationException);
}
if (searchConfigurationList == null || searchConfigurationList.isEmpty()) {
validationException = addValidationError("searchConfigurationList is required", validationException);
}
if (evaluationResultList == null || evaluationResultList.isEmpty()) {
validationException = addValidationError("evaluationResultList is required", validationException);
}
return validationException;
I can imagine situations where we don't actually need things like a SearchConfiguration; however, I bet our frontend UI will blow up, so we should require them until we have a strong use case to NOT require them!
I've added checks for empty values, but I'm leaving the nullness check to the annotations.
List<Map<String, Object>> results = new ArrayList<>();

// Persist evaluation results (synchronously)
for (Map<String, Object> evalData : request.getEvaluationResultList()) {
Without a limit check, this loop is potentially dangerous: each result is written synchronously, potentially blocking the transport thread, and we don't have any timeout protection.
My recommendations:
- add a limit for number of elements in the list
- add timeout (optional)
- implement queue-based approach for large imports (super optional)
I think reference SEARCH_RELEVANCE_QUERY_SET_MAX_LIMIT (and fix the JavaDocs!)
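For illustration, a count check along those lines could live in validate(); this is a minimal sketch, where maxEvaluationResults is a hypothetical field assumed to carry the value of SEARCH_RELEVANCE_QUERY_SET_MAX_LIMIT (or a new analogous setting), and addValidationError is the usual static import from org.opensearch.action.ValidateActions:

// Sketch only: maxEvaluationResults is a hypothetical field wired to a
// cluster setting such as SEARCH_RELEVANCE_QUERY_SET_MAX_LIMIT.
@Override
public ActionRequestValidationException validate() {
    ActionRequestValidationException validationException = null;
    if (evaluationResultList != null && evaluationResultList.size() > maxEvaluationResults) {
        validationException = addValidationError(
            "evaluationResultList exceeds the maximum of " + maxEvaluationResults + " entries",
            validationException
        );
    }
    return validationException;
}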
);

// Persist and return
experimentDao.putExperiment(experiment, listener);
What is the behavior in case the last out of 100 evaluation results has failed? I assume 99 evaluation records will be stored, the last evaluation record is skipped, and the experiment document is skipped too. We should do better in this case. Unfortunately we cannot do transactional semantics, or rather it's very hard to implement. But because we saved some of the evaluation documents, we need to store the experiment document with status ERROR instead of COMPLETED.
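A sketch of that fallback in the synchronous flow, assuming the AsyncStatus enum carries an ERROR value alongside PROCESSING and COMPLETED, and reusing the Experiment constructor shown earlier in this thread:

// If any evaluation result failed to persist, store the experiment with
// ERROR status instead of skipping the experiment document entirely.
AsyncStatus finalStatus = hasFailure.get() ? AsyncStatus.ERROR : AsyncStatus.COMPLETED;
Experiment experiment = new Experiment(
    experimentId, TimeUtils.getTimestamp(), request.getType(),
    finalStatus,
    request.getQuerySetId(), request.getSearchConfigurationList(),
    request.getJudgmentList(), request.getSize(), results
);
experimentDao.putExperiment(experiment, listener);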
I don't think we should block this PR on a sign-off from a security team separate from the OpenSearch Project. Many companies use OpenSearch, and they will all have different security requirements and processes. It's not tenable to put any of those processes into the critical path of committing code into open source. That being said, I think it's totally fair game for the OpenSearch Project to implement certain standards and requirements related to security (e.g. new APIs must include a high-level architecture and threat model as a part of design and implementation). I'm tagging @getsaurabh02 @msfroh from the build interest group, as "compliance with security standards" is part of their charter. Do you have anything to add here?

@martin-gaievski I'm a little concerned that this is something of an 11th-hour requirement from @epugh's perspective. How can we get this merged before the 3.2 release? Are there any existing artifacts that Eric can use as an example for this case to compile in the next couple days? I believe this plugin was released for the first time in the 3.1 distribution, and it introduced new APIs. What design docs/threat models were used for that case? I would think this case should just be a very small update to those existing artifacts.
I think the most effective way to address requirements is to put those technical details and security-related answers into an RFC; there are many examples of those with the RFC label, even in this repository: RFC1, RFC2, RFC3. Eric and team can have an extended section regarding security issues as part of the RFC - what are the potential risks (like limiting the type and size of user-provided content), how they are addressed in the current implementation, any new feature flags, etc.
@martin-gaievski If you have specific technical concerns like this, then as a maintainer you can and should ensure they are addressed before merging. Similarly, if you have specific concerns about making this behavior available by default, then you can propose to make it opt-in via a feature flag. My biggest concern is preventing code that is otherwise ready from being merged pending an opaque review from a specific company.
That's what I put into my recent review, for instance this comment.
I will take the template that was added: https://github.com/opensearch-project/search-relevance/blob/main/docs/DESIGN_TEMPLATE.md and go through it and add details to cover this PR. @martin-gaievski are we thinking this document goes into the PR as text, or we create an issue with the text, or do we add a new doc in docs/PR-120-TECHNICAL-DESIGN.md? Or is it an RFC? What are we thinking the next steps are?
In general, a new GitHub issue is the best option, which would look very much like an RFC. This approach makes sense for relatively big features with multiple components and user-facing changes. For smaller features or additions to existing features, this can be part of the feature request (if available at the time of creation, or added as a comment) or most realistically, part of the PR description. For this PR, because there appears to be no overall design documentation for Experiments in SRW (please correct me if I'm wrong and I've simply missed the relevant document), I believe we should create a combined RFC with:
To meet the upcoming 3.2 deadline, we can start by adding details related to the new import experiment API directly to this PR (I think most details are already there, just make sure everything is up to date). This should be sufficient to merge the PR. The broader RFC can be developed in parallel and shouldn't block this PR for the 3.2 release.
The design for the Experiments was covered in the Search Relevance Workbench RFC. A lot of the questions that the design template highlights, though, like "Threat Analysis", are not covered at all in the RFC. I will take your suggestion and, for this PR, modify the PR and add the details out of the DESIGN_TEMPLATE.md and see how it goes!
Yes, the [Search Relevance Workbench RFC](opensearch-project/OpenSearch#17735) answers the high-level questions, thanks for sharing the link. Then an update in this PR only should be sufficient.
@martin-gaievski I wanted to draw your attention to the separate PR that @sstults linked with the design doc: #205. I know we talked about it being in the GitHub issue as well. How does this look towards getting this reviewed and merged? I'm aware of the narrowing window we have for getting this in for 3.2! @sstults also has promised to get those last review comments dealt with this weekend.
Also, I think I am just realizing that I need two approvals in this repo for a merge? I'm going to tag more maintainers to get review....
I would even argue that if a company has an internal review process that needs to be followed before exposing a new feature that has been accepted into the project following the normal code review process, the onus should be on that company to leave the feature off until they do their review. That is, I would make it opt-out instead of opt-in. Of course, asking for the ability to easily disable the feature (maybe via a static node-level flag) is reasonable. Also, there's nothing stopping anyone from patching out a feature on their internal branch.
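A minimal sketch of such a static node-level flag using OpenSearch's Setting API; the setting key and default here are illustrative, not the plugin's actual configuration:

import org.opensearch.common.settings.Setting;

// Hypothetical node-scoped flag: enabled by default (opt-out), so operators
// with an internal review process can disable the import API in opensearch.yml.
public static final Setting<Boolean> EXPERIMENT_IMPORT_ENABLED = Setting.boolSetting(
    "plugins.search_relevance.experiment_import.enabled",
    true,
    Setting.Property.NodeScope
);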
Force-pushed from 88988f3 to a46c34c
Would you like to rebase from the current main branch?
Re-implementation of the work that @sstults did in opensearch-project#73 that uses a POST verb to set up the import instead of an ExperimentType.POINTWISE_EVALUATION_IMPORT. Signed-off-by: Eric Pugh <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>
…experiment_results
Signed-off-by: Eric Pugh <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>
I did some testing, and what we import doesn't mirror what we generate when directly running an experiment. I am going to work on this PR for a bit, but if it gets too confusing I may start yet another PR as a fresh attempt. Ugh.
Signed-off-by: Eric Pugh <[email protected]>
…experiment_results
Signed-off-by: Eric Pugh <[email protected]>
…W (and the import API ;-) )
@epugh
Because this feature imports external experiment results into SRW, we'd like to evaluate the risk carefully.
Let me share the validation check items:

1. Number of Evaluation Results

The RestPostExperimentAction parses the entire input request body using parser.map(), which deserializes the whole JSON payload into memory. Large evaluation results can exhaust memory or processing resources, leading to denial of service (DoS) or node instability. Add explicit checks for evaluation result size, either via OpenSearch settings or directly in the REST handler (e.g., check request.content().length()). As an example, we introduced a setting for imported query set size; we can extend it to be more general, or make another size limit for external experiment results.
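For illustration, a byte-size guard in the REST handler before parser.map() runs could look like the sketch below; the 10 MB cap is an assumed value, and OpenSearchStatusException stands in for whatever error type the handler actually uses:

// Assumed cap; in practice this would come from a cluster setting,
// analogous to the existing imported query set size limit.
private static final long MAX_IMPORT_BODY_BYTES = 10 * 1024 * 1024;

// Reject oversized payloads before deserializing the whole body into memory.
if (request.content().length() > MAX_IMPORT_BODY_BYTES) {
    throw new OpenSearchStatusException(
        "Import payload exceeds " + MAX_IMPORT_BODY_BYTES + " bytes",
        RestStatus.REQUEST_ENTITY_TOO_LARGE
    );
}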
2. Size of Individual Fields

There is minimal validation for the size of individual fields (e.g., a single searchText, elements of metrics, document IDs, etc.). The REST layer does not check for maximum string lengths or map/list sizes within each result. Large individual fields (e.g., a searchText with megabytes of content, or extremely large metric arrays) can cause excessive memory or storage use, slow processing, or even failures in downstream DAO layers. Can you define and enforce maximum allowed sizes for key fields in the REST handler or the validate() method? As an example, we added TextValidationUtil for imported query sets - https://github.com/opensearch-project/search-relevance/blob/main/src/main/java/org/opensearch/searchrelevance/utils/TextValidationUtil.java; we can extend it to be more general, or add more functions for external experiment results.
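A sketch of a per-field check in the spirit of TextValidationUtil; the helper name and length limit here are hypothetical:

// Hypothetical helper: rejects blank values and oversized strings.
private static final int MAX_TEXT_FIELD_LENGTH = 2000; // illustrative limit

static void validateTextField(String fieldName, String value) {
    if (value == null || value.isBlank()) {
        throw new IllegalArgumentException(fieldName + " must not be blank");
    }
    if (value.length() > MAX_TEXT_FIELD_LENGTH) {
        throw new IllegalArgumentException(
            fieldName + " exceeds the maximum length of " + MAX_TEXT_FIELD_LENGTH
        );
    }
}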
(I meant for this review to be an inline one. I have added the proper review below. Feel free to dismiss this comment.) What were you thinking for the behavior of hasFailure when putting an evaluation? Should the loop continue to add experiment results, or would there be a short-circuit where null results are returned?
        updateFinalExperiment(experimentId, request, finalResults, judgmentList, hasFailure.get());
    }
}, error -> {
    hasFailure.set(true);
What were you thinking for the behavior of hasFailure when putting an evaluation? Should the loop continue to add experiment results, or would there be a short-circuit where null results are returned?
I think if anything fails, then we mark the import as failed. We don't really want a partial import; it wouldn't be useful. And it's not like we are loading data that is so big that you'd say "oh good, I got 1 out of 3 terabytes of data in, and now I fix my bug and start with the remainder"! If anything goes wrong, just set the whole thing to failed.
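Under that policy, the synchronous loop from the suggested sequence above could short-circuit on the first failed write; a sketch, reusing the names from that earlier snippet:

// Abort on the first failed write so no partial import is reported as COMPLETED.
for (Map<String, Object> evalData : request.getEvaluationResultList()) {
    try {
        EvaluationResult result = createEvaluationResult(evalData, request);
        evaluationResultDao.putEvaluationResultSync(result); // sync call
        results.add(createResultSummary(result));
    } catch (Exception e) {
        listener.onFailure(new SearchRelevanceException("Import failed", e, RestStatus.BAD_REQUEST));
        return; // short-circuit: the experiment is not stored as COMPLETED
    }
}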
1) The `movies_queryset.json` was hand extracted.

1) The `movies_queryset.json` was done using the Python script `extract_quepid_judgements.py`.
I noticed that on lines 5 and 7, you mentioned that movies_queryset.json was both hand extracted and done using the Python script.
metrics = []
for doc in item.get("ratings", []):
    doc_id = doc.get("doc_id", "")
    rating = {
What is the rating object used for?
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Description

This addresses Issue #46, which allows importing of documents and evaluation results. This is a re-implementation of the work that @sstults did in #73 after the discussion about using the POST HTTP verb. It also updates to the latest data structures.

Issues Resolved

Closes #46
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.