
Conversation

@epugh (Collaborator) commented Jun 14, 2025

Description

This addresses Issue #46, adding the ability to import documents and evaluation results.

This is a re-implementation of the work that @sstults did in #73, after the discussion about using the POST HTTP verb. It also updates to the latest data structures.

Issues Resolved

Closes #46

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski (Member) left a comment

I think this implementation unnecessarily overcomplicates things. It closely follows the sequence from the PUT action, but there we actually run the experiment. For the POST action the logic should be much simpler.
At a high level, these are the over-engineered steps/areas:

 public class PostExperimentTransportAction {
    
    // Creates experiment with PROCESSING status
    Experiment initialExperiment = new Experiment(
        id, timestamp, type, AsyncStatus.PROCESSING, // WHY?
        querySetId, searchConfigurationList, judgmentList, size, new ArrayList<>()
    );
    
    // Background async processing that's not needed
    private void triggerAsyncProcessing(String experimentId, PostExperimentRequest request) {
        // This should be synchronous!
    }
    
    // Complex atomic counters for tracking "pending" operations
    AtomicInteger pendingQueries = new AtomicInteger(request.getEvaluationResultList().size());
    AtomicBoolean hasFailure = new AtomicBoolean(false);
    
    // Multiple status updates PROCESSING → COMPLETED
    private void updateFinalExperiment(...) {
        // Just create it as COMPLETED from the start!
    }
}

You can improve the implementation by following this sequence:

protected void doExecute() {
    try {
        // Validate input (synchronously)
        validateRequest(request);
        
        // Create experiment as COMPLETED (data already processed)
        String experimentId = UUID.randomUUID().toString();
        List<Map<String, Object>> results = new ArrayList<>();
        
        // Persist evaluation results (synchronously) 
        for (Map<String, Object> evalData : request.getEvaluationResultList()) {
            EvaluationResult result = createEvaluationResult(evalData, request);
            evaluationResultDao.putEvaluationResultSync(result); // Sync call
            results.add(createResultSummary(result));
        }
        
        // Create final experiment with COMPLETED status
        Experiment experiment = new Experiment(
            experimentId, TimeUtils.getTimestamp(), request.getType(),
            AsyncStatus.COMPLETED, // Already complete!
            request.getQuerySetId(), request.getSearchConfigurationList(),
            request.getJudgmentList(), request.getSize(), results
        );
        
        // Persist and return
        experimentDao.putExperiment(experiment, listener);
        
    } catch (Exception e) {
        listener.onFailure(new SearchRelevanceException("Import failed", e, RestStatus.BAD_REQUEST));
    }
}


try {
    String id = UUID.randomUUID().toString();
    LOGGER.warn("Experiment ID: " + id);
Member

warn is a bit too much; info or debug suits better here.

Collaborator Author

makes sense, doing it now.
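For reference, a minimal sketch of the suggested change, assuming the plugin's Log4j2-style logger (the parameterized form also avoids building the message string when the level is disabled):

// sketch: lower the level and use a parameterized message
String id = UUID.randomUUID().toString();
LOGGER.debug("Experiment ID: {}", id);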

XContentParser parser = request.contentParser();
Map<String, Object> source = parser.map();

String querySetId = (String) source.get("querySetId");
Member

please add validation for missing or blank values

Collaborator Author

done...
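A minimal sketch of what such a check could look like, assuming validation happens right after parsing (the exception choice is illustrative; OpenSearch maps IllegalArgumentException to a 400 response):

Map<String, Object> source = parser.map();
String querySetId = (String) source.get("querySetId");
// reject missing or blank values before doing any work
if (querySetId == null || querySetId.trim().isEmpty()) {
    throw new IllegalArgumentException("querySetId is required and must not be blank");
}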

@epugh (Collaborator Author) commented Jul 21, 2025

@sstults can you tackle the feedback from @martin-gaievski on this one when you are back in the office?

@epugh (Collaborator Author) commented Jul 23, 2025

@sstults I went through again, and yes, I think @martin-gaievski flagged some other items separate from your previous PR. Also, I added you as a collaborator to epugh/search-relevance so you can update the PR directly.

@sstults commented Jul 24, 2025

Still working on the first comment, but I've addressed the other two.

@sstults commented Jul 28, 2025

@martin-gaievski I believe I've addressed the issues found in the review. Could you re-review?

@martin-gaievski (Member) left a comment

In addition to the few comments below, one missing thing is a proper integration test. It can be a simple happy path, but for a new API we need to have one.
Because it's a new API we need to have an app sec review, and we cannot merge to the main branch before it is signed off by the security team. We can only merge to a feature branch. Do you have a proper design doc and threat model in order to start the app sec process?

UPD: As per later discussion, let's focus on addressing the code-level comments (adding validations, tests, etc.). Once this is completed the PR should be good to merge, and the design and other technical details can be addressed in parallel. I've created a PR with a template of the technical document we'll be looking for, #201; it's still in review, but I don't expect any fundamental-level requests as part of that PR's review.
^ @epugh @sstults


@Override
public ActionRequestValidationException validate() {
    return null;
Member

I think you need to have validations for all required fields, something like:

    ActionRequestValidationException validationException = null;
    
    if (type == null) {
        validationException = addValidationError("type is required", validationException);
    }
    if (querySetId == null || querySetId.isEmpty()) {
        validationException = addValidationError("querySetId is required", validationException);
    }
    if (searchConfigurationList == null || searchConfigurationList.isEmpty()) {
        validationException = addValidationError("searchConfigurationList is required", validationException);
    }
    if (evaluationResultList == null || evaluationResultList.isEmpty()) {
        validationException = addValidationError("evaluationResultList is required", validationException);
    }
    
    return validationException;

Collaborator Author

I can imagine situations where we don't actually need things like a SearchConfiguration; however, I bet our frontend UI will blow up, so we should require them until we have a strong use case to NOT require them!

I've added checks for empty values, but I'm leaving the nullness check to the annotations.

List<Map<String, Object>> results = new ArrayList<>();

// Persist evaluation results (synchronously)
for (Map<String, Object> evalData : request.getEvaluationResultList()) {
Member

Without checking for a limit this loop is potentially dangerous: each result is written synchronously, potentially blocking the transport thread, and we don't have any timeout protection.
My recommendations:

  • add a limit for the number of elements in the list (see the sketch after the reply below)
  • add a timeout (optional)
  • implement a queue-based approach for large imports (super optional)

Collaborator Author

I think we can reference SEARCH_RELEVANCE_QUERY_SET_MAX_LIMIT (and fix the JavaDocs!).
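A minimal sketch of the limit check, reusing the settings accessor shown later in this thread; the exception constructor shape is an assumption:

// guard before the synchronous write loop
int maxResults = settingsAccessor.getMaxQuerySetAllowed();
if (request.getEvaluationResultList().size() > maxResults) {
    listener.onFailure(new SearchRelevanceException(
        "evaluationResultList exceeds the maximum of " + maxResults + " entries",
        RestStatus.BAD_REQUEST));
    return;
}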

);

// Persist and return
experimentDao.putExperiment(experiment, listener);
Member

What is the behavior in case the last of 100 evaluation results fails? I assume 99 evaluation records will be stored, the last evaluation record is skipped, and the experiment document is skipped too. We should do better in this case. Unfortunately we cannot have transactional semantics, or rather it's very hard to implement. But because we saved some of the evaluation documents, we need to store the experiment document with status ERROR instead of COMPLETED.
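A sketch of the suggested behavior, assuming AsyncStatus has an ERROR value as the comment implies (the other names follow the code quoted earlier in the review):

// persist the experiment with ERROR status if any evaluation write failed,
// so a partial import is visible instead of silently lost
AsyncStatus status = hasFailure ? AsyncStatus.ERROR : AsyncStatus.COMPLETED;
Experiment experiment = new Experiment(
    experimentId, TimeUtils.getTimestamp(), request.getType(), status,
    request.getQuerySetId(), request.getSearchConfigurationList(),
    request.getJudgmentList(), request.getSize(), results
);
experimentDao.putExperiment(experiment, listener);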

@andrross (Member)

> Because it's a new API we need to have an app sec review, and we cannot merge to the main branch before it is signed off by the security team. We can only merge to a feature branch. Do you have a proper design doc and threat model in order to start the app sec process?

I don't think we should block this PR on a sign-off from a security team separate from the OpenSearch Project. Many companies use OpenSearch, and they will all have different security requirements and processes. It's not tenable to put any of those processes into the critical path of committing code into open-source.

That being said, I think it's totally fair game for the OpenSearch Project to implement certain standards and requirements related to security (e.g. new APIs must include a high level architecture and threat model as a part of design and implementation). I'm tagging @getsaurabh02 @msfroh from the build interest group as "compliance with security standards" is part of their charter. Do you have anything to add here?

@martin-gaievski I'm a little concerned that this is something of an 11th hour requirement from @epugh's perspective. How can we get this merged before the 3.2 release? Are there any existing artifacts that Eric can use as an example for this case to compile in the next couple days? I believe this plugin was released for the first time in the 3.1 distribution and it introduced new APIs. What design docs/threat models were used for that case? I would think this case should just be a very small update to those existing artifacts.

@martin-gaievski (Member) commented Jul 30, 2025

>> Because it's a new API we need to have an app sec review, and we cannot merge to the main branch before it is signed off by the security team. We can only merge to a feature branch. Do you have a proper design doc and threat model in order to start the app sec process?
>
> I don't think we should block this PR on a sign-off from a security team separate from the OpenSearch Project. Many companies use OpenSearch, and they will all have different security requirements and processes. It's not tenable to put any of those processes into the critical path of committing code into open-source.
>
> That being said, I think it's totally fair game for the OpenSearch Project to implement certain standards and requirements related to security (e.g. new APIs must include a high level architecture and threat model as a part of design and implementation). I'm tagging @getsaurabh02 @msfroh from the build interest group as "compliance with security standards" is part of their charter. Do you have anything to add here?
>
> @martin-gaievski I'm a little concerned that this is something of an 11th hour requirement from @epugh's perspective. How can we get this merged before the 3.2 release? Are there any existing artifacts that Eric can use as an example for this case to compile in the next couple days? I believe this plugin was released for the first time in the 3.1 distribution and it introduced new APIs. What design docs/threat models were used for that case? I would think this case should just be a very small update to those existing artifacts.

I think the most effective way to address the requirements is to put those technical details and security-related answers into an RFC; there are many examples of those with the RFC label, even in this repository: RFC1, RFC2, RFC3. Eric and team can have an extended section regarding security issues as part of the RFC: what the potential risks are (like limiting the type and size of user-provided content), how they are addressed in the current implementation, any new feature flags, etc.
Regarding the features of search-relevance that were released in 3.1, there are RFCs with those details for some of the features: ref1, ref2, ref3. I'm not sure if you're aware of this, but most content of this repository was migrated from the o19s repository close to the 3.1 code freeze date (ref). That repo was deleted after the migration, including all GH issues that may have had those technical details. In 3.1 all features were behind a feature flag in "opt in" mode, partially because of the in-process security review (ref). It was flipped to "opt out" only recently, after all security concerns were addressed.

@andrross (Member)

> what are potential risks (like limiting the type and size of user-provided content)

@martin-gaievski If you have specific technical concerns like this, then as a maintainer you can and should ensure they are addressed before merging. Similarly if you have specific concerns about making this behavior available by default then you can propose to make it opt-in via a feature flag. My biggest concern is preventing code that is otherwise ready from being merged pending an opaque review from a specific company.

@martin-gaievski (Member)

>> what are potential risks (like limiting the type and size of user-provided content)
>
> @martin-gaievski If you have specific technical concerns like this, then as a maintainer you can and should ensure they are addressed before merging. Similarly if you have specific concerns about making this behavior available by default then you can propose to make it opt-in via a feature flag. My biggest concern is preventing code that is otherwise ready from being merged pending an opaque review from a specific company.

That's what I put into my recent review, for instance this comment.
Looking at the SRW experiments in general, I cannot find any document that describes the design, event flow, and other similar details (@epugh or @sstults, correct me if I'm wrong about this). That is my main reason for requesting a GH issue (or similar doc). While this PR extends the workflow for experiments in SRW, it's hard to understand the implications of this change without seeing the whole picture.
I want to clarify that my intention is not to prevent code from being merged unnecessarily. Rather, my concerns are focused on ensuring that the code being integrated is ready and of high quality. As maintainers, our responsibility is to keep the system stable and free from regressions. Having proper documentation and understanding the full design helps us achieve this goal. I believe that addressing these technical concerns through collaborative discussion will lead to a better product for all users.

@epugh (Collaborator Author) commented Jul 31, 2025

I will take the template that was added, https://github.com/opensearch-project/search-relevance/blob/main/docs/DESIGN_TEMPLATE.md, go through it, and add details to cover this PR. @martin-gaievski are we thinking this document goes into the PR as text, or do we create an issue with the text, or do we add a new doc at docs/PR-120-TECHNICAL-DESIGN.md? Or is it an RFC? What are we thinking the next steps are?

@martin-gaievski (Member)

> I will take the template that was added, https://github.com/opensearch-project/search-relevance/blob/main/docs/DESIGN_TEMPLATE.md, go through it, and add details to cover this PR. @martin-gaievski are we thinking this document goes into the PR as text, or do we create an issue with the text, or do we add a new doc at docs/PR-120-TECHNICAL-DESIGN.md? Or is it an RFC? What are we thinking the next steps are?

In general, a new GitHub issue is the best option, which would look very much like an RFC. This approach makes sense for relatively big features with multiple components and user-facing changes. For smaller features or additions to existing features, this can be part of the feature request (if available at the time of creation, or added as a comment) or most realistically, part of the PR description.

For this PR, because there appears to be no overall design documentation for Experiments in SRW (please correct me if I'm wrong and I've simply missed the relevant document), I believe we should create a combined RFC with:

  • A high-level overview of the Experiment flow in SRW (with fewer details, primarily for context to understand why we need the import API)
  • Details about the functionality added in this PR

To meet the upcoming 3.2 deadline, we can start by adding details related to the new import experiment API directly to this PR (I think most details are already there, just make sure everything is up to date). This should be sufficient to merge the PR. The broader RFC can be developed in parallel and shouldn't block this PR for the 3.2 release.

@epugh (Collaborator Author) commented Jul 31, 2025

The design for the Experiments was covered in the Search Relevance Workbench RFC. A lot of the questions that the design template highlights, though, like "Threat Analysis", are not covered at all in the RFC.

I will take your suggestion and, for this PR, modify the PR and add the details out of the DESIGN_TEMPLATE.md and see how it goes!

@martin-gaievski (Member)

> The design for the Experiments was covered in the Search Relevance Workbench RFC. A lot of the questions that the design template highlights, though, like "Threat Analysis", are not covered at all in the RFC.
>
> I will take your suggestion and, for this PR, modify the PR and add the details out of the DESIGN_TEMPLATE.md and see how it goes!

Yes, the [Search Relevance Workbench RFC](opensearch-project/OpenSearch#17735) answers the high-level questions, thanks for sharing the link. Then an update in this PR only should be sufficient.

@epugh (Collaborator Author) commented Aug 2, 2025

@martin-gaievski I wanted to draw your attention to the separate PR that @sstults linked with the design doc: #205. I know we talked about it being in the GitHub issue as well. How does this look towards getting this reviewed and merged? I'm aware of the narrowing window we have for getting this in for 3.2! @sstults has also promised to get those last review comments dealt with this weekend.

@epugh epugh requested a review from wrigleyDan August 2, 2025 11:00
@epugh (Collaborator Author) commented Aug 2, 2025

Also, I think I am just realizing that I need two approvals in this repo for a merge? I'm going to tag more maintainers to get reviews...

@epugh epugh requested review from fen-qin and heemin32 August 2, 2025 11:03
@msfroh commented Aug 2, 2025

> Similarly if you have specific concerns about making this behavior available by default then you can propose to make it opt-in via a feature flag.

I would even argue that if a company has an internal review process that needs to be followed before exposing a new feature that has been accepted into the project following the normal code review process, the onus should be on that company to leave the feature off until they do their review. That is, I would make it opt-out instead of opt-in.

Of course, asking for the ability to easily disable the feature (maybe via a static node-level flag) is reasonable. Also, there's nothing stopping anyone from patching out a feature on their internal branch.
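A sketch of such a flag as a static node-level setting; the setting name here is hypothetical, but Setting.boolSetting with Property.NodeScope is the standard OpenSearch plugin mechanism for this:

import org.opensearch.common.settings.Setting;

// defaults to enabled (opt-out), as argued above; operators can turn it off per node
public static final Setting<Boolean> EXPERIMENT_IMPORT_ENABLED = Setting.boolSetting(
    "plugins.search_relevance.experiment_import.enabled",
    true,
    Setting.Property.NodeScope
);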

@sstults sstults force-pushed the use_post_to_import_experiment_results branch from 88988f3 to a46c34c Compare August 2, 2025 18:55
@fen-qin (Collaborator) commented Aug 2, 2025

Would you like to rebase from the current main branch?

epugh added 3 commits August 2, 2025 18:24
Re-implementation of the work that @sstults did in opensearch-project#73 that uses a POST verb to set up the import instead of an ExperimentType.POINTWISE_EVALUATION_IMPORT.

Signed-off-by: Eric Pugh <[email protected]>
Signed-off-by: Eric Pugh <[email protected]>
@epugh (Collaborator Author) commented Aug 25, 2025

I did some testing, and what we import doesn't mirror what we generate when directly running an experiment. I am going to work on this PR for a bit, but if it gets too confusing I may start yet another PR as a fresh attempt. Ugh.

@fen-qin (Collaborator) commented Sep 19, 2025

@epugh
Can you confirm that this PR is up to date and ready for final review, since you mentioned:

> I may start yet another PR as a fresh attempt

Because this feature aims to import external experiment results into SRW, we'd like to evaluate the risk carefully:

  • There are risks of resource exhaustion due to missing payload and field-size validation.
  • There are not enough restrictions/validations for external experiment results in the current code, especially at the API boundary.
  • Would you like to add explicit limits/checks for payload size, result count, and field size to improve robustness and security?

Let me share the validation check items:

1. Number of Evaluation Results

The RestPostExperimentAction parses the entire input request body using parser.map(), which deserializes the whole JSON payload into memory. Large evaluation results can exhaust memory or processing resources, leading to denial-of-service (DoS) or node instability. Add explicit checks for evaluation result size, either via OpenSearch settings or directly in the REST handler (e.g., check request.content().length()). A hypothetical guard is sketched below.
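Along those lines, this illustrative guard would run in the REST handler before parser.map() materializes the payload (the 10 MB cap is a placeholder, not an agreed value):

private static final long MAX_IMPORT_BYTES = 10 * 1024 * 1024; // placeholder limit

// reject oversized bodies before deserializing the whole payload
if (request.content().length() > MAX_IMPORT_BYTES) {
    throw new IllegalArgumentException("import payload exceeds " + MAX_IMPORT_BYTES + " bytes");
}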

As an example, we introduced a setting for imported query set size; we can extend it to be more general, or add another size limit for external experiment results:

int maxLimit = settingsAccessor.getMaxQuerySetAllowed();
int processCount = Math.min(evaluationResultList.size(), maxLimit);

2. Size of Individual Fields

There is minimal validation of the size of individual fields (e.g., a single searchText, elements of metrics, document IDs, etc.). The REST layer does not check for maximum string lengths or map/list sizes within each result. Large individual fields (e.g., a searchText with megabytes of content, or extremely large metric arrays) can cause excessive memory or storage use, slow processing, or even failures in downstream DAO layers. Can you define and enforce maximum allowed sizes for key fields in the REST handler or the validate() method?

As an example, we added TextValidationUtil for imported query sets (https://github.com/opensearch-project/search-relevance/blob/main/src/main/java/org/opensearch/searchrelevance/utils/TextValidationUtil.java); we can extend it to be more general, or add more functions for external experiment results.
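A sketch of a field-length guard in the spirit of TextValidationUtil; the method name and the example cap are assumptions for illustration:

private static void validateFieldLength(String fieldName, String value, int maxLength) {
    if (value != null && value.length() > maxLength) {
        throw new IllegalArgumentException(
            fieldName + " exceeds the maximum length of " + maxLength + " characters");
    }
}

// e.g. validateFieldLength("searchText", searchText, 2000);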

@ajleong623 (Contributor) left a comment

(I meant for this review to be an inline one. I have added the proper review below. Feel free to dismiss this comment.) What were you thinking for the behavior of hasFailure when putting an evaluation? Should the loop continue to add experiment results, or would there be a short-circuit where null results are returned?

updateFinalExperiment(experimentId, request, finalResults, judgmentList, hasFailure.get());
}
}, error -> {
hasFailure.set(true);
Contributor

What were you thinking for the behavior of hasFailure when putting an evaluation? Should the loop continue to add experiment results, or would there be a short-circuit where null results are returned?

Collaborator Author

I think if anything fails, then we mark the import as failed. We don't really want a partial import; it wouldn't be useful. And it's not like we are loading data that is sooo big that you are like "oh good, I got 1 out of 3 terabytes of data in, and now I fix my bug and start with the remainder"! If anything goes wrong, just set the whole thing to failed.
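A sketch of that fail-fast behavior in the synchronous shape suggested earlier in the review (the DAO and helper names come from the review comments above; the early break is the point being illustrated):

AtomicBoolean hasFailure = new AtomicBoolean(false);
for (Map<String, Object> evalData : request.getEvaluationResultList()) {
    try {
        evaluationResultDao.putEvaluationResultSync(createEvaluationResult(evalData, request));
    } catch (Exception e) {
        hasFailure.set(true);
        break; // short-circuit: a partial import is not useful
    }
}
// then persist the experiment with a failed status if hasFailure is set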


1) The `movies_queryset.json` was hand extracted.

1) The `movies_queryset.json` was done using the Python script `extract_quepid_judgements.py`.
Contributor

I noticed that on lines 5 and 7, you mentioned that movies_queryset.json was both hand extracted and done using the Python script.

metrics = []
for doc in item.get("ratings", []):
    doc_id = doc.get("doc_id", "")
    rating = {
Contributor

What is the rating object used for?

@epugh epugh added v3.4.0 and removed v3.3.0 labels Sep 30, 2025
codecov bot commented Sep 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.00%. Comparing base (b4ded15) to head (e3f1d90).
⚠️ Report is 13 commits behind head on main.


Development

Successfully merging this pull request may close these issues:

[FEATURE] Add Import Capability for externally run Search Evaluation