Contributor

@night-owl-1709 night-owl-1709 commented Aug 3, 2025

Description

This PR adds integration tests that validate OpenSearch's handling of documents with mixed vector and non-vector fields across multiple engines (FAISS, Lucene) and query types (k-NN, script scoring, filtered queries, exists queries, and terms aggregations). The tests cover both functional validation and performance checks, and the test code includes error handling and logging.

Components:

  • MixedVectorDocumentIT.java: Comprehensive functional and performance validations with enhanced error handling and logging

MixedVectorDocumentIT.java – Integration Tests

| Feature/Test | Description |
|---|---|
| Mixed Document Handling | Inserts both vector and non-vector docs; ensures proper query handling with unique document IDs |
| k-NN Search | Validates results only from vector-bearing docs across engines with comprehensive result validation |
| Script Score Query | Uses cosine similarity in script scoring with performance validation and proper error handling |
| Filtered k-NN | Combines term filters with vector scoring queries with category-based validation |
| Exists Query | Ensures only documents with vector field are counted with precise count validation |
| Terms Aggregation | Checks categorical aggregation from vector-indexed documents with error handling |
| _forcemerge | Ensures correct behaviour after segments are merged with comprehensive validation |
| Engine Variants | Runs across FAISS and Lucene engines with engine-optimised HNSW parameters |
| Space Types | Validates L2 and Cosine Similarity (excludes Inner Product for Lucene compatibility) |
| Disk-Based Index | FAISS tests include force-merge optimisation with proper settings management |
| Batch Processing | Efficient batch document insertion with unique ID ranges to prevent conflicts |

Query Types Covered

  • knn queries with multiple k values and result validation
  • script_score with cosine similarity and score verification
  • exists field validation with count assertions
  • term + bool filtered queries with category validation
  • terms aggregation on categorical fields with proper error handling
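
As a rough illustration of the first two query types in the list above, the request bodies might be assembled along these lines. This is a sketch only: the field name `test_vector`, the class name, and the helper names are assumptions made for the example and are not taken from MixedVectorDocumentIT.java. The `knn` query shape and the `knn_score` script parameters follow the k-NN plugin's documented request format.

```java
// Hypothetical helper, not part of the PR: builds request bodies as plain JSON strings.
final class MixedVectorQueries {
    // Assumed field name, chosen for illustration only.
    static final String VECTOR_FIELD = "test_vector";

    // Renders a float array as a JSON array, e.g. [1.0,0.5].
    static String vectorJson(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(vector[i]);
        }
        return sb.append(']').toString();
    }

    // Approximate k-NN query: asks for the k nearest neighbours of the given vector.
    static String knnQuery(float[] vector, int k) {
        return "{\"size\":" + k + ",\"query\":{\"knn\":{\"" + VECTOR_FIELD + "\":{\"vector\":"
            + vectorJson(vector) + ",\"k\":" + k + "}}}}";
    }

    // Script-score query using cosine similarity. The inner exists query restricts
    // scoring to documents that actually carry the vector field.
    static String scriptScoreQuery(float[] vector) {
        return "{\"query\":{\"script_score\":{\"query\":{\"exists\":{\"field\":\"" + VECTOR_FIELD + "\"}},"
            + "\"script\":{\"source\":\"knn_score\",\"lang\":\"knn\",\"params\":{\"field\":\"" + VECTOR_FIELD
            + "\",\"query_value\":" + vectorJson(vector) + ",\"space_type\":\"cosinesimil\"}}}}}";
    }
}
```

Wrapping the script in an `exists` query is one way to keep non-vector documents out of the scored set; whether the tests in this PR do the same is not stated here.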

Engines & Configuration

| Engine | Space Types | Notes |
|---|---|---|
| FAISS | L2, Cosine | Dense vector, disk-based, force-merge testing with optimised parameters |
| Lucene | L2, Cosine | Native vector scoring with memory optimisation (excludes Inner Product for compatibility) |
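
For context, a mapping for one engine and space type combination might look roughly like the sketch below. The field names, the dimension, and the HNSW parameter values (`m`, `ef_construction`) are illustrative assumptions, not the values used in this PR. The keys `knn_vector`, `method`, `engine`, and `space_type` are the k-NN plugin's mapping keys, and the index would also need the `index.knn: true` setting.

```java
// Hypothetical helper, not part of the PR: builds a mapping body as a JSON string.
final class MixedVectorMapping {
    // engine is expected to be "faiss" or "lucene"; spaceType "l2" or "cosinesimil".
    static String mapping(String engine, String spaceType, int dimension) {
        return "{\"properties\":{"
            // Vector field; documents that omit it are the "non-vector" documents.
            + "\"test_vector\":{\"type\":\"knn_vector\",\"dimension\":" + dimension
            + ",\"method\":{\"name\":\"hnsw\",\"engine\":\"" + engine + "\",\"space_type\":\"" + spaceType + "\","
            // Assumed HNSW parameters, for illustration only.
            + "\"parameters\":{\"m\":16,\"ef_construction\":128}}},"
            // Categorical field used by term filters and terms aggregations.
            + "\"category\":{\"type\":\"keyword\"}}}";
    }
}
```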

Related Issues

Implements #2284

Check List

  • New functionality includes testing
  • New functionality has been documented
  • API changes companion pull request (N/A - test-only changes)
  • Commits are signed per the DCO using --signoff
  • CHANGELOG.md updated with test additions
  • Public documentation issue/PR (N/A - internal testing improvements)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

@night-owl-1709 night-owl-1709 force-pushed the mixedvectorITFIx branch 2 times, most recently from c2fced7 to 2dc3ebc Compare August 5, 2025 13:02
@night-owl-1709 night-owl-1709 changed the title from "Issue #2284: Added IT and BWC tests for mixed vector documents" to "Issue #2284: Added Integration Tests for mixed vector documents" Aug 5, 2025
Member

@kotwanikunal kotwanikunal left a comment

LGTM.

navneet1v previously approved these changes Aug 7, 2025
@navneet1v
Collaborator

@night-owl-1709 can you please resolve the conflicts

@night-owl-1709
Contributor Author

> @night-owl-1709 can you please resolve the conflicts

@navneet1v done

@night-owl-1709 night-owl-1709 force-pushed the mixedvectorITFIx branch 2 times, most recently from 2d6344e to 9b0a3b0 Compare August 12, 2025 04:37
@night-owl-1709
Contributor Author

@navneet1v addressed your comment

Collaborator

@shatejas shatejas left a comment

Looks good overall

assertTrue("Should return search results for " + indexName, response.contains(HITS_PATTERN));

// Verify non-vector document (doc 10) is not in k-NN results
assertFalse("Non-vector document should not appear in k-NN results for " + indexName, response.contains("\"_id\":\"10\""));
Collaborator

Checking just one document is a bit of a weak assert. You might want to collect the IDs in a set and assert that the difference is always empty.
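
One way the set-based check could look, as a minimal sketch rather than the change that was actually made: collect every `_id` from the raw response, subtract the IDs known to carry a vector, and assert that nothing is left. The class name, the method names, and the regex-based extraction are assumptions for illustration; a real test would more likely parse the response with the project's existing helpers.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR.
final class KnnResultIds {
    // Matches "_id":"<value>" in a raw JSON search response.
    private static final Pattern ID_PATTERN = Pattern.compile("\"_id\"\\s*:\\s*\"([^\"]+)\"");

    // Collects every document ID that appears in the response.
    static Set<String> extractIds(String response) {
        Set<String> ids = new HashSet<>();
        Matcher matcher = ID_PATTERN.matcher(response);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }

    // Returns the returned IDs that are NOT known vector documents.
    // An empty result means no non-vector document leaked into the k-NN hits.
    static Set<String> unexpectedIds(String response, Set<String> vectorDocIds) {
        Set<String> unexpected = extractIds(response);
        unexpected.removeAll(vectorDocIds);
        return unexpected;
    }
}
```

In the test this could replace the single-ID check with something like `assertTrue("Non-vector documents should not appear in k-NN results for " + indexName, KnnResultIds.unexpectedIds(response, vectorDocIds).isEmpty());`, where `vectorDocIds` is a set the test fills in while inserting vector documents.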

@navneet1v
Collaborator

@night-owl-1709 can we address the comments from @shatejas and also rebase the code?

);
}

private void addDocument(String indexName, int docId, XContentBuilder doc) throws IOException {
Collaborator

this should be in the base class

);
}

private Settings createIndexSettings(KNNEngine engine, SpaceType spaceType, boolean diskBased) {
Collaborator

See if this is already in the base class or not.

4 participants