- 
                Notifications
    
You must be signed in to change notification settings  - Fork 25.6k
 
Adding base64 indexing for vector values #137072
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Adding base64 indexing for vector values #137072
Conversation
| 
           Pinging @elastic/es-search-relevance (Team:Search Relevance)  | 
    
| 
           Hi @benwtrent, I've created a changelog YAML for you.  | 
    
| 
           I understand that lucene API is little endian and that's the reason little endian has been chosen to represent the float array. On the other hand Elasticsearch API's are big endian (think on BigArrays) so I would be more incline to use big endianness here as this is an Elasticsearch API and in addition we could read those bytes directly to BigArrays.  | 
    
| 
           Ah, I need to reconfigure value fetching here, users could provide a mix of arrays and base64 strings into the same set of docs and value fetching should work for all of it.  | 
    
| protected Object parseSourceValue(Object value) { | ||
| if (value.equals("")) { | ||
| return null; | ||
| public List<Object> fetchValues(Source source, int doc, List<Object> ignoredValues) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlosdelest what do you think of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's ok as we need to retrieve both Strings and numeric arrays from source - something I did not do on the previous iteration. Makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carlosdelest are there any tests where ESQL indexes docs and then fetches the values? If so, I would like to add some base64 values for vectors and ensure when fetched they are always transformed to arrays (as that is all ESQL supports for now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent sure! Check DenseVectorFieldTypeIT, docs are added here and then retrieved in other methods. It would be a good idea to add randomly using base 64 / hex strings / arrays 👍
        
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | case VALUE_STRING -> parseHexEncodedVector(context, dimChecker, similarity); | ||
| case VALUE_STRING -> { | ||
| String s = context.parser().text(); | ||
| if (s.length() == dims * 2 && isMaybeHexString(s)) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth doing the check on each character here? Maybe just try to parse it as hex straight away if its the right length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thecoop 🤔 hmm, likely that is good enough, especially since I already wrap with a try{}catch(...), this was part of an earlier iteration where I wasn't retrying on failure.
        
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Adding tests to ES|QL would be great to ensure the source value fetcher works as intended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
A question on VALUE_EMBEDDED_OBJECT and how it is tested
        
          
                server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | @Name("similarity") DenseVectorFieldMapper.VectorSimilarity similarity, | ||
| @Name("index") boolean index, | ||
| @Name("synthetic") boolean synthetic | ||
| @Name("synthetic") VectorSourceOptions sourceOptions | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit
| @Name("synthetic") VectorSourceOptions sourceOptions | |
| @Name("sourceOptions") VectorSourceOptions sourceOptions | 
| } | ||
| docs[i] = prepareIndex("test").setId("" + i).setSource("id", String.valueOf(i), "vector", vector); | ||
| Object vectorToIndex; | ||
| if (randomBoolean()) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for adding this
| case String s -> values.add(s); | ||
| default -> ignoredValues.add(sourceValue); | ||
| } | ||
| } catch (Exception e) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this catch is catching - the try is just adding to a collection, which can't really fail?
| yield decodedVector.length; | ||
| String v = context.parser().text(); | ||
| // Base64 is always divisible by 4, so if it's not, assume hex | ||
| if (v.length() % 4 != 0 || isMaybeHexString(v)) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to keep the Maybe check here, or just try to parse it directly below?
        
          
                ...n/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...n/esql/src/internalClusterTest/java/org/elasticsearch/xpack/esql/DenseVectorFieldTypeIT.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      …rent/elasticsearch into add-base64-encoded-float32-support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you Ben!
| values.add(NumberFieldMapper.NumberType.FLOAT.parse(o, false)); | ||
| } | ||
| } else if (sourceValue instanceof String s) { | ||
| if ((element.elementType() == BYTE_ELEMENT.elementType() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
element.elementType() == ElementType.BYTE || element.elementType() == ElementType.BIT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a nit, otherwise LGTM
This adds support for indexing vectors via base64. Parsing floating point arrays in JSON is...not cheap. So, if we encode the bytes in a string, then we improve throughput.
Example python transforming a file (thank you copilot...)
I benchmarked locally with
random_vectortrack indexing toflatindex, and here are the highlights: