@benwtrent
Member

@benwtrent benwtrent commented Oct 23, 2025

This adds support for indexing vectors via base64. Parsing floating point arrays in JSON is...not cheap. So, if we encode the bytes in a string, then we improve throughput.

Example python transforming a file (thank you copilot...)

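import base64

import numpy as np

def base64_vector(dims):
    # random float32 vector with `dims` components
    vec = np.random.rand(dims).astype(np.float32)
    # switch from the host's default little endian to big endian
    # (byteswap() only does this on a little-endian host)
    byte_array = vec.byteswap().tobytes()
    # encode as base64
    return base64.b64encode(byte_array).decode("utf-8")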
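
For reference, here is a minimal sketch of the reverse direction: decoding such a string back into floats. It assumes big-endian float32 values, as in the snippet above; `decode_base64_vector` is a hypothetical helper name and is not part of this PR.

```python
import base64
import struct


def decode_base64_vector(encoded: str) -> list[float]:
    """Hypothetical helper: decode base64 text holding big-endian float32 values."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4 != 0:
        raise ValueError("byte length is not a multiple of 4 (float32 width)")
    count = len(raw) // 4
    # ">" selects big-endian, "f" is a 4-byte float; the byte order is an assumption
    return list(struct.unpack(f">{count}f", raw))


# Round trip with values that are exactly representable in float32
values = [0.5, -1.25, 3.0]
encoded = base64.b64encode(struct.pack(">3f", *values)).decode("utf-8")
assert decode_base64_vector(encoded) == values
```

With NumPy, `np.frombuffer(raw, dtype=">f4")` should give the same result without the explicit `struct` format string.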

I benchmarked locally with random_vector track indexing to flat index, and here are the highlights:

|                                                        Metric |                Task |    Baseline |    Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|--------------------:|------------:|-------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                     |    2.74863  |     0.393017 |   -2.35562 |    min |  -85.70% |
|                                                Min Throughput |     random-indexing |  967.886    |  1515.79     |  547.903   | docs/s |  +56.61% |
|                                               Mean Throughput |     random-indexing | 1518.27     | 10175        | 8656.71    | docs/s | +570.17% |
|                                             Median Throughput |     random-indexing | 1531.12     | 10592.2      | 9061.07    | docs/s | +591.80% |
|                                                Max Throughput |     random-indexing | 1538.79     | 11064.7      | 9525.94    | docs/s | +619.05% |
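
The Diff % column reads as the relative change against the baseline. A small sketch that recomputes it from the table values, plus the plain speedup ratio, is below. The inputs are the rounded numbers shown above, so the last digit can differ slightly from the table; `diff_percent` is a name made up for this illustration.

```python
def diff_percent(baseline: float, contender: float) -> float:
    """Relative change of contender versus baseline, in percent."""
    return (contender - baseline) / baseline * 100


# (baseline, contender) pairs copied from the table above
rows = {
    "Cumulative indexing time (min)": (2.74863, 0.393017),
    "Mean Throughput (docs/s)": (1518.27, 10175.0),
    "Median Throughput (docs/s)": (1531.12, 10592.2),
}

for name, (baseline, contender) in rows.items():
    ratio = contender / baseline
    print(f"{name}: {diff_percent(baseline, contender):+.2f}% ({ratio:.2f}x baseline)")
```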

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance label Oct 23, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@iverase
Contributor

iverase commented Oct 24, 2025

I understand that the Lucene API is little endian, and that is why little endian was chosen to represent the float array. On the other hand, Elasticsearch APIs are big endian (think of BigArrays), so I would be more inclined to use big endian here: this is an Elasticsearch API, and in addition we could read those bytes directly into BigArrays.
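
To make the byte-order question concrete, here is a short sketch of how the same float32 value is laid out in each order. It only illustrates endianness in general and assumes nothing about which order the final API uses.

```python
import struct

value = 1.0  # IEEE 754 float32 bit pattern 0x3F800000

big = struct.pack(">f", value)     # most significant byte first
little = struct.pack("<f", value)  # least significant byte first

print(big.hex())     # 3f800000
print(little.hex())  # 0000803f

# Decoding with the wrong byte order does not fail, it silently yields another number.
(misread,) = struct.unpack("<f", big)
assert misread != value
```

Whichever order is chosen, the client encoding the vector and the server decoding it have to agree, since a mismatch produces wrong values rather than an error.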

@benwtrent
Member Author

Ah, I need to reconfigure value fetching here: users could provide a mix of arrays and base64 strings across the same set of docs, and value fetching should work for all of it.
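
A rough sketch of the idea, in Python rather than the Java in this PR: whatever form a doc's source holds, the fetched value is normalized to a list of floats. `to_float_list` is a hypothetical name, and big-endian float32 is assumed for the base64 form.

```python
import base64
import struct


def to_float_list(source_value):
    """Hypothetical normalizer: accept a numeric array or a base64 string."""
    if isinstance(source_value, str):
        raw = base64.b64decode(source_value, validate=True)
        # assumes big-endian float32; struct.error is raised on a bad byte length
        return list(struct.unpack(f">{len(raw) // 4}f", raw))
    return [float(v) for v in source_value]


# Two docs holding the same vector in different representations
docs = [
    {"vector": [0.5, 1.5]},
    {"vector": base64.b64encode(struct.pack(">2f", 0.5, 1.5)).decode("utf-8")},
]
fetched = [to_float_list(doc["vector"]) for doc in docs]
assert fetched[0] == fetched[1] == [0.5, 1.5]
```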

Comment on lines -2710 to +2881
protected Object parseSourceValue(Object value) {
if (value.equals("")) {
return null;
public List<Object> fetchValues(Source source, int doc, List<Object> ignoredValues) {
Member Author

@carlosdelest what do you think of this?

Member

I think it's OK, as we need to retrieve both strings and numeric arrays from source, something I did not do in the previous iteration. Makes sense to me.

Member Author

@carlosdelest are there any tests where ESQL indexes docs and then fetches the values? If so, I would like to add some base64 values for vectors and ensure when fetched they are always transformed to arrays (as that is all ESQL supports for now).

Member

@carlosdelest carlosdelest Oct 31, 2025

@benwtrent sure! Check DenseVectorFieldTypeIT: docs are added here and then retrieved in other methods. It would be a good idea to randomly pick between base64 strings, hex strings, and arrays when adding them 👍

case VALUE_STRING -> parseHexEncodedVector(context, dimChecker, similarity);
case VALUE_STRING -> {
String s = context.parser().text();
if (s.length() == dims * 2 && isMaybeHexString(s)) {
Member

Is it worth doing the check on each character here? Maybe just try to parse it as hex straight away if it's the right length?

Member Author

@thecoop 🤔 hmm, that is likely good enough, especially since I already wrap with a try{}catch(...). The check was part of an earlier iteration where I wasn't retrying on failure.
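
The "try hex when the length fits, otherwise retry as base64" approach discussed here can be sketched as follows. This is a Python illustration for byte vectors only, not the Java in this PR; `parse_byte_vector` and its error messages are made up for the example.

```python
import base64
import binascii


def parse_byte_vector(text: str, dims: int) -> bytes:
    """Hypothetical helper: decode a byte vector given as hex or base64 text."""
    if len(text) == dims * 2:
        try:
            decoded = bytes.fromhex(text)
            if len(decoded) == dims:
                return decoded
        except ValueError:
            pass  # not hex after all, retry as base64 below
    try:
        decoded = base64.b64decode(text, validate=True)
    except binascii.Error as err:
        raise ValueError(f"neither hex nor base64: {text!r}") from err
    if len(decoded) != dims:
        raise ValueError(f"expected {dims} bytes, got {len(decoded)}")
    return decoded


vector = bytes([10, 11, 12, 255])
assert parse_byte_vector(vector.hex(), 4) == vector
# Same length as the hex form (8 chars) but not hex, so the fallback path runs.
assert parse_byte_vector(base64.b64encode(vector).decode("ascii"), 4) == vector
```

One caveat with this design: a string made only of hex digits that also has a valid base64 length is ambiguous, and in this sketch hex wins.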

Member

@carlosdelest carlosdelest left a comment

LGTM 👍

Adding tests to ES|QL would be great to ensure the source value fetcher works as intended.

Member

@carlosdelest carlosdelest left a comment

LGTM 👍

A question on VALUE_EMBEDDED_OBJECT and how it is tested

@Name("similarity") DenseVectorFieldMapper.VectorSimilarity similarity,
@Name("index") boolean index,
@Name("synthetic") boolean synthetic
@Name("synthetic") VectorSourceOptions sourceOptions
Member

Nit

Suggested change
- @Name("synthetic") VectorSourceOptions sourceOptions
+ @Name("sourceOptions") VectorSourceOptions sourceOptions

}
docs[i] = prepareIndex("test").setId("" + i).setSource("id", String.valueOf(i), "vector", vector);
Object vectorToIndex;
if (randomBoolean()) {
Member

Nice! Thanks for adding this

case String s -> values.add(s);
default -> ignoredValues.add(sourceValue);
}
} catch (Exception e) {
Member

I'm not sure what this catch is catching - the try is just adding to a collection, which can't really fail?

yield decodedVector.length;
String v = context.parser().text();
// Base64 is always divisible by 4, so if it's not, assume hex
if (v.length() % 4 != 0 || isMaybeHexString(v)) {
Member

Want to keep the Maybe check here, or just try to parse it directly below?
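
For context on the length test in the diff above: padded base64 emits 4 characters for every 3 input bytes, rounding up, so its length is always a multiple of 4. A quick sketch checking this for float32 vectors follows; `padded_base64_length` is a hypothetical helper, and the check assumes clients do not strip the `=` padding.

```python
import base64


def padded_base64_length(num_bytes: int) -> int:
    """Length of standard padded base64 text for num_bytes of input."""
    return 4 * ((num_bytes + 2) // 3)


for dims in (1, 2, 3, 128, 768):
    num_bytes = 4 * dims  # 4 bytes per float32 component
    encoded = base64.b64encode(bytes(num_bytes))
    assert len(encoded) == padded_base64_length(num_bytes)
    assert len(encoded) % 4 == 0
```

The converse does not hold: a hex string for a byte vector with an even number of dimensions has length `2 * dims`, which is also divisible by 4, so length alone cannot tell the two encodings apart.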

@benwtrent benwtrent requested review from iverase and thecoop November 3, 2025 15:20
Contributor

@iverase iverase left a comment

LGTM. Thank you Ben!

values.add(NumberFieldMapper.NumberType.FLOAT.parse(o, false));
}
} else if (sourceValue instanceof String s) {
if ((element.elementType() == BYTE_ELEMENT.elementType()
Member

element.elementType() == ElementType.BYTE || element.elementType() == ElementType.BIT

Member

@thecoop thecoop left a comment

a nit, otherwise LGTM

@benwtrent benwtrent added the auto-merge-without-approval label Nov 3, 2025

Labels

auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
>enhancement
:Search Relevance/Vectors (Vector search)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
v9.3.0

5 participants