-
Notifications
You must be signed in to change notification settings - Fork 25.2k
ES|QL dense vector field type support #126456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ES|QL dense vector field type support #126456
Conversation
…ctor_support # Conflicts: # x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/type/DataType.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/ResponseValueUtils.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/LocalExecutionPlanner.java
…multivalued fields
@@ -504,6 +506,80 @@ public String toString() { | |||
} | |||
} | |||
|
|||
public static class DenseVectorBlockLoader extends DocValuesBlockLoader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a BlockLoader for dense vectors, that uses FloatVectorValues
to retrieve indexed vector data.
@Override | ||
public BlockLoader blockLoader(MappedFieldType.BlockLoaderContext blContext) { | ||
if (elementType != ElementType.FLOAT) { | ||
throw new UnsupportedOperationException("Only float dense vectors are supported for now"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can work on this next, creating specific BlockLoaders
for Byte and Bit field types.
@@ -145,6 +146,10 @@ private static void assertMetadata( | |||
// Type.asType translates all bytes references into keywords | |||
continue; | |||
} | |||
if (blockType == Type.DOUBLE && expectedType == DENSE_VECTOR) { | |||
// DENSE_VECTOR is internally represented as a double block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could potentially change when we support byte and bit element types - we could create the appropriate blocks.
@@ -63,18 +63,7 @@ | |||
"type" : "keyword" | |||
}, | |||
"salary_change": { | |||
"type": "float", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing CSV loading made this a problem, as there were parsing exceptions when trying to index float numeric data into integers. I didn't see a convenient way out of this and decided to remove as this field is not being tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can keep this unchanged, that will be great, this is a good example of nested fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just changed for the mapping-default-incompatible
mapping, which was created to test some incompatible field mappings that did not include subfields. I'll try to give this another shot but it will require changes to the CSV loader or the dataset 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just changed for the
mapping-default-incompatible
mapping, which was created to test some incompatible field mappings that did not include subfields. I'll try to give this another shot but it will require changes to the CSV loader or the dataset 😢
Is it easier if we make another copy of employees's schema and data for dense_vector
related tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it easier if we make another copy of employees's schema and data for dense_vector related tests?
The problem is that changing how the CSV tests load data impacted this dataset. Before this change, multivalues were being uploaded as arrays of strings, which is something we don't want to do for dense_vectors as that is not a format supported on the DenseVectorFieldMapper
.
It seemed like too much work to change the actual dataset when that particular field is not actually used in the tests.
I'm open to other solutions here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fang-xing-esql are these fields used in other tests somehow?
I see this was actually added by @carlosdelest a while back #117555.
The employees_incompatible
index that is set with this mapping is only used in match function/operator tests.
We don't modify any of those tests here, so this looks like a safe change to me.
…support' into feature/esql_dense_vector_support
@ioanatia @fang-xing-esql I was able to use As Ioana noted, we're not publicly exposing |
…support' into feature/esql_dense_vector_support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to see FloatBlocks being used, but they have too much flexibility - all different dims per position and also nulls. Maybe we just need an easy way/utility to assert their correct shape?
.../esql/compute/src/main/java/org/elasticsearch/compute/lucene/ValuesSourceReaderOperator.java
Outdated
Show resolved
Hide resolved
.../esql/compute/src/main/java/org/elasticsearch/compute/lucene/ValuesSourceReaderOperator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's an outstanding question from @ChrisHegarty that would be good to address but otherwise LGTM
…ctor_support # Conflicts: # x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
@ChrisHegarty , I renamed some methods to make clear that we're creating dense vectors instead of floats, and added some checks for them, on c951ee7. I had to take some back on ba0a6b9, as I can't easily add a new float block builder given the sealed structure of builders. We may have to add a new Block type for LMKWYT |
@@ -0,0 +1,3 @@ | |||
id:l, vector:dense_vector | |||
0, [1.0, 2.0, 3.0] | |||
1, [4.0, 5.0, 6.0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be very nitpicky - but can we add another dense_vector
value that does not have ordered values?
this might be why we did not caught #126456 (comment) during tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mvOrdered does not impact retrieval as it's used as an optimization in some cases. But adding unordered data uncovered a small fix that needed to be done to support multivalued style fields: 4b2126e
.field("type", "dense_vector") | ||
.field("index", index); | ||
if (index) { | ||
mapping.field("similarity", "l2_norm"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it possible to randomize the similarity option? otherwise we could leave this completely out, since it optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with that is some similarities normalize the vector values, so what we retrieve back is not what was stored, thus making comparisons difficult. I think we're ok with keeping this simple.
|
||
try (var resp = run(query)) { | ||
assertColumnNames(resp.columns(), List.of("id", "vector")); | ||
assertColumnTypes(resp.columns(), List.of("integer", "dense_vector")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we missing some assertions here for the values? to at least check that they are not nulls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's done in other tests - I started with this test to just check that the field types are retrieved correctly.
|
||
@Override | ||
public String toString() { | ||
return "BlockSourceReader.Floats"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "BlockSourceReader.DenseVectors"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, afe79f7
|
||
@Override | ||
protected String name() { | ||
return "Floats"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again - just checking if this the value we want to return given that the class is called DenseVectorBlockLoader
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I renamed this multiple times but didn't follow on this method. Thanks! afe79f7
…ctor_support # Conflicts: # server/src/main/java/org/elasticsearch/index/mapper/vectors/DenseVectorFieldMapper.java # x-pack/plugin/esql/qa/testFixtures/src/main/java/org/elasticsearch/xpack/esql/CsvTestsDataLoader.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java
Closing this PR, as the final approach will imply multiple field types for the different dense_vector element types. |
…ctor_support # Conflicts: # x-pack/plugin/build.gradle # x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/BlockUtils.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/action/EsqlCapabilities.java # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/planner/PlannerUtils.java
…support' into feature/esql_dense_vector_support
Reopening this, as we've decided to go for a single dense_vector field type. We can reevaluate the need for multiple dense_vector element types later. |
@ioanatia @ChrisHegarty @fang-xing-esql I intend to merge this PR on Monday to provide initial support for dense_vector float element types. Please comment if you have any issue with that. Next steps will be to work on KNN query and similarity functions, and eventually provide support for byte and bit element types. |
Support
dense_vector
field type. This is the first step to allowkNN
queries and havingdense_vector
as a first class citizen in ES|QLThis allows a mapping that has
dense_vector
like the following:To be retrieved via ES|QL:
For now, just
float
element types are allowed. There will be a similar work in order to allow forbyte
andbit
element types, but I wanted to review this implementation first to ensure it's in line with what we need.Both indexed / not indexed types and
synthetic
source is supported.Support for CSV tests has been added. For now CSV tests are simple, we can expand on these and also support additional operations on
dense_vector
field types in subsequent PRs. An integration test has been added to test extensively on different index options and doc storage structure.dense_vector
field type is under a feature flag, as this will require follow up work.