-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Update sparse_vector field mapping to include default setting for token pruning #126739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Update sparse_vector field mapping to include default setting for token pruning #126739
Conversation
Pinging @elastic/search-eng (Team:SearchOrg) |
Pinging @elastic/search-relevance (Team:Search - Relevance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review, nice start! Will take me a bit to work through this as it's a hefty PR!
- do: | ||
indices.delete: | ||
index: ["test1", "test2"] | ||
ignore: 404 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies if this has been covered already, but why do we need to explicitly delete the indices? I thought this was handled automatically between YAML test suites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should - but there's a number of tests that try to create indices with the same names, and we've found that it can be flaky sometimes so, best to ensure their removal with each individual test.
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION = | ||
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we make this package private like the other index versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in the SparseVectorQueryBuilder to ensure we use the same index version... we could add the same variable there, but, I think this would be more consistent.
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
} | ||
|
||
public static Boolean parseIndexOptionsPruneValue(Map<String, Object> indexOptionsMap) { | ||
Object shouldPrune = indexOptionsMap.remove(IndexOptions.PRUNE_FIELD_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I'm not a fan of mutating the passed-in map. We accept it elsewhere because it's baked in at this point, but IMO we shouldn't do it for new implementations.
} | ||
|
||
public static TokenPruningConfig parseIndexOptionsPruningConfig(Boolean prune, Map<String, Object> indexOptionsMap) { | ||
Object pruningConfiguration = indexOptionsMap.remove(IndexOptions.PRUNING_CONFIG_FIELD_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
Hi @markjhoy, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got further in my partial review. Found some bugs and some areas where we can simplify.
pr: 126739 | ||
summary: Update `sparse_vector` field mapping to include default setting for token | ||
pruning | ||
area: Relevance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mapping may be a better area for this PR
} | ||
|
||
if (prune == null || prune == false) { | ||
throw new MapperParsingException("[index_options] field [pruning_config] should only be set if [prune] is set to true"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: better to use variables to build the string ( i.e PRUNE_FIELD_NAME
, PRUNING_CONFIG_FIELD_NAME
)
(b, n, v) -> { | ||
if (v != null) { | ||
b.field(n, v); | ||
} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we can simplify this to XContentBuilder::field
public static TokenPruningConfig parseFromMap(Map<String, Object> pruningConfigMap) { | ||
Object mappedTokensFreqRatioThreshold = pruningConfigMap.remove(TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName()); | ||
Object mappedTokensWeightThreshold = pruningConfigMap.remove(TOKENS_WEIGHT_THRESHOLD.getPreferredName()); | ||
Object mappedOnlyScorePrunedTokens = pruningConfigMap.remove(ONLY_SCORE_PRUNED_TOKENS_FIELD.getPreferredName()); | ||
|
||
if (pruningConfigMap.isEmpty() == false) { | ||
throw new MapperParsingException("[" + PRUNING_CONFIG_FIELD + "] has unknown fields"); | ||
} | ||
|
||
Float tokensFreqRatioThreshold = parseTokensFreqRatioThreshold(mappedTokensFreqRatioThreshold); | ||
Float tokensWeightThreshold = parseTokensWeightThreshold(mappedTokensWeightThreshold); | ||
boolean onlyScorePrunedTokens = mappedOnlyScorePrunedTokens != null ? parseScorePrunedTokens(mappedOnlyScorePrunedTokens) : false; | ||
|
||
if (tokensFreqRatioThreshold != null || tokensWeightThreshold != null) { | ||
return new TokenPruningConfig(tokensFreqRatioThreshold, tokensWeightThreshold, onlyScorePrunedTokens); | ||
} | ||
|
||
return null; | ||
} | ||
|
||
private static Float parseFloatNumberFromObject(Object numberObject, String fieldName, String exceptionDetails) { | ||
if (numberObject instanceof Integer intValue) { | ||
return (float) intValue; | ||
} else if (numberObject instanceof Float floatValue) { | ||
return floatValue; | ||
} else if (numberObject instanceof Double doubleValue) { | ||
return ((Double) numberObject).floatValue(); | ||
} | ||
|
||
throw new MapperParsingException("[" + PRUNING_CONFIG_FIELD + "] field [" + fieldName + "]" + exceptionDetails); | ||
} | ||
|
||
private static Float parseTokensWeightThreshold(Object mappedTokensWeightThreshold) { | ||
if (mappedTokensWeightThreshold == null) { | ||
return DEFAULT_TOKENS_WEIGHT_THRESHOLD; | ||
} | ||
|
||
Float tokensWeightThreshold = parseFloatNumberFromObject( | ||
mappedTokensWeightThreshold, | ||
TOKENS_WEIGHT_THRESHOLD.getPreferredName(), | ||
"field should be a number between 0.0 and 1.0" | ||
); | ||
|
||
if (tokensWeightThreshold < MIN_TOKENS_WEIGHT_THRESHOLD || tokensWeightThreshold > MAX_TOKENS_WEIGHT_THRESHOLD) { | ||
throw new MapperParsingException( | ||
"[" | ||
+ PRUNING_CONFIG_FIELD | ||
+ "] field [" | ||
+ TOKENS_WEIGHT_THRESHOLD.getPreferredName() | ||
+ "] field should be a number between 0.0 and 1.0" | ||
); | ||
} | ||
return tokensWeightThreshold; | ||
} | ||
|
||
private static Float parseTokensFreqRatioThreshold(Object mappedTokensFreqRatioThreshold) { | ||
if (mappedTokensFreqRatioThreshold == null) { | ||
return DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD; | ||
} | ||
|
||
Float tokensFreqRatioThreshold = parseFloatNumberFromObject( | ||
mappedTokensFreqRatioThreshold, | ||
TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName(), | ||
"field should be a number between 1 and 100" | ||
); | ||
|
||
if (tokensFreqRatioThreshold < MIN_TOKENS_FREQ_RATIO_THRESHOLD || tokensFreqRatioThreshold > MAX_TOKENS_FREQ_RATIO_THRESHOLD) { | ||
throw new MapperParsingException( | ||
"[" | ||
+ PRUNING_CONFIG_FIELD | ||
+ "] field [" | ||
+ TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName() | ||
+ "] field should be a number between 1 and 100" | ||
); | ||
} | ||
|
||
return tokensFreqRatioThreshold; | ||
} | ||
|
||
private static boolean parseScorePrunedTokens(Object mappedScorePrunedTokens) { | ||
if (mappedScorePrunedTokens == null) { | ||
return false; | ||
} | ||
|
||
if (mappedScorePrunedTokens instanceof Boolean boolValue) { | ||
return boolValue; | ||
} | ||
|
||
throw new MapperParsingException( | ||
"[" + PRUNING_CONFIG_FIELD + "] field [" + ONLY_SCORE_PRUNED_TOKENS_FIELD.getPreferredName() + "] field should be true or false" | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of this logic seems to be duplicating that provided by ConstructingObjectParser
. I suggest we refactor this to use that instead. See here for an example:
Lines 113 to 129 in e7d4a28
static MinimalServiceSettings parseModelSettingsFromMap(Object node) { | |
if (node == null) { | |
return null; | |
} | |
try { | |
Map<String, Object> map = XContentMapValues.nodeMapValue(node, MODEL_SETTINGS_FIELD); | |
XContentParser parser = new MapXContentParser( | |
NamedXContentRegistry.EMPTY, | |
DeprecationHandler.IGNORE_DEPRECATIONS, | |
map, | |
XContentType.JSON | |
); | |
return MinimalServiceSettings.parse(parser); | |
} catch (Exception exc) { | |
throw new ElasticsearchException(exc); | |
} | |
} |
assertEquals(2, fields.size()); | ||
assertThat(fields.get(0), Matchers.instanceOf(XFeatureField.class)); | ||
XFeatureField featureField1 = null; | ||
XFeatureField featureField2 = null; | ||
for (IndexableField field : fields) { | ||
if (field.stringValue().equals("ten")) { | ||
featureField1 = (XFeatureField) field; | ||
} else if (field.stringValue().equals("twenty")) { | ||
featureField2 = (XFeatureField) field; | ||
} else { | ||
throw new UnsupportedOperationException(); | ||
} | ||
} | ||
|
||
int freq1 = getFrequency(featureField1.tokenStream(null, null)); | ||
int freq2 = getFrequency(featureField2.tokenStream(null, null)); | ||
assertTrue(freq1 < freq2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we factor this check into a helper method instead duplicating the code?
if (shouldPruneTokens == null && indexPruningSettings.prune != null && indexPruningSettings.prune) { | ||
doPruneTokens = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a bug here: what if the query sets shouldPruneTokens
to true
without setting the pruning config? By my reading of this method, doPruneTokens
will end up false
in this case.
private TokenPruningSet setPruningConfigFromIndexIfNeeded( | ||
Boolean queryPruneTokens, | ||
TokenPruningConfig queryPruningConfig, | ||
SparseVectorFieldMapper fieldMapper | ||
) { | ||
boolean doPruneTokens = false; | ||
TokenPruningConfig setTokenPruningConfig = queryPruningConfig; | ||
if (queryPruneTokens == null || queryPruningConfig == null) { | ||
IndexFieldPruningSettings indexPruningSettings = getIndexFieldPruningSettings(fieldMapper); | ||
if (shouldPruneTokens == null && indexPruningSettings.prune != null && indexPruningSettings.prune) { | ||
doPruneTokens = true; | ||
} | ||
if (setTokenPruningConfig == null && indexPruningSettings.pruningConfig != null) { | ||
setTokenPruningConfig = indexPruningSettings.pruningConfig; | ||
} | ||
} else { | ||
doPruneTokens = queryPruneTokens; | ||
} | ||
|
||
return new TokenPruningSet(doPruneTokens, setTokenPruningConfig); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is hard to read and contains at least one bug (see #126739 (comment)). For example, queryPruneTokens
is always set to shouldPruneTokens
, yet shouldPruneTokens
is also referenced in this method.
IMO we should refactor it to just return the pruning settings from the mapping and then handle query-time overrides where it is called.
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION = | ||
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT; | ||
|
||
private final SparseVectorFieldMapper.IndexOptions indexOptions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to store indexOptions
in the mapper, we're already storing it in the field type
builderParams(this, context) | ||
); | ||
} | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return this.indexOptions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can get the index options by calling fieldType().getIndexOptions()
instead
// if the query options for pruning are not set, | ||
// we need to check the index options for this field | ||
// and use those if set. | ||
SparseVectorFieldMapper sparseVectorFieldMapper = getSparseVectorFieldMapperForQueryRewrite(fieldName, queryRewriteContext); | ||
TokenPruningSet pruningOptions = setPruningConfigFromIndexIfNeeded( | ||
shouldPruneTokens, | ||
tokenPruningConfig, | ||
sparseVectorFieldMapper | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we parsing the mapping pruning options here as well as in doToQuery
? Unless I'm missing something, it makes sense to parse the mapping pruning options as late as possible, which is in doToQuery
.
Also, this logic is not going to act the way you think it will in a multi-node environment. Part of the query rewrite cycle happens on the coordinator node, which doesn't have access to the mappings. You're going to get an NPE here in that case. Another reason to remove this logic and do it only in doToQuery
, where you are guaranteed to have access to mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got through everything except the documentation, nice work! Some bug fixes, simplification, and more testing coverage is required, but the high-level functionality looks good 👍
public void testWeCorrectlyRewriteQueryIntoVectorsWithIndexOptions() { | ||
wrapTestSparseVectorIndexOptions((c) -> { | ||
SearchExecutionContext searchExecutionContext = createSearchExecutionContext(); | ||
|
||
TokenPruningConfig defaultTokenPruningConfig = new TokenPruningConfig(12, 0.6f, false); | ||
|
||
SparseVectorQueryBuilder queryBuilder = createTestQueryBuilder(null); | ||
QueryBuilder rewrittenQueryBuilder = rewriteAndFetch(queryBuilder, searchExecutionContext); | ||
assertTrue(rewrittenQueryBuilder instanceof SparseVectorQueryBuilder); | ||
assertEquals(queryBuilder.shouldPruneTokens(), ((SparseVectorQueryBuilder) rewrittenQueryBuilder).shouldPruneTokens()); | ||
assertNotNull(((SparseVectorQueryBuilder) rewrittenQueryBuilder).getQueryVectors()); | ||
}); | ||
} | ||
|
||
private void wrapTestSparseVectorIndexOptions(Consumer<Boolean> testMethod) { | ||
testWithSparseVectorFieldIndexOptions = true; | ||
try { | ||
testMethod.accept(true); | ||
} finally { | ||
testWithSparseVectorFieldIndexOptions = false; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than doing this test wrapping approach, I recommend randomizing the value of testWithSparseVectorFieldIndexOptions
and using doAssertLuceneQuery
to validate the results. See how we randomize inferenceResultType
in SemanticQueryBuilderTests
as an example.
private void addSparseVectorIndexOptionsMapping(MapperService mapperService) throws IOException { | ||
String addIndexOptionsTemplate = "{\"properties\":{\"" | ||
+ SPARSE_VECTOR_FIELD | ||
+ "\":{\"type\":\"sparse_vector\",\"index_options\"" | ||
+ ":{\"prune\":true,\"pruning_config\":{\"tokens_freq_ratio_threshold\"" | ||
+ ":12,\"tokens_weight_threshold\":0.6}}}}}"; | ||
mapperService.merge("_doc", new CompressedXContent(addIndexOptionsTemplate), MapperService.MergeReason.MAPPING_UPDATE); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we modify the string passed to mapperService.merge
in initializeAdditionalMappings
instead of duplicating the logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need more tests here to cover query-time overrides of token pruning behavior, including partial overrides (like if only prune: true
without a pruning config is specified in the query). We shouldn't use randomization to test this, however we need to implement it with the randomization of the test setup in mind. In other words, we should deterministically set the query-time overrides, but the expected result will be dependent on the random value of testWithSparseVectorFieldIndexOptions
.
|
||
private final SparseVectorFieldMapper.IndexOptions indexOptions; | ||
|
||
public static final NodeFeature SPARSE_VECTOR_INDEX_OPTIONS_FEATURE = new NodeFeature("sparse_vector_index_options_supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiniest of nits here, but we normally pseudo-namespace node features using .
. In this case, I think sparse_vector.index_options_supported
would work well, where sparse_vector.
is the pseudo-namespace.
embeddings: | ||
type: sparse_vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the embeddings
field in these tests?
Pinging @elastic/es-search-foundations (Team:Search Foundations) |
Updates the
SparseVectorFieldMapper
type to include index options for pruning tokens and associated configuration values.Before this update, token pruning for sparse vector types is only available via the query (see parameters for the sparse vector query ).
With this PR, by default, any new indices with a
sparse_vector
field type will by default have token pruning turned on.Example: