Skip to content

Update sparse_vector field mapping to include default setting for token pruning #126739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 74 commits into
base: main
Choose a base branch
from

Conversation

markjhoy
Copy link
Contributor

@markjhoy markjhoy commented Apr 12, 2025

Updates the SparseVectorFieldMapper type to include index options for pruning tokens and associated configuration values.

Before this update, token pruning for sparse vector types is only available via the query (see parameters for the sparse vector query ).

With this PR, by default, any new indices with a sparse_vector field type will by default have token pruning turned on.

Example:

{
  "properties": {
    "example_field": {
       "type": "sparse_vector",
        "index_options": {
          "prune": (boolean, default is `true`),
          "pruning_config": {
            "tokens_freq_ratio_threshold": (integer, range 1-100, default is 5),
            "tokens_weight_threshold": (double, range 0.0-1.0, default if 0.4)
          }
        }
     }
  }
}

@markjhoy markjhoy added Team:SearchOrg Meta label for the Search Org (Enterprise Search) :SearchOrg/Relevance Label for the Search (solution/org) Relevance team labels May 6, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, nice start! Will take me a bit to work through this as it's a hefty PR!

Comment on lines +6 to +9
- do:
indices.delete:
index: ["test1", "test2"]
ignore: 404
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies if this has been covered already, but why do we need to explicitly delete the indices? I thought this was handled automatically between YAML test suites.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should - but there's a number of tests that try to create indices with the same names, and we've found that it can be flaky sometimes so, best to ensure their removal with each individual test.

Comment on lines +76 to +77
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION =
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we make this package private like the other index versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's used in the SparseVectorQueryBuilder to ensure we use the same index version... we could add the same variable there, but, I think this would be more consistent.

}

public static Boolean parseIndexOptionsPruneValue(Map<String, Object> indexOptionsMap) {
Object shouldPrune = indexOptionsMap.remove(IndexOptions.PRUNE_FIELD_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I'm not a fan of mutating the passed-in map. We accept it elsewhere because it's baked in at this point, but IMO we shouldn't do it for new implementations.

}

public static TokenPruningConfig parseIndexOptionsPruningConfig(Boolean prune, Map<String, Object> indexOptionsMap) {
Object pruningConfiguration = indexOptionsMap.remove(IndexOptions.PRUNING_CONFIG_FIELD_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@elasticsearchmachine
Copy link
Collaborator

Hi @markjhoy, I've created a changelog YAML for you.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got further in my partial review. Found some bugs and some areas where we can simplify.

pr: 126739
summary: Update `sparse_vector` field mapping to include default setting for token
pruning
area: Relevance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping may be a better area for this PR

}

if (prune == null || prune == false) {
throw new MapperParsingException("[index_options] field [pruning_config] should only be set if [prune] is set to true");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: better to use variables to build the string ( i.e PRUNE_FIELD_NAME, PRUNING_CONFIG_FIELD_NAME)

Comment on lines +100 to +104
(b, n, v) -> {
if (v != null) {
b.field(n, v);
}
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think we can simplify this to XContentBuilder::field

Comment on lines +187 to +278
public static TokenPruningConfig parseFromMap(Map<String, Object> pruningConfigMap) {
Object mappedTokensFreqRatioThreshold = pruningConfigMap.remove(TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName());
Object mappedTokensWeightThreshold = pruningConfigMap.remove(TOKENS_WEIGHT_THRESHOLD.getPreferredName());
Object mappedOnlyScorePrunedTokens = pruningConfigMap.remove(ONLY_SCORE_PRUNED_TOKENS_FIELD.getPreferredName());

if (pruningConfigMap.isEmpty() == false) {
throw new MapperParsingException("[" + PRUNING_CONFIG_FIELD + "] has unknown fields");
}

Float tokensFreqRatioThreshold = parseTokensFreqRatioThreshold(mappedTokensFreqRatioThreshold);
Float tokensWeightThreshold = parseTokensWeightThreshold(mappedTokensWeightThreshold);
boolean onlyScorePrunedTokens = mappedOnlyScorePrunedTokens != null ? parseScorePrunedTokens(mappedOnlyScorePrunedTokens) : false;

if (tokensFreqRatioThreshold != null || tokensWeightThreshold != null) {
return new TokenPruningConfig(tokensFreqRatioThreshold, tokensWeightThreshold, onlyScorePrunedTokens);
}

return null;
}

private static Float parseFloatNumberFromObject(Object numberObject, String fieldName, String exceptionDetails) {
if (numberObject instanceof Integer intValue) {
return (float) intValue;
} else if (numberObject instanceof Float floatValue) {
return floatValue;
} else if (numberObject instanceof Double doubleValue) {
return ((Double) numberObject).floatValue();
}

throw new MapperParsingException("[" + PRUNING_CONFIG_FIELD + "] field [" + fieldName + "]" + exceptionDetails);
}

private static Float parseTokensWeightThreshold(Object mappedTokensWeightThreshold) {
if (mappedTokensWeightThreshold == null) {
return DEFAULT_TOKENS_WEIGHT_THRESHOLD;
}

Float tokensWeightThreshold = parseFloatNumberFromObject(
mappedTokensWeightThreshold,
TOKENS_WEIGHT_THRESHOLD.getPreferredName(),
"field should be a number between 0.0 and 1.0"
);

if (tokensWeightThreshold < MIN_TOKENS_WEIGHT_THRESHOLD || tokensWeightThreshold > MAX_TOKENS_WEIGHT_THRESHOLD) {
throw new MapperParsingException(
"["
+ PRUNING_CONFIG_FIELD
+ "] field ["
+ TOKENS_WEIGHT_THRESHOLD.getPreferredName()
+ "] field should be a number between 0.0 and 1.0"
);
}
return tokensWeightThreshold;
}

private static Float parseTokensFreqRatioThreshold(Object mappedTokensFreqRatioThreshold) {
if (mappedTokensFreqRatioThreshold == null) {
return DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD;
}

Float tokensFreqRatioThreshold = parseFloatNumberFromObject(
mappedTokensFreqRatioThreshold,
TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName(),
"field should be a number between 1 and 100"
);

if (tokensFreqRatioThreshold < MIN_TOKENS_FREQ_RATIO_THRESHOLD || tokensFreqRatioThreshold > MAX_TOKENS_FREQ_RATIO_THRESHOLD) {
throw new MapperParsingException(
"["
+ PRUNING_CONFIG_FIELD
+ "] field ["
+ TOKENS_FREQ_RATIO_THRESHOLD.getPreferredName()
+ "] field should be a number between 1 and 100"
);
}

return tokensFreqRatioThreshold;
}

private static boolean parseScorePrunedTokens(Object mappedScorePrunedTokens) {
if (mappedScorePrunedTokens == null) {
return false;
}

if (mappedScorePrunedTokens instanceof Boolean boolValue) {
return boolValue;
}

throw new MapperParsingException(
"[" + PRUNING_CONFIG_FIELD + "] field [" + ONLY_SCORE_PRUNED_TOKENS_FIELD.getPreferredName() + "] field should be true or false"
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of this logic seems to be duplicating that provided by ConstructingObjectParser. I suggest we refactor this to use that instead. See here for an example:

static MinimalServiceSettings parseModelSettingsFromMap(Object node) {
if (node == null) {
return null;
}
try {
Map<String, Object> map = XContentMapValues.nodeMapValue(node, MODEL_SETTINGS_FIELD);
XContentParser parser = new MapXContentParser(
NamedXContentRegistry.EMPTY,
DeprecationHandler.IGNORE_DEPRECATIONS,
map,
XContentType.JSON
);
return MinimalServiceSettings.parse(parser);
} catch (Exception exc) {
throw new ElasticsearchException(exc);
}
}

Comment on lines +159 to +175
assertEquals(2, fields.size());
assertThat(fields.get(0), Matchers.instanceOf(XFeatureField.class));
XFeatureField featureField1 = null;
XFeatureField featureField2 = null;
for (IndexableField field : fields) {
if (field.stringValue().equals("ten")) {
featureField1 = (XFeatureField) field;
} else if (field.stringValue().equals("twenty")) {
featureField2 = (XFeatureField) field;
} else {
throw new UnsupportedOperationException();
}
}

int freq1 = getFrequency(featureField1.tokenStream(null, null));
int freq2 = getFrequency(featureField2.tokenStream(null, null));
assertTrue(freq1 < freq2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we factor this check into a helper method instead duplicating the code?

Comment on lines +485 to +487
if (shouldPruneTokens == null && indexPruningSettings.prune != null && indexPruningSettings.prune) {
doPruneTokens = true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bug here: what if the query sets shouldPruneTokens to true without setting the pruning config? By my reading of this method, doPruneTokens will end up false in this case.

Comment on lines +476 to +496
private TokenPruningSet setPruningConfigFromIndexIfNeeded(
Boolean queryPruneTokens,
TokenPruningConfig queryPruningConfig,
SparseVectorFieldMapper fieldMapper
) {
boolean doPruneTokens = false;
TokenPruningConfig setTokenPruningConfig = queryPruningConfig;
if (queryPruneTokens == null || queryPruningConfig == null) {
IndexFieldPruningSettings indexPruningSettings = getIndexFieldPruningSettings(fieldMapper);
if (shouldPruneTokens == null && indexPruningSettings.prune != null && indexPruningSettings.prune) {
doPruneTokens = true;
}
if (setTokenPruningConfig == null && indexPruningSettings.pruningConfig != null) {
setTokenPruningConfig = indexPruningSettings.pruningConfig;
}
} else {
doPruneTokens = queryPruneTokens;
}

return new TokenPruningSet(doPruneTokens, setTokenPruningConfig);
}
Copy link
Contributor

@Mikep86 Mikep86 May 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is hard to read and contains at least one bug (see #126739 (comment)). For example, queryPruneTokens is always set to shouldPruneTokens, yet shouldPruneTokens is also referenced in this method.

IMO we should refactor it to just return the pruning settings from the mapping and then handle query-time overrides where it is called.

public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION =
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT;

private final SparseVectorFieldMapper.IndexOptions indexOptions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to store indexOptions in the mapper, we're already storing it in the field type

builderParams(this, context)
);
}
}

public IndexOptions getIndexOptions() {
return this.indexOptions;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can get the index options by calling fieldType().getIndexOptions() instead

Comment on lines +274 to +282
// if the query options for pruning are not set,
// we need to check the index options for this field
// and use those if set.
SparseVectorFieldMapper sparseVectorFieldMapper = getSparseVectorFieldMapperForQueryRewrite(fieldName, queryRewriteContext);
TokenPruningSet pruningOptions = setPruningConfigFromIndexIfNeeded(
shouldPruneTokens,
tokenPruningConfig,
sparseVectorFieldMapper
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we parsing the mapping pruning options here as well as in doToQuery? Unless I'm missing something, it makes sense to parse the mapping pruning options as late as possible, which is in doToQuery.

Also, this logic is not going to act the way you think it will in a multi-node environment. Part of the query rewrite cycle happens on the coordinator node, which doesn't have access to the mappings. You're going to get an NPE here in that case. Another reason to remove this logic and do it only in doToQuery, where you are guaranteed to have access to mappings.

Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got through everything except the documentation, nice work! Some bug fixes, simplification, and more testing coverage is required, but the high-level functionality looks good 👍

Comment on lines +360 to +381
public void testWeCorrectlyRewriteQueryIntoVectorsWithIndexOptions() {
wrapTestSparseVectorIndexOptions((c) -> {
SearchExecutionContext searchExecutionContext = createSearchExecutionContext();

TokenPruningConfig defaultTokenPruningConfig = new TokenPruningConfig(12, 0.6f, false);

SparseVectorQueryBuilder queryBuilder = createTestQueryBuilder(null);
QueryBuilder rewrittenQueryBuilder = rewriteAndFetch(queryBuilder, searchExecutionContext);
assertTrue(rewrittenQueryBuilder instanceof SparseVectorQueryBuilder);
assertEquals(queryBuilder.shouldPruneTokens(), ((SparseVectorQueryBuilder) rewrittenQueryBuilder).shouldPruneTokens());
assertNotNull(((SparseVectorQueryBuilder) rewrittenQueryBuilder).getQueryVectors());
});
}

private void wrapTestSparseVectorIndexOptions(Consumer<Boolean> testMethod) {
testWithSparseVectorFieldIndexOptions = true;
try {
testMethod.accept(true);
} finally {
testWithSparseVectorFieldIndexOptions = false;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than doing this test wrapping approach, I recommend randomizing the value of testWithSparseVectorFieldIndexOptions and using doAssertLuceneQuery to validate the results. See how we randomize inferenceResultType in SemanticQueryBuilderTests as an example.

Comment on lines +163 to +170
private void addSparseVectorIndexOptionsMapping(MapperService mapperService) throws IOException {
String addIndexOptionsTemplate = "{\"properties\":{\""
+ SPARSE_VECTOR_FIELD
+ "\":{\"type\":\"sparse_vector\",\"index_options\""
+ ":{\"prune\":true,\"pruning_config\":{\"tokens_freq_ratio_threshold\""
+ ":12,\"tokens_weight_threshold\":0.6}}}}}";
mapperService.merge("_doc", new CompressedXContent(addIndexOptionsTemplate), MapperService.MergeReason.MAPPING_UPDATE);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we modify the string passed to mapperService.merge in initializeAdditionalMappings instead of duplicating the logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need more tests here to cover query-time overrides of token pruning behavior, including partial overrides (like if only prune: true without a pruning config is specified in the query). We shouldn't use randomization to test this, however we need to implement it with the randomization of the test setup in mind. In other words, we should deterministically set the query-time overrides, but the expected result will be dependent on the random value of testWithSparseVectorFieldIndexOptions.


private final SparseVectorFieldMapper.IndexOptions indexOptions;

public static final NodeFeature SPARSE_VECTOR_INDEX_OPTIONS_FEATURE = new NodeFeature("sparse_vector_index_options_supported");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiniest of nits here, but we normally pseudo-namespace node features using .. In this case, I think sparse_vector.index_options_supported would work well, where sparse_vector. is the pseudo-namespace.

Comment on lines +229 to +230
embeddings:
type: sparse_vector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of the embeddings field in these tests?

@markjhoy markjhoy added the :Search Foundations/Mapping Index mappings, including merging and defining field types label May 7, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement needs:triage Requires assignment of a team area label :Search Foundations/Mapping Index mappings, including merging and defining field types :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:SearchOrg Meta label for the Search Org (Enterprise Search) v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants