-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Update sparse_vector field mapping to include default setting for token pruning #126739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
e02cd3a
e24ab76
f39b78a
eeebfd8
51aab0c
983ddf1
9545a0c
19fe72d
58f9909
5f8e7b9
d7d27ba
d342656
96096ba
eed88c6
6a6052a
f977ea8
f38a6f1
436183b
24438e3
501099d
21323e4
a282e27
9d5df84
846fcff
2b3299e
832fe45
e593f17
3086a4b
7a24703
7ceb12a
f9d44e5
af006d4
5dd4728
f3b4a98
7ddb77a
02868b1
3625a37
92db1c6
a022b5c
6a7f46c
e95033c
99c3700
bdfc9b8
20bcf20
1f0718d
e30a141
404e645
bdfcf5e
f27dfb8
283b563
65c5147
fc78b0f
15c5eb3
0b17c16
4b46300
08d51c9
ae34841
74b19ca
a341322
a47b915
4e681bd
6e50539
fcf682f
095bb28
0c8c095
5bb6561
e4d547a
6c5e253
29dae8a
31f9e6d
0b1c1d2
b48deea
7f60eca
d8f3c63
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
pr: 126739 | ||
summary: Update `sparse_vector` field mapping to include default setting for token | ||
pruning | ||
area: Relevance | ||
type: enhancement | ||
issues: [] |
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a reminder, that we'll have to open a PR for 8.19 to update the appropriate asciidoc files as well 👍 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,28 @@ PUT my-index | |
} | ||
``` | ||
|
||
Also, with optional `index_options` for pruning: | ||
|
||
```console | ||
PUT my-index | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"text.tokens": { | ||
"type": "sparse_vector", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if we should have two examples here - one "simple" example that creates the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did we decide not to have multiple examples here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had another example there... not sure what happened... maybe something got reverted... |
||
"index_options": { | ||
"prune": true, | ||
"pruning_config": { | ||
"tokens_freq_ratio_threshold": 5, | ||
"tokens_weight_threshold": 0.4 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER. | ||
|
||
## Parameters for `sparse_vector` fields [sparse-vectors-params] | ||
|
@@ -36,6 +58,28 @@ The following parameters are accepted by `sparse_vector` fields: | |
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). | ||
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source). | ||
|
||
index_options | ||
: (Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your `sparse_query` vector, Elasticsearch will use the default options configured for the field, if any. The available options for the index options are: | ||
|
||
Parameters for `index_options` are: | ||
|
||
`prune` | ||
: (Optional, boolean) [preview] Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true. | ||
|
||
`pruning_config` | ||
: (Optional, object) [preview] Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur. | ||
|
||
Parameters for `pruning_config` include: | ||
|
||
`tokens_freq_ratio_threshold` | ||
: (Optional, integer) [preview] Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must between 1 and 100. Default: `5`. | ||
|
||
`tokens_weight_threshold` | ||
: (Optional, float) [preview] Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`. | ||
|
||
::::{note} | ||
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2 that provided the most optimal results. | ||
:::: | ||
|
||
|
||
## Multi-value sparse vectors [index-multi-value-sparse-vectors] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,15 @@ | ||
|
||
--- | ||
teardown: | ||
# ensure indices are cleaned up after each test | ||
# mainly for the sparse vector tests | ||
- do: | ||
indices.delete: | ||
index: ["test1", "test2"] | ||
ignore: 404 | ||
Comment on lines
+6
to
+9
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Apologies if this has been covered already, but why do we need to explicitly delete the indices? I thought this was handled automatically between YAML test suites. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It should - but there's a number of tests that try to create indices with the same names, and we've found that it can be flaky sometimes so, best to ensure their removal with each individual test. |
||
- do: | ||
indices.refresh: { } | ||
|
||
--- | ||
"cluster stats test": | ||
- do: | ||
|
@@ -358,6 +370,7 @@ | |
- requires: | ||
cluster_features: [ "gte_v8.15.0" ] | ||
reason: "sparse vector stats added in 8.15" | ||
|
||
- do: | ||
indices.create: | ||
index: test1 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,9 @@ | |
import org.apache.lucene.util.BytesRef; | ||
import org.elasticsearch.common.logging.DeprecationCategory; | ||
import org.elasticsearch.common.lucene.Lucene; | ||
import org.elasticsearch.common.xcontent.support.XContentMapValues; | ||
import org.elasticsearch.core.Nullable; | ||
import org.elasticsearch.features.NodeFeature; | ||
import org.elasticsearch.index.IndexVersion; | ||
import org.elasticsearch.index.IndexVersions; | ||
import org.elasticsearch.index.analysis.NamedAnalyzer; | ||
|
@@ -31,13 +34,16 @@ | |
import org.elasticsearch.index.mapper.FieldMapper; | ||
import org.elasticsearch.index.mapper.MappedFieldType; | ||
import org.elasticsearch.index.mapper.MapperBuilderContext; | ||
import org.elasticsearch.index.mapper.MapperParsingException; | ||
import org.elasticsearch.index.mapper.MappingParserContext; | ||
import org.elasticsearch.index.mapper.SourceLoader; | ||
import org.elasticsearch.index.mapper.SourceValueFetcher; | ||
import org.elasticsearch.index.mapper.TextSearchInfo; | ||
import org.elasticsearch.index.mapper.ValueFetcher; | ||
import org.elasticsearch.index.query.SearchExecutionContext; | ||
import org.elasticsearch.search.fetch.StoredFieldsSpec; | ||
import org.elasticsearch.search.lookup.Source; | ||
import org.elasticsearch.xcontent.ToXContent; | ||
import org.elasticsearch.xcontent.XContentBuilder; | ||
import org.elasticsearch.xcontent.XContentParser.Token; | ||
|
||
|
@@ -46,6 +52,7 @@ | |
import java.util.LinkedHashMap; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
import java.util.stream.Stream; | ||
|
||
import static org.elasticsearch.index.query.AbstractQueryBuilder.DEFAULT_BOOST; | ||
|
@@ -57,6 +64,7 @@ | |
public class SparseVectorFieldMapper extends FieldMapper { | ||
|
||
public static final String CONTENT_TYPE = "sparse_vector"; | ||
public static final String SPARSE_VECTOR_INDEX_OPTIONS = "index_options"; | ||
|
||
static final String ERROR_MESSAGE_7X = "[sparse_vector] field type in old 7.x indices is allowed to " | ||
+ "contain [sparse_vector] fields, but they cannot be indexed or searched."; | ||
|
@@ -65,6 +73,12 @@ public class SparseVectorFieldMapper extends FieldMapper { | |
|
||
static final IndexVersion NEW_SPARSE_VECTOR_INDEX_VERSION = IndexVersions.NEW_SPARSE_VECTOR; | ||
static final IndexVersion SPARSE_VECTOR_IN_FIELD_NAMES_INDEX_VERSION = IndexVersions.SPARSE_VECTOR_IN_FIELD_NAMES_SUPPORT; | ||
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION = | ||
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT; | ||
Comment on lines
+76
to
+77
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: can we make this package private like the other index versions? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's used in the SparseVectorQueryBuilder to ensure we use the same index version... we could add the same variable there, but, I think this would be more consistent. |
||
|
||
private final SparseVectorFieldMapper.IndexOptions indexOptions; | ||
kderusso marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's no need to store |
||
|
||
public static final NodeFeature SPARSE_VECTOR_INDEX_OPTIONS_FEATURE = new NodeFeature("sparse_vector_index_options_supported"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Tiniest of nits here, but we normally pseudo-namespace node features using |
||
|
||
private static SparseVectorFieldMapper toType(FieldMapper in) { | ||
return (SparseVectorFieldMapper) in; | ||
|
@@ -73,9 +87,23 @@ private static SparseVectorFieldMapper toType(FieldMapper in) { | |
public static class Builder extends FieldMapper.Builder { | ||
private final Parameter<Boolean> stored = Parameter.storeParam(m -> toType(m).fieldType().isStored(), false); | ||
private final Parameter<Map<String, String>> meta = Parameter.metaParam(); | ||
private final Parameter<IndexOptions> indexOptions; | ||
|
||
public Builder(String name) { | ||
super(name); | ||
this.indexOptions = new Parameter<>( | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
SPARSE_VECTOR_INDEX_OPTIONS, | ||
true, | ||
() -> null, | ||
(n, c, o) -> parseIndexOptions(c, o), | ||
m -> toType(m).fieldType().indexOptions, | ||
(b, n, v) -> { | ||
if (v != null) { | ||
b.field(n, v); | ||
} | ||
}, | ||
Comment on lines
+100
to
+104
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: I think we can simplify this to |
||
Objects::toString | ||
); | ||
} | ||
|
||
public Builder setStored(boolean value) { | ||
|
@@ -85,19 +113,40 @@ public Builder setStored(boolean value) { | |
|
||
@Override | ||
protected Parameter<?>[] getParameters() { | ||
return new Parameter<?>[] { stored, meta }; | ||
return new Parameter<?>[] { stored, meta, indexOptions }; | ||
} | ||
|
||
@Override | ||
public SparseVectorFieldMapper build(MapperBuilderContext context) { | ||
return new SparseVectorFieldMapper( | ||
leafName(), | ||
new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue()), | ||
new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue(), indexOptions.getValue()), | ||
builderParams(this, context) | ||
); | ||
} | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return this.indexOptions; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can get the index options by calling |
||
} | ||
|
||
private static SparseVectorFieldMapper.IndexOptions parseIndexOptions(MappingParserContext context, Object propNode) { | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (propNode == null) { | ||
return null; | ||
} | ||
|
||
Map<String, Object> indexOptionsMap = XContentMapValues.nodeMapValue(propNode, SPARSE_VECTOR_INDEX_OPTIONS); | ||
|
||
Boolean prune = IndexOptions.parseIndexOptionsPruneValue(indexOptionsMap); | ||
TokenPruningConfig pruningConfig = IndexOptions.parseIndexOptionsPruningConfig(prune, indexOptionsMap); | ||
|
||
if (prune == null && pruningConfig == null) { | ||
kderusso marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return null; | ||
} | ||
|
||
return new SparseVectorFieldMapper.IndexOptions(prune, pruningConfig); | ||
} | ||
|
||
public static final TypeParser PARSER = new TypeParser((n, c) -> { | ||
if (c.indexVersionCreated().before(PREVIOUS_SPARSE_VECTOR_INDEX_VERSION)) { | ||
deprecationLogger.warn(DeprecationCategory.MAPPINGS, "sparse_vector", ERROR_MESSAGE_7X); | ||
|
@@ -109,9 +158,24 @@ public SparseVectorFieldMapper build(MapperBuilderContext context) { | |
}, notInMultiFields(CONTENT_TYPE)); | ||
|
||
public static final class SparseVectorFieldType extends MappedFieldType { | ||
private final IndexOptions indexOptions; | ||
|
||
public SparseVectorFieldType(String name, boolean isStored, Map<String, String> meta) { | ||
this(name, isStored, meta, null); | ||
} | ||
|
||
public SparseVectorFieldType( | ||
String name, | ||
boolean isStored, | ||
Map<String, String> meta, | ||
@Nullable SparseVectorFieldMapper.IndexOptions indexOptions | ||
) { | ||
super(name, true, isStored, false, TextSearchInfo.SIMPLE_MATCH_ONLY, meta); | ||
this.indexOptions = indexOptions; | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return indexOptions; | ||
} | ||
|
||
@Override | ||
|
@@ -159,6 +223,7 @@ private static String indexedValueForSearch(Object value) { | |
|
||
private SparseVectorFieldMapper(String simpleName, MappedFieldType mappedFieldType, BuilderParams builderParams) { | ||
super(simpleName, mappedFieldType, builderParams); | ||
this.indexOptions = ((SparseVectorFieldType) mappedFieldType).getIndexOptions(); | ||
} | ||
|
||
@Override | ||
|
@@ -364,4 +429,86 @@ public void reset() { | |
} | ||
} | ||
|
||
public static class IndexOptions implements ToXContent { | ||
public static final String PRUNE_FIELD_NAME = "prune"; | ||
public static final String PRUNING_CONFIG_FIELD_NAME = "pruning_config"; | ||
|
||
final Boolean prune; | ||
final TokenPruningConfig pruningConfig; | ||
|
||
IndexOptions(@Nullable Boolean prune, @Nullable TokenPruningConfig pruningConfig) { | ||
this.prune = prune; | ||
this.pruningConfig = pruningConfig; | ||
} | ||
|
||
public Boolean getPrune() { | ||
return prune; | ||
} | ||
|
||
public TokenPruningConfig getPruningConfig() { | ||
return pruningConfig; | ||
} | ||
|
||
@Override | ||
public final boolean equals(Object other) { | ||
if (other == this) { | ||
return true; | ||
} | ||
|
||
if (other == null || getClass() != other.getClass()) { | ||
return false; | ||
} | ||
|
||
IndexOptions otherAsIndexOptions = (IndexOptions) other; | ||
return Objects.equals(prune, otherAsIndexOptions.prune) && Objects.equals(pruningConfig, otherAsIndexOptions.pruningConfig); | ||
} | ||
|
||
@Override | ||
public final int hashCode() { | ||
return Objects.hash(prune, pruningConfig); | ||
} | ||
|
||
@Override | ||
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException { | ||
builder.startObject(); | ||
|
||
if (prune != null) { | ||
builder.field(PRUNE_FIELD_NAME, prune); | ||
} | ||
if (pruningConfig != null) { | ||
builder.field(PRUNING_CONFIG_FIELD_NAME, pruningConfig); | ||
} | ||
|
||
builder.endObject(); | ||
return builder; | ||
} | ||
|
||
public static Boolean parseIndexOptionsPruneValue(Map<String, Object> indexOptionsMap) { | ||
Object shouldPrune = indexOptionsMap.remove(IndexOptions.PRUNE_FIELD_NAME); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general I'm not a fan of mutating the passed-in map. We accept it elsewhere because it's baked in at this point, but IMO we shouldn't do it for new implementations. |
||
if (shouldPrune == null) { | ||
return null; | ||
} | ||
|
||
if (shouldPrune instanceof Boolean boolValue) { | ||
return boolValue; | ||
} | ||
|
||
throw new MapperParsingException("[index_options] field [prune] should be true or false"); | ||
} | ||
|
||
public static TokenPruningConfig parseIndexOptionsPruningConfig(Boolean prune, Map<String, Object> indexOptionsMap) { | ||
Object pruningConfiguration = indexOptionsMap.remove(IndexOptions.PRUNING_CONFIG_FIELD_NAME); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same here |
||
if (pruningConfiguration == null) { | ||
return null; | ||
} | ||
|
||
if (prune == null || prune == false) { | ||
throw new MapperParsingException("[index_options] field [pruning_config] should only be set if [prune] is set to true"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: better to use variables to build the string ( i.e |
||
} | ||
|
||
Map<String, Object> pruningConfigurationMap = XContentMapValues.nodeMapValue(pruningConfiguration, PRUNING_CONFIG_FIELD_NAME); | ||
|
||
return TokenPruningConfig.parseFromMap(pruningConfigurationMap); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mapping may be a better area for this PR