-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Update sparse_vector field mapping to include default setting for token pruning #126739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 74 commits
e02cd3a
e24ab76
f39b78a
eeebfd8
51aab0c
983ddf1
9545a0c
19fe72d
58f9909
5f8e7b9
d7d27ba
d342656
96096ba
eed88c6
6a6052a
f977ea8
f38a6f1
436183b
24438e3
501099d
21323e4
a282e27
9d5df84
846fcff
2b3299e
832fe45
e593f17
3086a4b
7a24703
7ceb12a
f9d44e5
af006d4
5dd4728
f3b4a98
7ddb77a
02868b1
3625a37
92db1c6
a022b5c
6a7f46c
e95033c
99c3700
bdfc9b8
20bcf20
1f0718d
e30a141
404e645
bdfcf5e
f27dfb8
283b563
65c5147
fc78b0f
15c5eb3
0b17c16
4b46300
08d51c9
ae34841
74b19ca
a341322
a47b915
4e681bd
6e50539
fcf682f
095bb28
0c8c095
5bb6561
e4d547a
6c5e253
29dae8a
31f9e6d
0b1c1d2
b48deea
7f60eca
d8f3c63
be78331
d76323d
a0cc202
223a794
dd4a218
d60e2df
a3bb904
4a23c9c
e2e65db
cf7a302
7a12676
b43b19e
28b5b74
515cf9e
1a692fc
4320315
1cd6f5e
c5b9def
fd07f1c
42bc77d
8f6672f
a6bdc90
f499269
5e9fee1
6209919
5fefadb
a4bb87c
1192271
d7f1351
94e21df
212c850
c8fcd94
fa1737d
7d2f361
0f9bd65
2afeb2b
60a6b3e
9fd5e72
fe2e267
9953513
925173c
d7b064b
04a597e
bbcd309
8ba9aef
c4c65b9
9b7327b
88fc1f4
c0704df
368f8be
a43d2e3
c755cc0
7a41e0f
c02a647
811ca1a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
pr: 126739 | ||
summary: Update `sparse_vector` field mapping to include default setting for token | ||
pruning | ||
area: Relevance | ||
type: enhancement | ||
issues: [] |
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just a reminder, that we'll have to open a PR for 8.19 to update the appropriate asciidoc files as well 👍 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,6 +24,28 @@ PUT my-index | |
} | ||
``` | ||
|
||
Also, with optional `index_options` for pruning: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe add some clarification here, RE: why you might want to override token pruning? |
||
|
||
```console | ||
PUT my-index | ||
{ | ||
"mappings": { | ||
"properties": { | ||
"text.tokens": { | ||
"type": "sparse_vector", | ||
kderusso marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"index_options": { | ||
"prune": true, | ||
"pruning_config": { | ||
"tokens_freq_ratio_threshold": 5, | ||
"tokens_weight_threshold": 0.4 | ||
} | ||
} | ||
} | ||
} | ||
} | ||
} | ||
``` | ||
|
||
See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER. | ||
|
||
## Parameters for `sparse_vector` fields [sparse-vectors-params] | ||
|
@@ -36,6 +58,28 @@ The following parameters are accepted by `sparse_vector` fields: | |
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). | ||
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source). | ||
|
||
index_options | ||
: (Optional, object) You can set index options for your `sparse_vector` field to determine if you should prune tokens, and the parameter configurations for the token pruning. If pruning options are not set in your `sparse_query` vector, Elasticsearch will use the default options configured for the field, if any. The available options for the index options are: | ||
|
||
Parameters for `index_options` are: | ||
|
||
`prune` | ||
: (Optional, boolean) [preview] Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true. | ||
|
||
`pruning_config` | ||
: (Optional, object) [preview] Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur. | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Parameters for `pruning_config` include: | ||
|
||
`tokens_freq_ratio_threshold` | ||
: (Optional, integer) [preview] Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must between 1 and 100. Default: `5`. | ||
|
||
`tokens_weight_threshold` | ||
: (Optional, float) [preview] Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`. | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
::::{note} | ||
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen based on tests using ELSERv2 that provided the most optimal results. | ||
:::: | ||
|
||
|
||
## Multi-value sparse vectors [index-multi-value-sparse-vectors] | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,9 @@ | |
import org.apache.lucene.util.BytesRef; | ||
import org.elasticsearch.common.logging.DeprecationCategory; | ||
import org.elasticsearch.common.lucene.Lucene; | ||
import org.elasticsearch.common.xcontent.support.XContentMapValues; | ||
import org.elasticsearch.core.Nullable; | ||
import org.elasticsearch.features.NodeFeature; | ||
import org.elasticsearch.index.IndexVersion; | ||
import org.elasticsearch.index.IndexVersions; | ||
import org.elasticsearch.index.analysis.NamedAnalyzer; | ||
|
@@ -31,13 +34,16 @@ | |
import org.elasticsearch.index.mapper.FieldMapper; | ||
import org.elasticsearch.index.mapper.MappedFieldType; | ||
import org.elasticsearch.index.mapper.MapperBuilderContext; | ||
import org.elasticsearch.index.mapper.MapperParsingException; | ||
import org.elasticsearch.index.mapper.MappingParserContext; | ||
import org.elasticsearch.index.mapper.SourceLoader; | ||
import org.elasticsearch.index.mapper.SourceValueFetcher; | ||
import org.elasticsearch.index.mapper.TextSearchInfo; | ||
import org.elasticsearch.index.mapper.ValueFetcher; | ||
import org.elasticsearch.index.query.SearchExecutionContext; | ||
import org.elasticsearch.search.fetch.StoredFieldsSpec; | ||
import org.elasticsearch.search.lookup.Source; | ||
import org.elasticsearch.xcontent.ToXContent; | ||
import org.elasticsearch.xcontent.XContentBuilder; | ||
import org.elasticsearch.xcontent.XContentParser.Token; | ||
|
||
|
@@ -46,6 +52,7 @@ | |
import java.util.LinkedHashMap; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
import java.util.stream.Stream; | ||
|
||
import static org.elasticsearch.index.query.AbstractQueryBuilder.DEFAULT_BOOST; | ||
|
@@ -57,6 +64,7 @@ | |
public class SparseVectorFieldMapper extends FieldMapper { | ||
|
||
public static final String CONTENT_TYPE = "sparse_vector"; | ||
public static final String SPARSE_VECTOR_INDEX_OPTIONS = "index_options"; | ||
|
||
static final String ERROR_MESSAGE_7X = "[sparse_vector] field type in old 7.x indices is allowed to " | ||
+ "contain [sparse_vector] fields, but they cannot be indexed or searched."; | ||
|
@@ -65,6 +73,12 @@ public class SparseVectorFieldMapper extends FieldMapper { | |
|
||
static final IndexVersion NEW_SPARSE_VECTOR_INDEX_VERSION = IndexVersions.NEW_SPARSE_VECTOR; | ||
static final IndexVersion SPARSE_VECTOR_IN_FIELD_NAMES_INDEX_VERSION = IndexVersions.SPARSE_VECTOR_IN_FIELD_NAMES_SUPPORT; | ||
public static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION = | ||
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: can we make this package private like the other index versions? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's used in the SparseVectorQueryBuilder to ensure we use the same index version... we could add the same variable there, but, I think this would be more consistent. |
||
|
||
private final SparseVectorFieldMapper.IndexOptions indexOptions; | ||
kderusso marked this conversation as resolved.
Show resolved
Hide resolved
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
public static final NodeFeature SPARSE_VECTOR_INDEX_OPTIONS_FEATURE = new NodeFeature("sparse_vector_index_options_supported"); | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
private static SparseVectorFieldMapper toType(FieldMapper in) { | ||
return (SparseVectorFieldMapper) in; | ||
|
@@ -73,9 +87,23 @@ private static SparseVectorFieldMapper toType(FieldMapper in) { | |
public static class Builder extends FieldMapper.Builder { | ||
private final Parameter<Boolean> stored = Parameter.storeParam(m -> toType(m).fieldType().isStored(), false); | ||
private final Parameter<Map<String, String>> meta = Parameter.metaParam(); | ||
private final Parameter<IndexOptions> indexOptions; | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
public Builder(String name) { | ||
super(name); | ||
this.indexOptions = new Parameter<>( | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
SPARSE_VECTOR_INDEX_OPTIONS, | ||
true, | ||
() -> null, | ||
(n, c, o) -> parseIndexOptions(c, o), | ||
m -> toType(m).fieldType().indexOptions, | ||
(b, n, v) -> { | ||
if (v != null) { | ||
b.field(n, v); | ||
} | ||
}, | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
Objects::toString | ||
); | ||
} | ||
|
||
public Builder setStored(boolean value) { | ||
|
@@ -85,19 +113,40 @@ public Builder setStored(boolean value) { | |
|
||
@Override | ||
protected Parameter<?>[] getParameters() { | ||
return new Parameter<?>[] { stored, meta }; | ||
return new Parameter<?>[] { stored, meta, indexOptions }; | ||
} | ||
|
||
@Override | ||
public SparseVectorFieldMapper build(MapperBuilderContext context) { | ||
return new SparseVectorFieldMapper( | ||
leafName(), | ||
new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue()), | ||
new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue(), indexOptions.getValue()), | ||
builderParams(this, context) | ||
); | ||
} | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return this.indexOptions; | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
private static SparseVectorFieldMapper.IndexOptions parseIndexOptions(MappingParserContext context, Object propNode) { | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (propNode == null) { | ||
return null; | ||
} | ||
|
||
Map<String, Object> indexOptionsMap = XContentMapValues.nodeMapValue(propNode, SPARSE_VECTOR_INDEX_OPTIONS); | ||
|
||
Boolean prune = IndexOptions.parseIndexOptionsPruneValue(indexOptionsMap); | ||
TokenPruningConfig pruningConfig = IndexOptions.parseIndexOptionsPruningConfig(prune, indexOptionsMap); | ||
|
||
if (prune == null && pruningConfig == null) { | ||
kderusso marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return null; | ||
} | ||
|
||
return new SparseVectorFieldMapper.IndexOptions(prune, pruningConfig); | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
public static final TypeParser PARSER = new TypeParser((n, c) -> { | ||
if (c.indexVersionCreated().before(PREVIOUS_SPARSE_VECTOR_INDEX_VERSION)) { | ||
deprecationLogger.warn(DeprecationCategory.MAPPINGS, "sparse_vector", ERROR_MESSAGE_7X); | ||
|
@@ -109,9 +158,24 @@ public SparseVectorFieldMapper build(MapperBuilderContext context) { | |
}, notInMultiFields(CONTENT_TYPE)); | ||
|
||
public static final class SparseVectorFieldType extends MappedFieldType { | ||
private final IndexOptions indexOptions; | ||
|
||
public SparseVectorFieldType(String name, boolean isStored, Map<String, String> meta) { | ||
this(name, isStored, meta, null); | ||
} | ||
|
||
public SparseVectorFieldType( | ||
String name, | ||
boolean isStored, | ||
Map<String, String> meta, | ||
@Nullable SparseVectorFieldMapper.IndexOptions indexOptions | ||
) { | ||
super(name, true, isStored, false, TextSearchInfo.SIMPLE_MATCH_ONLY, meta); | ||
this.indexOptions = indexOptions; | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return indexOptions; | ||
} | ||
|
||
@Override | ||
|
@@ -159,6 +223,7 @@ private static String indexedValueForSearch(Object value) { | |
|
||
private SparseVectorFieldMapper(String simpleName, MappedFieldType mappedFieldType, BuilderParams builderParams) { | ||
super(simpleName, mappedFieldType, builderParams); | ||
this.indexOptions = ((SparseVectorFieldType) mappedFieldType).getIndexOptions(); | ||
} | ||
|
||
@Override | ||
|
@@ -364,4 +429,86 @@ public void reset() { | |
} | ||
} | ||
|
||
public static class IndexOptions implements ToXContent { | ||
public static final String PRUNE_FIELD_NAME = "prune"; | ||
public static final String PRUNING_CONFIG_FIELD_NAME = "pruning_config"; | ||
|
||
final Boolean prune; | ||
final TokenPruningConfig pruningConfig; | ||
|
||
IndexOptions(@Nullable Boolean prune, @Nullable TokenPruningConfig pruningConfig) { | ||
this.prune = prune; | ||
this.pruningConfig = pruningConfig; | ||
} | ||
|
||
public Boolean getPrune() { | ||
return prune; | ||
} | ||
|
||
public TokenPruningConfig getPruningConfig() { | ||
return pruningConfig; | ||
} | ||
|
||
@Override | ||
public final boolean equals(Object other) { | ||
if (other == this) { | ||
return true; | ||
} | ||
|
||
if (other == null || getClass() != other.getClass()) { | ||
return false; | ||
} | ||
|
||
IndexOptions otherAsIndexOptions = (IndexOptions) other; | ||
return Objects.equals(prune, otherAsIndexOptions.prune) && Objects.equals(pruningConfig, otherAsIndexOptions.pruningConfig); | ||
} | ||
|
||
@Override | ||
public final int hashCode() { | ||
return Objects.hash(prune, pruningConfig); | ||
} | ||
|
||
@Override | ||
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException { | ||
builder.startObject(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question here - if both |
||
|
||
if (prune != null) { | ||
builder.field(PRUNE_FIELD_NAME, prune); | ||
} | ||
if (pruningConfig != null) { | ||
builder.field(PRUNING_CONFIG_FIELD_NAME, pruningConfig); | ||
} | ||
|
||
builder.endObject(); | ||
return builder; | ||
} | ||
|
||
public static Boolean parseIndexOptionsPruneValue(Map<String, Object> indexOptionsMap) { | ||
Object shouldPrune = indexOptionsMap.remove(IndexOptions.PRUNE_FIELD_NAME); | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (shouldPrune == null) { | ||
return null; | ||
} | ||
|
||
if (shouldPrune instanceof Boolean boolValue) { | ||
return boolValue; | ||
} | ||
|
||
throw new MapperParsingException("[index_options] field [prune] should be true or false"); | ||
} | ||
|
||
public static TokenPruningConfig parseIndexOptionsPruningConfig(Boolean prune, Map<String, Object> indexOptionsMap) { | ||
Object pruningConfiguration = indexOptionsMap.remove(IndexOptions.PRUNING_CONFIG_FIELD_NAME); | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if (pruningConfiguration == null) { | ||
return null; | ||
} | ||
|
||
if (prune == null || prune == false) { | ||
throw new MapperParsingException("[index_options] field [pruning_config] should only be set if [prune] is set to true"); | ||
markjhoy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
Map<String, Object> pruningConfigurationMap = XContentMapValues.nodeMapValue(pruningConfiguration, PRUNING_CONFIG_FIELD_NAME); | ||
|
||
return TokenPruningConfig.parseFromMap(pruningConfigurationMap); | ||
} | ||
} | ||
} |
Uh oh!
There was an error while loading. Please reload this page.