Skip to content

Enable synthetic source on normalized keyword mappings #126623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 41 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
2a878cf
add tests with normalizer mappings
not-napoleon Apr 7, 2025
0720fe7
fix KeywordFieldBlockLoaderTests
not-napoleon Apr 10, 2025
625938f
Update docs/changelog/126623.yaml
not-napoleon Apr 10, 2025
e26e543
proof of concept
not-napoleon Apr 10, 2025
c8a0025
better, but not done
not-napoleon Apr 10, 2025
f8194ae
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Apr 10, 2025
6b20735
[CI] Auto commit changes from spotless
elasticsearchmachine Apr 10, 2025
e829823
account for normalizing null values
not-napoleon Apr 11, 2025
d2efb0b
remove code I'd only commened out
not-napoleon Apr 11, 2025
c2a11e3
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Apr 11, 2025
417b084
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Apr 11, 2025
949f42f
fix yaml tests
not-napoleon Apr 11, 2025
0ed6a2b
add test for opting out via keep all
not-napoleon Apr 15, 2025
8d29695
Update docs/changelog/126623.yaml
not-napoleon May 27, 2025
4478a76
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon May 27, 2025
e1f5958
fix merge semantic confilct
not-napoleon May 27, 2025
70dd8e7
changelog
not-napoleon May 27, 2025
0984db0
add test feature to gate the tests
not-napoleon May 28, 2025
63338d7
fix changelog, again
not-napoleon May 28, 2025
8c91da5
fix yaml
not-napoleon May 28, 2025
9110abe
remove no commit
not-napoleon May 28, 2025
c747219
skip the rest compatbility test for the breaking change
not-napoleon May 28, 2025
35153c1
unvar
not-napoleon Jun 3, 2025
aed9477
Feedback from MVG on fields with both keyword and text subfields
not-napoleon Jun 4, 2025
5c9f989
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 4, 2025
270f066
fix exact subfield tests
not-napoleon Jun 4, 2025
5039f3c
skip BWC for tests we expect to break
not-napoleon Jun 4, 2025
2fda1a4
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 5, 2025
a739e87
put the Rest Compatibility skip in the right place
not-napoleon Jun 5, 2025
030bb9c
Merge remote-tracking branch 'refs/remotes/not-napoleon/generate-mapp…
not-napoleon Jun 5, 2025
10cf2d1
Update docs/changelog/126623.yaml
not-napoleon Jun 5, 2025
d28b168
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 5, 2025
4ed5e3e
Update changelog
not-napoleon Jun 5, 2025
f158c00
Merge branch 'main' into generate-mappings-with-normalizers
not-napoleon Jun 6, 2025
fbe4adb
Update docs/changelog/126623.yaml
not-napoleon Jun 6, 2025
0ca54e9
fix changelog, again
not-napoleon Jun 6, 2025
86e5b22
response to PR feedback
not-napoleon Jun 6, 2025
63bf7d1
Revert "Feedback from MVG on fields with both keyword and text subfie…
not-napoleon Jun 10, 2025
43b1075
Revert "fix exact subfield tests"
not-napoleon Jun 10, 2025
2b9d384
leaving a note for future investigation
not-napoleon Jun 11, 2025
3b33b15
Update docs/changelog/126623.yaml
not-napoleon Jun 11, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions docs/changelog/126623.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
pr: 126623
summary: Enable synthetic source on normalized keyword mappings
area: Mapping
type: "breaking, enhancement"
issues:
- 124369
- 121358
breaking:
title: Enable synthetic source on normalized keyword mappings
area: Mapping
details: |-
This changes the default behavior for Synthetic Source on keyword fields using normalizers. Prior to this change, normalized keywords were always stored to allow returning the non-normalized values. Under this change, such field will NOT be stored (i.e they will be synthesized from the index when returning source, like all other synthetic source fields). This should result in considerable space improvement for this use case.
Users can opt out of this behavior on a per-field basis by setting `synthetic_source_keep` to `all` on the field.
impact: "By default, normalized keyword fields in synthetic source indices will\
\ no longer return the non-normalized value in the source."
notable: false
1 change: 1 addition & 0 deletions rest-api-spec/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -89,4 +89,5 @@ tasks.named("yamlRestCompatTestTransform").configure ({ task ->
task.skipTest("indices.create/21_synthetic_source_stored/field param - keep root array", "Synthetic source keep arrays now stores leaf arrays natively")
task.skipTest("cluster.info/30_info_thread_pool/Cluster HTTP Info", "The search_throttled thread pool has been removed")
task.skipTest("synonyms/80_synonyms_from_index/Fail loading synonyms from index if synonyms_set doesn't exist", "Synonyms do no longer fail if the synonyms_set doesn't exist")
task.skipTest("mget/90_synthetic_source/keyword with normalizer", "Normalized keywords now use synthetic source")
})
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ setup:
---
keyword:
- requires:
cluster_features: ["gte_v8.4.0"]
cluster_features: [ "gte_v8.4.0" ]
reason: introduced in 8.4.0

- do:
Expand All @@ -23,15 +23,15 @@ keyword:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
kwd: foo

- do:
index:
index: test
id: 2
index: test
id: 2
body:
kwd: bar

Expand All @@ -40,21 +40,24 @@ keyword:
mget:
index: test
body:
ids: [1, 2]
- match: {docs.0._index: "test"}
- match: {docs.0._id: "1"}
ids: [ 1, 2 ]
- match: { docs.0._index: "test" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
kwd: foo

- match: {docs.1._index: "test"}
- match: {docs.1._id: "2"}
- match: { docs.1._index: "test" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
kwd: bar

---
keyword with normalizer:
- requires:
cluster_features: "mapper.keyword.keyword_normalizer_synthetic_source"
reason: "Behavior changed in #126623"
- do:
indices.create:
index: test-keyword-with-normalizer
Expand Down Expand Up @@ -113,7 +116,94 @@ keyword with normalizer:
mget:
index: test-keyword-with-normalizer
body:
ids: [ 1, 2, 3 ]
ids: [ 1, 2, 3 ]
- match: { docs.0._index: "test-keyword-with-normalizer" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
keyword: "the quick brown fox jumps over the lazy dog"
keyword_with_ignore_above: "the Quick Brown Fox jumps over the lazy Dog"
keyword_without_doc_values: "the Quick Brown Fox jumps over the lazy Dog"

- match: { docs.1._index: "test-keyword-with-normalizer" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
keyword: "the five boxing wizards jump quickly"
keyword_with_ignore_above: "The five BOXING wizards jump Quickly"
keyword_without_doc_values: "The five BOXING wizards jump Quickly"

- match: { docs.2._index: "test-keyword-with-normalizer" }
- match: { docs.2._id: "3" }
- match:
docs.2._source:
keyword: [ "do or do not, there is no try", "may the force be with you!" ]
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]

---
keyword with normalizer, source keep mode all:
- do:
indices.create:
index: test-keyword-with-normalizer
body:
settings:
analysis:
normalizer:
lowercase:
type: custom
filter:
- lowercase
index:
mapping.source.mode: synthetic

mappings:
properties:
keyword:
type: keyword
normalizer: lowercase
synthetic_source_keep: all
keyword_with_ignore_above:
type: keyword
normalizer: lowercase
ignore_above: 10
keyword_without_doc_values:
type: keyword
normalizer: lowercase
doc_values: false

- do:
index:
index: test-keyword-with-normalizer
id: 1
body:
keyword: "the Quick Brown Fox jumps over the lazy Dog"
keyword_with_ignore_above: "the Quick Brown Fox jumps over the lazy Dog"
keyword_without_doc_values: "the Quick Brown Fox jumps over the lazy Dog"

- do:
index:
index: test-keyword-with-normalizer
id: 2
body:
keyword: "The five BOXING wizards jump Quickly"
keyword_with_ignore_above: "The five BOXING wizards jump Quickly"
keyword_without_doc_values: "The five BOXING wizards jump Quickly"

- do:
index:
index: test-keyword-with-normalizer
id: 3
body:
keyword: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_with_ignore_above: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]
keyword_without_doc_values: [ "May the FORCE be with You!", "Do or Do Not, There is no Try" ]

- do:
mget:
index: test-keyword-with-normalizer
body:
ids: [ 1, 2, 3 ]
- match: { docs.0._index: "test-keyword-with-normalizer" }
- match: { docs.0._id: "1" }
- match:
Expand Down Expand Up @@ -141,7 +231,7 @@ keyword with normalizer:
---
stored text:
- requires:
cluster_features: ["gte_v8.5.0"]
cluster_features: [ "gte_v8.5.0" ]
reason: introduced in 8.5.0

- do:
Expand All @@ -159,39 +249,39 @@ stored text:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
text: the quick brown fox

- do:
index:
index: test
id: 2
index: test
id: 2
body:
text: jumped over the lazy dog

- do:
mget:
index: test
body:
ids: [1, 2]
- match: {docs.0._index: "test"}
- match: {docs.0._id: "1"}
ids: [ 1, 2 ]
- match: { docs.0._index: "test" }
- match: { docs.0._id: "1" }
- match:
docs.0._source:
text: the quick brown fox

- match: {docs.1._index: "test"}
- match: {docs.1._id: "2"}
- match: { docs.1._index: "test" }
- match: { docs.1._id: "2" }
- match:
docs.1._source:
text: jumped over the lazy dog

---
force_synthetic_source_ok:
- requires:
cluster_features: ["gte_v8.4.0"]
cluster_features: [ "gte_v8.4.0" ]
reason: introduced in 8.4.0

- do:
Expand All @@ -210,15 +300,15 @@ force_synthetic_source_ok:

- do:
index:
index: test
id: 1
index: test
id: 1
body:
obj.kwd: foo

- do:
index:
index: test
id: 2
index: test
id: 2
body:
obj:
kwd: bar
Expand All @@ -228,7 +318,7 @@ force_synthetic_source_ok:
mget:
index: test
body:
ids: [1, 2]
ids: [ 1, 2 ]
- match:
docs.0._source:
obj.kwd: foo
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@
import org.elasticsearch.common.lucene.search.AutomatonQueries;
import org.elasticsearch.common.unit.Fuzziness;
import org.elasticsearch.core.Nullable;
import org.elasticsearch.features.NodeFeature;
import org.elasticsearch.index.IndexMode;
import org.elasticsearch.index.IndexSortConfig;
import org.elasticsearch.index.IndexVersion;
Expand Down Expand Up @@ -94,6 +95,10 @@
*/
public final class KeywordFieldMapper extends FieldMapper {

public static final NodeFeature KEYWORD_NORMALIZER_SYNTHETIC_SOURCE = new NodeFeature(
"mapper.keyword.keyword_normalizer_synthetic_source"
);

private static final Logger logger = LogManager.getLogger(KeywordFieldMapper.class);

public static final String CONTENT_TYPE = "keyword";
Expand Down Expand Up @@ -1276,11 +1281,10 @@ private String originalName() {

@Override
protected SyntheticSourceSupport syntheticSourceSupport() {
if (hasNormalizer()) {
// NOTE: no matter if we have doc values or not we use fallback synthetic source
// to store the original value whose doc values would be altered by the normalizer
return SyntheticSourceSupport.FALLBACK;
}
/* NOTE: we allow enabling synthetic source on Keyword fields with a Normalizer, even though the returned synthetic value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we mention why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

may not perfectly match the original, pre-normalization, value. This reduces the storage space required for normalized keyword
fields, which otherwise perform quite poorly in terms of storage.
*/

if (fieldType.stored() || hasDocValues) {
return new SyntheticSourceSupport.Native(() -> syntheticFieldLoader(fullPath(), leafName()));
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,8 @@ public Set<NodeFeature> getTestFeatures() {
NPE_ON_DIMS_UPDATE_FIX,
RESCORE_ZERO_VECTOR_QUANTIZED_VECTOR_MAPPING,
USE_DEFAULT_OVERSAMPLE_VALUE_FOR_BBQ,
IVF_FORMAT_CLUSTER_FEATURE
IVF_FORMAT_CLUSTER_FEATURE,
KeywordFieldMapper.KEYWORD_NORMALIZER_SYNTHETIC_SOURCE
);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -300,6 +300,12 @@ public Builder(String name, IndexVersion indexCreatedVersion, IndexAnalyzers ind
// storing the field without requiring users to explicitly set 'store'.
//
// If 'store' parameter was explicitly provided we'll reject the request.

/* NOTE: I am fairly sure the above is strictly not true. Testing seems
show that we do not reject the mapping, even when there is no compatible keyword
field and the mapping sets store to false. --MT 2025-06-11
*/

this.store = Parameter.storeParam(
m -> ((TextFieldMapper) m).store,
() -> isSyntheticSourceEnabled && multiFieldsBuilder.hasSyntheticSourceCompatibleKeywordField() == false
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
import org.elasticsearch.index.mapper.MappedFieldType;

import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
Expand All @@ -34,19 +35,21 @@ protected Object expected(Map<String, Object> fieldMapping, Object value, TestCo
public static Object expectedValue(Map<String, Object> fieldMapping, Object value, Params params, TestContext testContext) {
var nullValue = (String) fieldMapping.get("null_value");

var ignoreAbove = fieldMapping.get("ignore_above") == null
int ignoreAbove = fieldMapping.get("ignore_above") == null
? Integer.MAX_VALUE
: ((Number) fieldMapping.get("ignore_above")).intValue();

String normalizerName = (String) fieldMapping.get("normalizer");

if (value == null) {
return convert(null, nullValue, ignoreAbove);
return convert(null, nullValue, ignoreAbove, normalizerName);
}

if (value instanceof String s) {
return convert(s, nullValue, ignoreAbove);
return convert(s, nullValue, ignoreAbove, normalizerName);
}

Function<Stream<String>, Stream<BytesRef>> convertValues = s -> s.map(v -> convert(v, nullValue, ignoreAbove))
Function<Stream<String>, Stream<BytesRef>> convertValues = s -> s.map(v -> convert(v, nullValue, ignoreAbove, normalizerName))
.filter(Objects::nonNull);

boolean hasDocValues = hasDocValues(fieldMapping, true);
Expand All @@ -67,15 +70,21 @@ public static Object expectedValue(Map<String, Object> fieldMapping, Object valu
return maybeFoldList(resultList);
}

private static BytesRef convert(String value, String nullValue, int ignoreAbove) {
private static BytesRef convert(String value, String nullValue, int ignoreAbove, String normalizer) {
if (value == null) {
if (nullValue != null) {
value = nullValue;
} else {
return null;
}
}

if (Objects.equals(normalizer, "lowercase")) {
// hopefully not Turkish...
value = value.toLowerCase(Locale.ROOT);
} else if (normalizer != null) {
// we probably can't get here anyway, since MapperServiceTestCase only initializes the lowercase normalizer
throw new IllegalArgumentException("normalizer [" + normalizer + "] not supported for block loader tests");
}
return value.length() <= ignoreAbove ? new BytesRef(value) : null;
}
}
Loading
Loading