Adding support to exclude semantic_text subfields #127664

Open: wants to merge 14 commits into base: main

Samiul-TheSoccerFan
Contributor

Update the fieldCaps API to exclude semantic_text subfields in both legacy and new formats.

Legacy format:

setup:


PUT test-field-caps-with-legacy
{
    "settings": {
        "index.mapping.semantic_text.use_legacy_format": true
    },
    "mappings": {
        "properties": {
            "test_field_legacy": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "non_infer_field_legacy": {
                "type": "text"
            },
            "sparse_vector_legacy": {
                "type": "sparse_vector"
            },
            "dense_vector_legacy": {
                "type": "dense_vector",
                "dims": 3,
                "similarity": "l2_norm"
            }
        }
    }
}

PUT test-field-caps-with-legacy/_doc/doc1
{
    "test_field_legacy": "these are not the droids you're looking for. He's free to go around",
    "sparse_vector_legacy": {
        "these": 1,
        "are": 2,
        "not": 3
    },
    "dense_vector_legacy": [1, 2, 3]
}

Query:

GET /_field_caps?allow_no_indices=true&fields=*&index=test*&ignore_unavailable=true&expand_wildcards=open

Response before update (trimmed):

{
  "indices": [
    "test-field-caps-with-legacy"
  ],
  "fields": {
    "non_infer_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks.text": {
      "keyword": {
        "type": "keyword",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference": {
      "object": {
        "type": "object",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "sparse_vector_legacy": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks.embeddings": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector_legacy": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy.inference.chunks": {
      "nested": {
        "type": "nested",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    }
  }
}

Response after update (trimmed):

{
  "indices": [
    "test-field-caps-with-legacy"
  ],
  "fields": {
    "non_infer_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "sparse_vector_legacy": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field_legacy": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector_legacy": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}

New format:

setup:

PUT test-field-caps
{
    "mappings": {
        "properties": {
            "test_field": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            },
            "non_infer_field": {
                "type": "text"
            },
            "sparse_vector": {
                "type": "sparse_vector"
            },
            "dense_vector": {
                "type": "dense_vector",
                "dims": 3,
                "similarity": "l2_norm"
            }
        }
    }
}

PUT test-field-caps/_doc/doc1
{
    "test_field": "these are not the droids you're looking for. He's free to go around",
    "sparse_vector": {
        "these": 1,
        "are": 2,
        "not": 3
    },
    "dense_vector": [1, 2, 3]
}

Query:

GET /_field_caps?allow_no_indices=true&fields=*&index=test*&ignore_unavailable=true&expand_wildcards=open

Response before update (trimmed):

{
  "indices": [
    "test-field-caps"
  ],
  "fields": {
    "_ignored_source": {
      "_ignored_source": {
        "type": "_ignored_source",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "non_infer_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_index": {
      "_index": {
        "type": "_index",
        "metadata_field": true,
        "searchable": true,
        "aggregatable": true
      }
    },
    "_feature": {
      "_feature": {
        "type": "_feature",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "sparse_vector": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks.embeddings": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks.offset": {
      "offset_source": {
        "type": "offset_source",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_inference_fields": {
      "_inference_fields": {
        "type": "_inference_fields",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "test_field.inference": {
      "object": {
        "type": "object",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    },
    "dense_vector": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "test_field.inference.chunks": {
      "nested": {
        "type": "nested",
        "metadata_field": false,
        "searchable": false,
        "aggregatable": false
      }
    }
  }
}

Response after update (trimmed):

{
  "indices": [
    "test-field-caps"
  ],
  "fields": {
    "test_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "_inference_fields": {
      "_inference_fields": {
        "type": "_inference_fields",
        "metadata_field": true,
        "searchable": false,
        "aggregatable": false
      }
    },
    "non_infer_field": {
      "text": {
        "type": "text",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "sparse_vector": {
      "sparse_vector": {
        "type": "sparse_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    },
    "dense_vector": {
      "dense_vector": {
        "type": "dense_vector",
        "metadata_field": false,
        "searchable": true,
        "aggregatable": false
      }
    }
  }
}
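
To make the before/after difference concrete, here is a minimal sketch of the filtering effect on field names. It is illustrative only and is not the Elasticsearch implementation: the class and both helper methods are hypothetical, and it assumes every hidden subfield lives under `<field>.inference`, which is what the "before" responses above show.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: hypothetical helpers, not Elasticsearch code. */
final class FieldCapsSubfieldFilterSketch {

    /** True if the name is the ".inference" object of a semantic_text field or sits beneath it. */
    static boolean isSemanticTextSubfield(String fieldName, Set<String> semanticTextFields) {
        for (String parent : semanticTextFields) {
            String prefix = parent + ".inference";
            if (fieldName.equals(prefix) || fieldName.startsWith(prefix + ".")) {
                return true;
            }
        }
        return false;
    }

    /** Returns the field names that stay visible, sorted for stable output. */
    static Set<String> visibleFields(List<String> allFields, Set<String> semanticTextFields) {
        Set<String> visible = new TreeSet<>();
        for (String name : allFields) {
            if (!isSemanticTextSubfield(name, semanticTextFields)) {
                visible.add(name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Field names taken from the legacy-format "before" response.
        List<String> before = List.of(
            "non_infer_field_legacy",
            "test_field_legacy.inference.chunks.text",
            "test_field_legacy.inference",
            "sparse_vector_legacy",
            "test_field_legacy",
            "test_field_legacy.inference.chunks.embeddings",
            "dense_vector_legacy",
            "test_field_legacy.inference.chunks"
        );
        System.out.println(visibleFields(before, Set.of("test_field_legacy")));
    }
}
```

The prefix check compares against `parent + ".inference."` rather than a bare `startsWith(parent)`, so an unrelated field whose name merely begins with the same characters is not hidden.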

@Samiul-TheSoccerFan added the >enhancement, v9.1.0, :Search Foundations/Mapping, :Search Relevance/Vectors, and :SearchOrg/Relevance labels on May 2, 2025
@elasticsearchmachine
Collaborator

Hi @Samiul-TheSoccerFan, I've created a changelog YAML for you.

Comment on lines 365 to 367
- requires:
    cluster_features: "gte_v8.16.0"
    reason: field_caps support for semantic_text added in 8.16.0
Contributor Author

Do we need to define a new cluster feature? As I understand it, these fields are not expected from the field_caps API, so excluding them should have no impact at the API level or in Discover. Backward compatibility is also covered by another YAML file.

Member

I think it would be good to create a test feature for these tests.

Member

@kderusso left a comment

Agreed with @Mikep86's comments in Slack, but good start!

Contributor

@Mikep86 left a comment

Much better! We need to tweak the approach a bit but this is getting closer.

@Samiul-TheSoccerFan marked this pull request as ready for review on May 8, 2025 14:17
@elasticsearchmachine added the Team:SearchOrg and Team:Search Foundations labels on May 8, 2025
@elasticsearchmachine added the Team:Search Relevance and Team:Search - Relevance labels on May 8, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

Contributor

@Mikep86 left a comment

Looks good overall, just some nits and tests to clean up

@@ -448,10 +455,12 @@ public KeywordFieldMapper build(MapperBuilderContext context) {
indexCreatedVersion,
IndexVersions.SYNTHETIC_SOURCE_STORE_ARRAYS_NATIVELY_KEYWORD
);

KeywordFieldType keywordFieldType = buildFieldType(context, fieldtype);
Contributor

Nit: we can call buildFieldType(context, fieldtype) inline when creating the KeywordFieldMapper; the statement stays short and readable.

Comment on lines 187 to 189
/**
* @return true if fieldType is subfields of semantic_text type
*/
Contributor

This documentation should be more generic. Semantic text is only one case where we want to exclude a field from field caps, there could be others in the future.

Suggested change
- /**
-  * @return true if fieldType is subfields of semantic_text type
-  */
+ /**
+  * @return true if the field should be excluded from field caps
+  */
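
As a rough illustration of why the generic wording fits, the sketch below models a field type that carries an exclusion flag, with a caller that filters on it without knowing the reason. Everything here is hypothetical: `SketchFieldType` and `fieldCapsNames` are stand-ins invented for this example and are not the real MappedFieldType API or the field caps code path.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical types, not the Elasticsearch MappedFieldType API. */
final class ExcludeFromFieldCapsSketch {

    /** Hypothetical stand-in for a mapped field type. */
    static final class SketchFieldType {
        private final String name;
        private final boolean excludeFromFieldCaps;

        SketchFieldType(String name, boolean excludeFromFieldCaps) {
            this.name = name;
            this.excludeFromFieldCaps = excludeFromFieldCaps;
        }

        String name() {
            return name;
        }

        /**
         * @return true if the field should be excluded from field caps
         */
        boolean excludeFromFieldCaps() {
            return excludeFromFieldCaps;
        }
    }

    /** The caller checks only the generic flag; it never asks why a field is hidden. */
    static List<String> fieldCapsNames(List<SketchFieldType> types) {
        List<String> names = new ArrayList<>();
        for (SketchFieldType type : types) {
            if (!type.excludeFromFieldCaps()) {
                names.add(type.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<SketchFieldType> types = List.of(
            new SketchFieldType("test_field", false),
            new SketchFieldType("test_field.inference.chunks.embeddings", true)
        );
        System.out.println(fieldCapsNames(types));
    }
}
```

Because the filter depends only on the flag, a future non-semantic_text field could opt out the same way without touching the caller.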

@@ -83,18 +84,25 @@ public Builder setStored(boolean value) {
return this;
}

public Builder setExcludeFromFieldCaps(boolean value) {
Contributor

Nit: Can we change this to excludeFromFieldCaps to match the other builders?
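
For readers unfamiliar with the naming convention being asked for, here is a small sketch of a fluent builder whose setters are named after the property, without a `set` prefix. The `Builder` class and its `describe()` method are hypothetical and only mirror the call style that appears elsewhere in this PR (`.indexed(false).docValues(false).excludeFromFieldCaps(true)`); this is not the actual mapper builder.

```java
/** Illustrative only: a hypothetical builder, not an Elasticsearch mapper builder. */
final class FluentBuilderNamingSketch {

    static final class Builder {
        private boolean indexed = true;
        private boolean docValues = true;
        private boolean excludeFromFieldCaps = false;

        // Each setter is named after the property and returns this for chaining.
        Builder indexed(boolean value) {
            this.indexed = value;
            return this;
        }

        Builder docValues(boolean value) {
            this.docValues = value;
            return this;
        }

        Builder excludeFromFieldCaps(boolean value) {
            this.excludeFromFieldCaps = value;
            return this;
        }

        /** Summarizes the configured flags so the chained calls can be inspected. */
        String describe() {
            return "indexed=" + indexed + ", docValues=" + docValues + ", excludeFromFieldCaps=" + excludeFromFieldCaps;
        }
    }

    public static void main(String[] args) {
        Builder chunkText = new Builder().indexed(false).docValues(false).excludeFromFieldCaps(true);
        System.out.println(chunkText.describe());
    }
}
```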

@@ -134,6 +134,9 @@ public class SemanticTextFieldMapper extends FieldMapper implements InferenceFie
public static final NodeFeature SEMANTIC_TEXT_SKIP_INFERENCE_FIELDS = new NodeFeature("semantic_text.skip_inference_fields");
public static final NodeFeature SEMANTIC_TEXT_BIT_VECTOR_SUPPORT = new NodeFeature("semantic_text.bit_vector_support");
public static final NodeFeature SEMANTIC_TEXT_SUPPORT_CHUNKING_CONFIG = new NodeFeature("semantic_text.support_chunking_config");
public static final NodeFeature SEMANTIC_TEXT_SUB_FIELDS_EXCLUDE_FROM_FIELD_CAPS = new NodeFeature(
Contributor

Nit: SEMANTIC_TEXT_EXCLUDE_SUB_FIELDS_FROM_FIELD_CAPS matches the node feature name better

Comment on lines 378 to 381
- not_exists: fields.sparse_field.chunks.embeddings
- not_exists: fields.sparse_field.chunks.offset
- not_exists: fields.dense_field.chunks.embeddings
- not_exists: fields.dense_field.chunks.offset
Contributor

Shouldn't this be *.inference.chunks.*?

Also, can we check that field caps excludes inference and inference.chunks here too?
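
A small sketch of the full set of paths such a test would need to cover, including the `inference` and `inference.chunks` parents. The helper `hiddenPaths` is hypothetical; the path shapes are taken from the "before" responses in the PR description, where the legacy format has a `chunks.text` leaf and the new format has a `chunks.offset` leaf. In the YAML tests each path would additionally carry the `fields.` prefix of the response body.

```java
import java.util.List;

/** Illustrative only: a hypothetical helper for enumerating paths a test should assert are absent. */
final class HiddenSubfieldPathsSketch {

    /** Lists the semantic_text subfield paths seen in the "before" responses for one field. */
    static List<String> hiddenPaths(String field, boolean legacyFormat) {
        String inference = field + ".inference";
        String chunks = inference + ".chunks";
        // The legacy format exposes the chunk text; the new format exposes chunk offsets.
        String formatSpecificLeaf = chunks + (legacyFormat ? ".text" : ".offset");
        return List.of(inference, chunks, chunks + ".embeddings", formatSpecificLeaf);
    }

    public static void main(String[] args) {
        for (String path : hiddenPaths("sparse_field", false)) {
            System.out.println("- not_exists: fields." + path);
        }
    }
}
```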

"Field caps exclude chunks and embedding fields":
  - requires:
      cluster_features: "semantic_text.exclude_sub_fields_from_field_caps"
      reason: field caps api exclude semantic_text subfields from 9.1.0
Contributor

8.19.0 & 9.1.0

"Field caps exclude chunks embedding and text fields":
  - requires:
      cluster_features: "semantic_text.exclude_sub_fields_from_field_caps"
      reason: field caps api exclude semantic_text subfields from 9.1.0
Contributor

8.19.0 & 9.1.0

Comment on lines 325 to 328
- not_exists: fields.sparse_field.inference.chunks.embeddings
- not_exists: fields.sparse_field.inference.chunks.text
- not_exists: fields.dense_field.inference.chunks.embeddings
- not_exists: fields.dense_field.inference.chunks.text
Contributor

Can we check that field caps excludes inference and inference.chunks here too?

Comment on lines +1049 to +1051
var chunkTextField = new KeywordFieldMapper.Builder(TEXT_FIELD, indexVersionCreated).indexed(false)
.docValues(false)
.excludeFromFieldCaps(true);
Contributor

@jimczi Do we need index version checks around setting this new flag on the mapping?

Member

@kderusso left a comment

Nice improvements! No additional feedback other than what Mike already flagged.

@Samiul-TheSoccerFan
Contributor Author

@elasticmachine update branch

@javanna
Member

javanna commented May 9, 2025

Thanks for working on this. Could we be explicit in the description of this PR around the motivations behind this change? There is no opt-in or opt-out, hence we are just declaring that these fields are not useful in the field caps output at all times?

I am guessing that the issue is around Kibana showing these in Discover? Have we considered alternatives?

@Mikep86
Contributor

Mikep86 commented May 9, 2025

@javanna

> There is no opt-in or opt-out, hence we are just declaring that these fields are not useful in the field caps output at all times?
>
> I am guessing that the issue is around Kibana showing these in Discover? Have we considered alternatives?

The subfields we are excluding are semantic_text implementation details that the user should not rely on. Yes, the primary motivation here is Kibana showing them in Discover. We get a fair number of questions about why these fields show up, and each time we have to reiterate that they are implementation details and should not be used externally.

Hiding these fields from field caps codifies that these are implementation details at the API level, creating a consistent experience in this respect for Kibana and API users.

Are there alternatives you're aware of that would achieve the same result?

Edit: Another thing to point out here is that these subfields are deliberately not visible in the mappings returned to the user, which further anchors them as implementation details.

@Mikep86
Contributor

Mikep86 commented May 9, 2025

I talked with @jimczi and we have an alternative implementation that does not involve adding a flag to MappedFieldType. The end effect on the field caps API output is the same, though, so the tests should remain valid.

Labels
>enhancement, :Search Foundations/Mapping, :Search Relevance/Vectors, :SearchOrg/Relevance, Team:Search - Relevance, Team:Search Foundations, Team:Search Relevance, Team:SearchOrg, v8.19.0, v9.1.0
6 participants