Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. #114903

martijnvg · 2024-10-16T12:52:21Z

Currently, when the compute engine loads `_source` and the source mode is synthetic, the synthetic source loader is already used. However, the `_ignored_source` field isn't always marked as a required stored field, so the synthesized source can be missing many fields.

This change adds the `_ignored_source` field as a required stored field and allows keyword fields without doc values or stored fields to be used with synthetic source.

Relying on synthetic source to get values (because a field has neither stored fields nor doc values) is slow. With synthetic source we already keep ignored fields and their values in a special place, named ignored source. Long term, with synthetic source we should load only ignored source when a field has no doc values or stored field, as is being explored in #114886, thereby avoiding synthesizing the complete `_source` just to get one field.
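To illustrate why leaving `_ignored_source` out of the requested stored fields can produce an incomplete `_source`, here is a minimal, self-contained sketch. It is a toy model and not Elasticsearch code: `IgnoredSourceSketch`, `synthesize`, and the map-based "stored fields" are hypothetical names introduced purely for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: none of these types or methods exist in Elasticsearch.
final class IgnoredSourceSketch {

    static final String IGNORED_SOURCE = "_ignored_source";

    /**
     * Rebuilds a flat "_source" from doc values plus the stored fields that were requested.
     * Values kept under _ignored_source only show up when that stored field was requested.
     */
    static Map<String, Object> synthesize(
        Map<String, Object> docValues,
        Map<String, Map<String, Object>> storedFields,
        Set<String> requestedStoredFields
    ) {
        Map<String, Object> source = new TreeMap<>(docValues);
        if (requestedStoredFields.contains(IGNORED_SOURCE)) {
            source.putAll(storedFields.getOrDefault(IGNORED_SOURCE, Map.of()));
        }
        return source;
    }

    public static void main(String[] args) {
        Map<String, Object> docValues = Map.of("status", 200);
        Map<String, Map<String, Object>> stored = Map.of(IGNORED_SOURCE, Map.of("method", "GET"));

        // Without requesting _ignored_source, "method" is silently missing.
        System.out.println(synthesize(docValues, stored, Set.of()));               // {status=200}
        // With it, the synthesized source is complete.
        System.out.println(synthesize(docValues, stored, Set.of(IGNORED_SOURCE))); // {method=GET, status=200}
    }
}
```

In this model the only difference between the two calls is the set of requested stored fields, which mirrors the symptom described above: the loader works, but the result lacks fields.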

dnhatn

LGTM. Thanks @martijnvg

dnhatn · 2024-10-16T23:01:27Z

server/src/main/java/org/elasticsearch/index/mapper/BlockSourceReader.java


 /**
 * Loads values from {@code _source}. This whole process is very slow and cast-tastic,
 * so it doesn't really try to avoid megamorphic invocations. It's just going to be
 * slow.
 */
 public abstract class BlockSourceReader implements BlockLoader.RowStrideReader {
+
+    // _ignored_source is needed for synthetic source; in case stored source (default) is used,


dnhatn · 2024-10-16T23:01:45Z

server/src/main/java/org/elasticsearch/index/mapper/BlockSourceReader.java

+
+    // _ignored_source is needed for synthetic source; in case stored source (default) is used,
+    // then it just doesn't get loaded.
+    static final StoredFieldsSpec NEEDS_SOURCE_AND_IGNORED_SOURCE = new StoredFieldsSpec(


Ideally, we should avoid requesting _source and only read _ignored_source when synthetic source is enabled. However, we should get this in to ensure correctness, and make the block loader more selective in a follow-up. I can help with that.

A few compute engine unit tests failed, and instead of adjusting the tests, I made the block loaders a little more selective, based on whether synthetic source is used. See: 6702a2e
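As a rough sketch of what "more selective, based on whether synthetic source is used" could look like, the snippet below picks the extra stored fields from the source mode. The `SourceMode` enum and `extraStoredFieldsFor` method are hypothetical stand-ins; this does not reproduce the `StoredFieldsSpec` API or the code in 6702a2e.

```java
import java.util.Set;

// Hypothetical names throughout; this is not the StoredFieldsSpec API from the PR.
final class SourceModeSpecSketch {

    enum SourceMode { STORED, SYNTHETIC }

    /** Extra stored fields a _source-based reader would request for the given source mode. */
    static Set<String> extraStoredFieldsFor(SourceMode mode) {
        switch (mode) {
            case SYNTHETIC:
                // Synthetic source is rebuilt at read time, so ignored values must be fetched too.
                return Set.of("_ignored_source");
            case STORED:
            default:
                // Stored source already holds every field, so nothing extra is requested.
                return Set.of();
        }
    }

    public static void main(String[] args) {
        for (SourceMode mode : SourceMode.values()) {
            System.out.println(mode + " -> " + extraStoredFieldsFor(mode));
        }
    }
}
```

Keeping the decision in one place means tests that assume stored source see no change in the requested fields, which is presumably why this was easier than adjusting the failing tests.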

elasticsearchmachine · 2024-10-17T13:21:58Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2024-10-17T13:21:59Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

dnhatn

The new changes look good. The failing test appears to be relevant.

```
StandardVersusLogsIndexModeChallengeRestIT > testEsqlTermsAggregationByMethod FAILED
    org.elasticsearch.client.WarningFailureException: method [POST], host [http://[::1]:38381], URI [/_query], status line [HTTP/1.1 200 OK]
    Warnings: [Field [method] cannot be retrieved, it is unsupported or not indexed; returning null]
    {"took":39,"columns":[{"name":"count(*)","type":"long"},{"name":"method","type":"text"}],"values":[[129,null]]}
        at __randomizedtesting.SeedInfo.seed([7C34733CCD9E9B9A:AB344202DE16E96D]:0)
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
        at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esql(AbstractChallengeRestTest.java:290)
        at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esqlContender(AbstractChallengeRestTest.java:284)
        at app//org.elasticsearch.datastreams.logsdb.qa.StandardVersusLogsIndexModeChallengeRestIT.testEsqlTermsAggregationByMethod(StandardVersusLogsIndexModeChallengeRestIT.java:317)
```

martijnvg · 2024-10-17T16:02:26Z

> The failing test appears to be relevant.

Yes, the randomized test failure showed that TextFieldType#blockLoader(...) wasn't taking keyword sub fields into account when a text field isn't stored. I pushed c6e303e to address this.
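The fallback described here can be pictured as a small decision table. The sketch below is an assumption about the shape of that decision, not the actual logic of TextFieldType#blockLoader(...); the `Strategy` enum, the `choose` method, and the order of the checks are all hypothetical.

```java
// Toy decision table; the enum and method are hypothetical, not TextFieldType's real code.
final class TextBlockLoaderSketch {

    enum Strategy { STORED_FIELD, KEYWORD_SUB_FIELD, SOURCE }

    /** Picks where values for a text field would be read from in this simplified model. */
    static Strategy choose(boolean textFieldStored, boolean hasUsableKeywordSubField) {
        if (textFieldStored) {
            return Strategy.STORED_FIELD;
        }
        if (hasUsableKeywordSubField) {
            // The text field itself is not stored, but a keyword sub field carries the same values.
            return Strategy.KEYWORD_SUB_FIELD;
        }
        // Last resort: extract the value from _source.
        return Strategy.SOURCE;
    }

    public static void main(String[] args) {
        System.out.println(choose(false, true));
    }
}
```

In this model, the failing test corresponds to the middle branch being absent: a text field that is not stored but has a keyword sub field would have had no cheap way to produce values.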

martijnvg · 2024-10-18T05:48:14Z

Looks like the CI check here is stuck. All PR CI did complete successfully.

elasticsearchmachine · 2024-10-18T05:50:21Z

💚 Backport successful

Status	Branch	Result
✅	8.x

…equired stored fields (#115114)

If source is required by a block loader, then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise, with synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example, unmapped fields, or field values that exceed the configured ignore_above, will not appear in _source. This happens when field types fall back to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via #114903. That change only ensured that the _ignored_source field was added to the list of required stored fields. In reality, more fields could be required. This change is a better fix, since it also handles other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes #115076
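The idea of letting the source loader report its own required stored fields can be sketched as a simple set union. The class and method below are hypothetical and only model the merge step; they do not reproduce SourceLoader#requiredStoredFields(...) or ValuesSourceReaderOperator.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the merge step; not the actual ValuesSourceReaderOperator code.
final class RequiredStoredFieldsSketch {

    /**
     * Combines the stored fields requested by block loaders with those the source loader
     * says it needs, but only when at least one block loader requires _source.
     */
    static Set<String> merge(
        Set<String> requestedByBlockLoaders,
        boolean requiresSource,
        Set<String> requiredBySourceLoader
    ) {
        Set<String> merged = new TreeSet<>(requestedByBlockLoaders);
        if (requiresSource) {
            merged.addAll(requiredBySourceLoader);
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> fromSourceLoader = Set.of("_ignored_source", "some_stored_field");
        System.out.println(merge(Set.of("_id"), true, fromSourceLoader));
        System.out.println(merge(Set.of("_id"), false, fromSourceLoader));
    }
}
```

Compared with hard-coding `_ignored_source`, this shape leaves the list of fields to whichever source loader is in use, which matches the reasoning given for why the follow-up is the more general fix.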

martijnvg added the :Analytics/Compute Engine and :StorageEngine/Mapping labels Oct 16, 2024

elasticsearchmachine added the v9.0.0 label Oct 16, 2024

martijnvg added 2 commits October 16, 2024 16:41

iter

3bcf058

alter unit test

17d3059

martijnvg added the >non-issue label Oct 16, 2024

spotless

4db364a

martijnvg requested a review from dnhatn October 16, 2024 16:35

dnhatn approved these changes Oct 16, 2024

martijnvg added 4 commits October 17, 2024 08:55

Merge remote-tracking branch 'es/main' into esql_synthetic_source_load_ignored_source

980675e

Add sourceMode to SourceBlockLoader so that the right StoredFieldsSpec can be selected.

6702a2e

fixed comment

e4b08f4

fixed field mapper tests

4073c7e

martijnvg added the v8.16.0 and auto-backport labels Oct 17, 2024

martijnvg changed the title ~~Support reading ignored source as part of value source loading via block loaders.~~ Include loading ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. Oct 17, 2024

alter test based on previous commit

70e8d0f

martijnvg marked this pull request as ready for review October 17, 2024 13:21

martijnvg requested a review from a team as a code owner October 17, 2024 13:21

martijnvg requested a review from dnhatn October 17, 2024 13:21

elasticsearchmachine added the Team:Analytics label Oct 17, 2024

elasticsearchmachine added the Team:StorageEngine label Oct 17, 2024

martijnvg added 2 commits October 17, 2024 15:22

Merge remote-tracking branch 'es/main' into esql_synthetic_source_load_ignored_source

9da510b

added a random test that groups by a field that could be disabled

2880403

dnhatn approved these changes Oct 17, 2024

Take into account text fields with keyword sub fields.

c6e303e

If text fields are not stored, then keyword sub fields can be used to synthesize values for the parent text field.

Merge remote-tracking branch 'es/main' into esql_synthetic_source_load_ignored_source

b5dd9d6

martijnvg enabled auto-merge (squash) October 17, 2024 16:04

martijnvg changed the title ~~Include loading ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader.~~ Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. Oct 17, 2024

martijnvg added v8.17.0 and removed v8.16.0 labels Oct 17, 2024

martijnvg disabled auto-merge October 18, 2024 05:48

martijnvg merged commit c62a96c into elastic:main Oct 18, 2024
14 of 16 checks passed

martijnvg mentioned this pull request Oct 18, 2024

[8.x] Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. (#114903) #115064

Merged

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Oct 21, 2024

Undo changes in elastic#114903 and alter ValuesSourceReaderOperator to fully support synthetic source.

fae9ffb

martijnvg mentioned this pull request Oct 21, 2024

Sometimes delegate to SourceLoader in ValueSourceReaderOperator for required stored fields #115114

Merged

martijnvg mentioned this pull request Oct 23, 2024

[8.x] Sometimes delegate to SourceLoader in ValueSourceReaderOperator for required stored fields #115390

Merged
