Skip to content

Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. #114903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

martijnvg
Copy link
Member

@martijnvg martijnvg commented Oct 16, 2024

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in #114886 Thereby avoiding synthesizing the complete _source in order to get only one field.

…ock loaders.

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes `_ignored_source` field as a required stored field.

Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886
@martijnvg martijnvg added :Analytics/Compute Engine Analytics in ES|QL :StorageEngine/Mapping The storage related side of mappings labels Oct 16, 2024
@martijnvg martijnvg requested a review from dnhatn October 16, 2024 16:35
Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @martijnvg


/**
* Loads values from {@code _source}. This whole process is very slow and cast-tastic,
* so it doesn't really try to avoid megamorphic invocations. It's just going to be
* slow.
*/
public abstract class BlockSourceReader implements BlockLoader.RowStrideReader {

// _ignored_source is needed ofr synthetic source is needed for, in case stored source (default) is used,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: typo


// _ignored_source is needed ofr synthetic source is needed for, in case stored source (default) is used,
// then it just doesn't get loaded.
static final StoredFieldsSpec NEEDS_SOURCE_AND_IGNORED_SOURCE = new StoredFieldsSpec(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we should avoid requesting _source and only read _ignored_source when synthetic source is enabled. However, we should get this in to ensure correctness, and make the blockloader more selective in a follow-up. I can help with that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few unit compute engine tests failed, and instead of adjusting the tests, I made the block loaders a little more selective, based on whether synthetic source is used. See: 6702a2e

@martijnvg martijnvg added v8.16.0 auto-backport Automatically create backport pull requests when merged labels Oct 17, 2024
@martijnvg martijnvg changed the title Support reading ignored source as part of value source loading via block loaders. Include loading ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. Oct 17, 2024
@martijnvg martijnvg marked this pull request as ready for review October 17, 2024 13:21
@martijnvg martijnvg requested a review from a team as a code owner October 17, 2024 13:21
@martijnvg martijnvg requested a review from dnhatn October 17, 2024 13:21
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new changes look good. The failing test appears to be relevant.

StandardVersusLogsIndexModeChallengeRestIT > testEsqlTermsAggregationByMethod FAILED
    org.elasticsearch.client.WarningFailureException: method [POST], host [http://[::1]:38381], URI [/_query], status line [HTTP/1.1 200 OK]
    Warnings: [Field [method] cannot be retrieved, it is unsupported or not indexed; returning null]
    {"took":39,"columns":[{"name":"count(*)","type":"long"},{"name":"method","type":"text"}],"values":[[129,null]]}
        at __randomizedtesting.SeedInfo.seed([7C34733CCD9E9B9A:AB344202DE16E96D]:0)
        at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
        at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
        at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esql(AbstractChallengeRestTest.java:290)
        at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esqlContender(AbstractChallengeRestTest.java:284)
        at app//org.elasticsearch.datastreams.logsdb.qa.StandardVersusLogsIndexModeChallengeRestIT.testEsqlTermsAggregationByMethod(StandardVersusLogsIndexModeChallengeRestIT.java:317)
```

If text fields are not stored, then keyword sub fields
can be used to syntesize fields for text parent field.
@martijnvg
Copy link
Member Author

The failing test appears to be relevant.

Yes, the randomized test failure showed that TextFieldType#blockLoader(...) wasn't taking into account keyword sub fields if a text field didn't store source. I pushed c6e303e to address this.

@martijnvg martijnvg enabled auto-merge (squash) October 17, 2024 16:04
@martijnvg martijnvg changed the title Include loading ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. Oct 17, 2024
@martijnvg martijnvg added v8.17.0 and removed v8.16.0 labels Oct 17, 2024
@martijnvg
Copy link
Member Author

Looks like the CI check here is stuck. All PR CI did complete successfully.

@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x

martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Oct 18, 2024
…ReaderOperator via BlockSourceReader. (elastic#114903)

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
elasticsearchmachine pushed a commit that referenced this pull request Oct 18, 2024
…ReaderOperator via BlockSourceReader. (#114903) (#115064)

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in #114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
lkts pushed a commit to lkts/elasticsearch that referenced this pull request Oct 18, 2024
…ReaderOperator via BlockSourceReader. (elastic#114903)

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Oct 21, 2024
martijnvg added a commit that referenced this pull request Oct 23, 2024
…equired stored fields (#115114)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via #114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes #115076
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Oct 23, 2024
…equired stored fields (elastic#115114)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes elastic#115076
elasticsearchmachine pushed a commit that referenced this pull request Oct 23, 2024
…equired stored fields (#115114) (#115390)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via #114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes #115076
smalyshev pushed a commit to smalyshev/elasticsearch that referenced this pull request Oct 23, 2024
…equired stored fields (elastic#115114)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes elastic#115076
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…ReaderOperator via BlockSourceReader. (elastic#114903)

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
…equired stored fields (elastic#115114)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes elastic#115076
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…ReaderOperator via BlockSourceReader. (elastic#114903)

Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.

This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.

Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
…equired stored fields (elastic#115114)

If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source.

This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded.

This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed.

Closes elastic#115076
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Compute Engine Analytics in ES|QL auto-backport Automatically create backport pull requests when merged >non-issue :StorageEngine/Mapping The storage related side of mappings Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine v8.17.0 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants