-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. #114903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include ignored source as part of loading field values in ValueSourceReaderOperator via BlockSourceReader. #114903
Conversation
…ock loaders. Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes `_ignored_source` field as a required stored field. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @martijnvg
|
||
/** | ||
* Loads values from {@code _source}. This whole process is very slow and cast-tastic, | ||
* so it doesn't really try to avoid megamorphic invocations. It's just going to be | ||
* slow. | ||
*/ | ||
public abstract class BlockSourceReader implements BlockLoader.RowStrideReader { | ||
|
||
// _ignored_source is needed ofr synthetic source is needed for, in case stored source (default) is used, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: typo
|
||
// _ignored_source is needed ofr synthetic source is needed for, in case stored source (default) is used, | ||
// then it just doesn't get loaded. | ||
static final StoredFieldsSpec NEEDS_SOURCE_AND_IGNORED_SOURCE = new StoredFieldsSpec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we should avoid requesting _source and only read _ignored_source
when synthetic source is enabled. However, we should get this in to ensure correctness, and make the blockloader more selective in a follow-up. I can help with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few unit compute engine tests failed, and instead of adjusting the tests, I made the block loaders a little more selective, based on whether synthetic source is used. See: 6702a2e
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new changes look good. The failing test appears to be relevant.
StandardVersusLogsIndexModeChallengeRestIT > testEsqlTermsAggregationByMethod FAILED
org.elasticsearch.client.WarningFailureException: method [POST], host [http://[::1]:38381], URI [/_query], status line [HTTP/1.1 200 OK]
Warnings: [Field [method] cannot be retrieved, it is unsupported or not indexed; returning null]
{"took":39,"columns":[{"name":"count(*)","type":"long"},{"name":"method","type":"text"}],"values":[[129,null]]}
at __randomizedtesting.SeedInfo.seed([7C34733CCD9E9B9A:AB344202DE16E96D]:0)
at app//org.elasticsearch.client.RestClient.convertResponse(RestClient.java:347)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
at app//org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esql(AbstractChallengeRestTest.java:290)
at app//org.elasticsearch.datastreams.logsdb.qa.AbstractChallengeRestTest.esqlContender(AbstractChallengeRestTest.java:284)
at app//org.elasticsearch.datastreams.logsdb.qa.StandardVersusLogsIndexModeChallengeRestIT.testEsqlTermsAggregationByMethod(StandardVersusLogsIndexModeChallengeRestIT.java:317)
```
If text fields are not stored, then keyword sub fields can be used to syntesize fields for text parent field.
Yes, the randomized test failure showed that |
Looks like the CI check here is stuck. All PR CI did complete successfully. |
💚 Backport successful
|
…ReaderOperator via BlockSourceReader. (elastic#114903) Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source. Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
…ReaderOperator via BlockSourceReader. (#114903) (#115064) Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source. Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in #114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
…ReaderOperator via BlockSourceReader. (elastic#114903) Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source. Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
…o fully support synthetic source.
…equired stored fields (#115114) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via #114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes #115076
…equired stored fields (elastic#115114) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes elastic#115076
…equired stored fields (#115114) (#115390) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via #114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes #115076
…equired stored fields (elastic#115114) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes elastic#115076
…ReaderOperator via BlockSourceReader. (elastic#114903) Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source. Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
…equired stored fields (elastic#115114) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes elastic#115076
…ReaderOperator via BlockSourceReader. (elastic#114903) Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields. This change includes _ignored_source field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source. Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in elastic#114886 Thereby avoiding synthesizing the complete _source in order to get only one field.
…equired stored fields (elastic#115114) If source is required by a block loader then the StoredFieldsSpec that gets populated should be enhanced by SourceLoader#requiredStoredFields(...) in ValuesSourceReaderOperator. Otherwise in case of synthetic source many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example when unmapped fields exist or field values that exceed configured ignore above will not appear is _source. This happens when field types fallback to a block loader implementation that uses _source. The required field values are then extracted from the source once loaded. This change also reverts the production code changes introduced via elastic#114903. That change only ensured that _ignored_source field was added to the required list of stored fields. In reality more fields could be required. This change is better fix, since it handles also other cases and the SourceLoader implementation indicates which stored fields are needed. Closes elastic#115076
Currently, in compute engine when loading source if source mode is synthetic, the synthetic source loader is already used. But the ignored_source field isn't always marked as a required source field, causing the source to potentially miss a lot of fields.
This change includes
_ignored_source
field as a required stored field and allowing keyword fields without doc values or stored fields to be used in case of synthetic source.Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. In case of synthetic source we already keep ignored field/values in a special place, named ignored source. Long term in case of synthetic source we should only load ignored source in case a field has no doc values or stored field. Like is being explored in #114886 Thereby avoiding synthesizing the complete _source in order to get only one field.