feat(dataobj): Add query stats collection to the dataobj readers #17128
What this PR does / why we need it:
This PR adds support for collecting various statistics when reading data objects. They are the equivalent of the chunk stats we already collect and are merged & contribute to summaries etc. as normal.
I added dataobj specific stats such as pages scanned, pages downloaded and number of batches fetched, as well as various stats we already collect from chunks such as lines processed, post-filter lines etc.
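To make the shape of these stats concrete, here is a minimal sketch; the struct and field names are illustrative, not the actual Loki proto definitions, and `Merge` stands in for however the real stats are folded into query summaries.

```go
package main

import "fmt"

// DataObjStats is a hypothetical per-reader stats holder; field names
// mirror the counters described above but are not the real proto fields.
type DataObjStats struct {
	PagesScanned    int64
	PagesDownloaded int64
	BatchesFetched  int64
	LinesProcessed  int64
	PostFilterLines int64
}

// Merge combines stats from another reader, mirroring how chunk stats
// are merged before contributing to summaries.
func (s *DataObjStats) Merge(o DataObjStats) {
	s.PagesScanned += o.PagesScanned
	s.PagesDownloaded += o.PagesDownloaded
	s.BatchesFetched += o.BatchesFetched
	s.LinesProcessed += o.LinesProcessed
	s.PostFilterLines += o.PostFilterLines
}

func main() {
	a := DataObjStats{PagesScanned: 4, LinesProcessed: 100, PostFilterLines: 10}
	b := DataObjStats{PagesScanned: 2, LinesProcessed: 50, PostFilterLines: 5}
	a.Merge(b)
	fmt.Println(a.PagesScanned, a.LinesProcessed, a.PostFilterLines) // 6 150 15
}
```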
There are some differences in the chunk-like stats. For data objects we read partial data to match rows against predicates, then fill in the rest of the line only for matching rows; for chunks we read everything and then decide whether to keep it. This means fields such as "lines scanned" aren't directly equivalent, and it's unclear whether they should report the pre-predicate or the post-predicate row count from the dataobj readers. I opted for the pre-predicate count because it covers the work we are actually doing. The bytes value should give an indication of whether we are doing too much pre- or post-predicate matching.
While I added protos for structured-metadata stats, they aren't easy to calculate, so I've omitted them for now. The generic reader where the predicate matching happens does not know whether a column is metadata, and since we don't return non-matching rows to the higher-level readers, we can't accurately determine how many columns are structured metadata and how many are log lines, stream IDs, or timestamps. I'm happy to remove this field entirely, but I think we should find a way to implement it.