Skip to content

feat(dataobj): Add query stats collection to the dataobj readers #17128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 14, 2025

Conversation

benclive
Copy link
Contributor

What this PR does / why we need it:
This PR adds support for collecting various statistics when reading data objects. They are the equivalent of the chunk stats we already collect and are merged & contribute to summaries etc. as normal.

I added dataobj specific stats such as pages scanned, pages downloaded and number of batches fetched, as well as various stats we already collect from chunks such as lines processed, post-filter lines etc.

There are several differences in the chunk-like stats: In dataobjects we read partial data in order to match against predicates and then fill in the rest of the line for matching rows. In chunks, we read everything and then decide whether to keep it. This means fields such as "lines scanned" aren't directly equivalent and it's unclear whether to use the pre-predicate row count or the post-predicate row count from dataobj. I opted for pre-, because that covers the work we are doing. The bytes value should give an indicate of if we are doing too much pre- or post- predicate matching.

While I added protos for structured metadata stats, they aren't easy to calculate so I've omitted them for now. The generic reader where the predicate matching happens does not know if a column is metadata or not, and as we don't return non-matching rows to the higher level readers, we can't accurately calculate how many columns are structured metadata and how many are log lines / stream IDs / timestamps. I am happy to remove this field completely but I think we should find a way to implement it.

@benclive benclive requested a review from a team as a code owner April 11, 2025 13:50
Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, but need to fix some tests

@benclive benclive force-pushed the instrument-dataobj-reader-stats branch from 8ccdf54 to e46bf85 Compare April 14, 2025 13:19
@benclive benclive merged commit 9e68fb0 into main Apr 14, 2025
61 checks passed
@benclive benclive deleted the instrument-dataobj-reader-stats branch April 14, 2025 14:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants