Core: Support incremental compute for partition stats #12629
Don't we want this as a default predicate?
We could add it as a default filter:
Good point.
While computing incrementally, I observed that it can produce duplicate counts, so I added this.
I do have some gaps; I need to fully understand when and why we mark a manifest entry as existing.
Is there any scenario where we need to consider "existing" entries, or are just the "added" ones enough?
There is another check down below that considers both added and existing entries (added long ago):
iceberg/core/src/main/java/org/apache/iceberg/PartitionStatsUtil.java (line 103 in d54d81e)
I will update the code to keep just the added entries, and also add a test case of rewriting data files to ensure the stats are the same after the rewrite.
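For illustration, a rewrite test along those lines might look like the sketch below; `table`, `fileA`, `fileB`, and `compactedFile` are assumed test fixtures, and only the aggregate record count is compared since nothing here relies on PartitionStats equality semantics:

```java
// Sketch of the intended test, assuming a fixture table with two appended data files.
Collection<PartitionStats> before =
    PartitionStatsUtil.computeStats(table, table.currentSnapshot());

// Compact fileA and fileB into a single rewritten file; live data does not change.
table.newRewrite()
    .rewriteFiles(ImmutableSet.of(fileA, fileB), ImmutableSet.of(compactedFile))
    .commit();

Collection<PartitionStats> after =
    PartitionStatsUtil.computeStats(table, table.currentSnapshot());

// The rewrite must not change the data record count.
assertThat(after.stream().mapToLong(PartitionStats::dataRecordCount).sum())
    .isEqualTo(before.stream().mapToLong(PartitionStats::dataRecordCount).sum());
```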
Also, it looks like ManifestFile can have both added and existing entries together. So, instead of filtering here, I will keep filtering just at the entry level, down below in collectStatsForManifest.
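As a sketch of what that entry-level filtering could look like (the method shape, the `statsMap` accumulator, and the choice to keep only ADDED entries reflect the suggestion above rather than the PR's final code; ManifestEntry is package-private, so this assumes code living in org.apache.iceberg):

```java
private static void collectStatsForManifest(
    Table table, ManifestFile manifest, Map<StructLike, PartitionStats> statsMap) {
  try (ManifestReader<DataFile> reader = ManifestFiles.read(manifest, table.io())) {
    for (ManifestEntry<DataFile> entry : reader.entries()) {
      // A single manifest can mix ADDED and EXISTING entries (e.g. after
      // mergeAppend), so the status check has to happen per entry.
      if (entry.status() != ManifestEntry.Status.ADDED) {
        continue;
      }
      // ... accumulate record count, file count, and size into statsMap ...
    }
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
}
```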
What if we have compaction and expire snapshots? The new manifests would have the EXISTING entries?
What do we do with the stats of the removed files?
Let's say:
What happens with the stats in this case?
Compaction doesn't remove the data. If we expire S1 and S2, we don't have the previous snapshots/stats and we start fresh (i.e., a full compute).
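Roughly, that fallback could be sketched like this (where `latestStatsFile`, `fullCompute`, and `incrementalCompute` are hypothetical names used only to illustrate the idea):

```java
// If the snapshot the latest partition stats file was computed for has been
// expired, there is nothing to build on, so start fresh with a full compute.
Long previousSnapshotId = latestStatsFile == null ? null : latestStatsFile.snapshotId();
if (previousSnapshotId == null || table.snapshot(previousSnapshotId) == null) {
  return fullCompute(table, currentSnapshot);
}

// Otherwise only the snapshots committed after previousSnapshotId need processing.
return incrementalCompute(table, previousSnapshotId, currentSnapshot);
```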
If we don't expire data, could we detect that S3 is only a compaction commit, and the stats don't need to be changed?
What if S3 instead is a MoW commit? Can we detect the changes and calculate stats incrementally?
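For the first question, compaction commits are recorded with the replace operation, so detecting them could in principle look like this (a sketch of the reviewer's idea, not behavior this PR implements):

```java
// rewrite_data_files produces a "replace" snapshot that does not change live
// data, so the previous partition stats could be carried forward unchanged.
if (DataOperations.REPLACE.equals(snapshot.operation())) {
  // reuse the previous stats file as-is
} else {
  // append / overwrite / delete commits need an incremental (or full) recompute
}
```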
Compaction will be tested end-to-end while adding the Spark procedure.
For full compute, old manifest files will be marked as deleted and their entries will be reused as existing entries in the new manifest files, plus there may be additional added entries. So, for full compute we need to consider both existing and added entries.
For incremental compute, the old stats file already accounts for some entries which are now existing, so those existing entries have already been considered.
This all leads to the next question: what happens when a manifest is deleted? In that case we just update the snapshot entry (last modified) and do not decrement the stats. Hence, we should skip it for incremental compute as well.
All this logic is present in collectStatsForManifest, and the existing test cases (full compute and incremental) cover it, since they use mergeAppend, which produces manifests with a mix of added and existing entries.
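To make that distinction concrete, the per-entry handling described above could be sketched as follows; the `incremental` flag and `updateStats` helper are assumptions for illustration, not the PR's actual code:

```java
switch (entry.status()) {
  case ADDED:
    // newly added data files count for both full and incremental compute
    updateStats(statsMap, entry);
    break;
  case EXISTING:
    // entries carried over from deleted manifests (mergeAppend, compaction):
    // full compute must count them; incremental compute already has them in
    // the previous stats file
    if (!incremental) {
      updateStats(statsMap, entry);
    }
    break;
  case DELETED:
    // stats are not decremented; only the last-modified snapshot info changes
    break;
}
```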
We didn't need to decrement stats for full compute because we were discarding the deleted manifests and only considering live manifests.
Now, I am not really sure the current code will work for compaction. We may need to decrement stats just for incremental compute. I will test the compaction scenario tomorrow and handle this.
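If decrementing does turn out to be needed for the incremental path, one possible shape is sketched below; `decrement`, `partitionKey`, and the `incremental` flag are hypothetical helpers, and no such operation is confirmed on PartitionStats:

```java
// Hypothetical: for incremental compute only, subtract the contribution of
// data files that newer manifests mark as DELETED (e.g. the original files
// removed by a compaction).
if (incremental && entry.status() == ManifestEntry.Status.DELETED) {
  PartitionStats stats = statsMap.get(partitionKey(entry));
  if (stats != null) {
    decrement(stats, entry.file()); // subtract record count, file count, size
  }
}
```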