-
Couldn't load subscription status.
- Fork 118
feat: Add latest_commit_file field to LogSegment #1364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add latest_commit_file field to LogSegment #1364
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great. I think we can just be a little more efficient about cloning the paths.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1364 +/- ##
==========================================
+ Coverage 84.87% 84.89% +0.01%
==========================================
Files 113 113
Lines 28796 28923 +127
Branches 28796 28923 +127
==========================================
+ Hits 24442 24553 +111
- Misses 3197 3198 +1
- Partials 1157 1172 +15 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
kernel/src/log_segment/tests.rs
Outdated
| // Should error because there are no commits | ||
| assert_result_error_with_message(result, "LogSegment requires at least one commit"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could break us if we integrate with something like deltasharing where they use us for reads and take out the minimal log segment.
In other words, the full delta log may have commit.jsons, but delta-sharing gives us a single checkpoint pre-signed URL for a read.
This really should fail for writes if ICT is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just assert that log_segment.latest_commit.is_none()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a case like this:
00000000000000000000.json
00000000000000000001.checkpoint.parquet
=> this has an empty latest_commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting point, so looking around it seems Kernel Java also fails if we don't have a commit available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with leaving it empty and failing on ICT write. But we should then make LogSegment last_commit_file optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. lgtm once we add the tests that oussama requested.
| writer.close()?; | ||
|
|
||
| store | ||
| store_3a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this meant to be here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, i think its a typo in the original code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh weird that this wasn't caught. cool 👍
| if let Some(commit_file) = ascending_commit_files.last() { | ||
| latest_commit_file = Some(commit_file.clone()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're both Option, so we can just cloned.
| if let Some(commit_file) = ascending_commit_files.last() { | |
| latest_commit_file = Some(commit_file.clone()); | |
| } | |
| latest_commit_file = ascending_commit_files.last().cloned(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to clone it if its empty. With 1.json, 1.checkpoint when we get to this part of the code ascending_commits is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like someone in the future could break it very easily. can we make this the else condition on if let Some((_, complete_checkpoint)) = group_checkpoint_parts(new_checkpoint_parts).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is similar to what I had before, but @nicklan mentioned that it would be setting the latest_commit at every commit unnecessarily then, instead of only setting it at checkpoints and at the end.
I'll clarify the comments.
| let latest_commit_file = | ||
| new_latest_commit_file.or_else(|| old_log_segment.latest_commit_file.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need tests for try_new_from changes right?
What changes are proposed in this pull request?
Add
latest_commitfield toLogSegmentto prepare for in-commit timestamp (ICT) support. When ICT is enabled, we need access to the latest commit file to read timestamp information. We cannot depend on ascending commits because commits will be filtered out by checkpoint processing when checkpoint version equals commit version.Changes
latest_commit_file: Option<ParsedLogPath>field toLogSegment(optional)latest_commit_file: Option<ParsedLogPath>field toListedLogFiles(optional)latest_commitfrom existing snapshot when new listing is emptyHow was this change tested?
Added and changed tests to verify latest commit is passed through