
Conversation

@zachschuermann
Member

@zachschuermann zachschuermann commented Nov 27, 2024

This PR enables incremental snapshot updates. This is done with a new Snapshot::try_new_from(...) which takes an Arc<Snapshot> and an optional version (None = latest version) and incrementally creates a new snapshot from the existing one. The heuristic is as follows (a condensed sketch of the flow appears after the list):

  1. if the new version == existing version, just return the existing snapshot
  2. if the new version < existing version, error since the engine shouldn't really be here
  3. list from (existing checkpoint version + 1, or version 1 if no checkpoint) onward (create a new 'incremental' LogSegment)
  4. if no new commits/checkpoint, return existing snapshot (if requested version matches), else create new LogSegment
  5. check for a checkpoint:
    a. if new checkpoint is found: just create a new snapshot from that checkpoint (and commits after it)
    b. if no new checkpoint is found: do lightweight P+M replay on the latest commits
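For readers skimming the diff, here is a condensed sketch of that flow. It borrows the crate's Snapshot, Engine, Version, DeltaResult, Error, and LogSegment names as used in this PR, elides the real argument lists of LogSegment::for_versions and Snapshot::try_new, and folds step 5b into a hypothetical helper (new_from_incremental_segment) named here only for illustration; the merged code differs in detail.

impl Snapshot {
    pub fn try_new_from(
        existing: Arc<Snapshot>,
        engine: &dyn Engine,
        version: Option<Version>, // None = latest version
    ) -> DeltaResult<Arc<Snapshot>> {
        let old_version = existing.version();
        match version {
            // (1) same version requested: reuse the existing snapshot
            Some(v) if v == old_version => return Ok(existing),
            // (2) older version requested: error, there is nothing to incrementalize
            Some(v) if v < old_version => {
                return Err(Error::Generic(format!(
                    "Requested version {v} is older than the existing snapshot version {old_version}"
                )));
            }
            _ => {}
        }
        // (3) list from (existing checkpoint version + 1), or version 1 if no checkpoint
        let start = existing.log_segment.checkpoint_version.map_or(1, |cp| cp + 1);
        let new_segment = LogSegment::for_versions(/* fs_client, log_root, */ start, version)?;
        // (4) nothing newer than what we already have: return the existing snapshot
        //     (the real code also validates that an explicitly requested version was found)
        if new_segment.end_version <= old_version {
            return Ok(existing);
        }
        if new_segment.checkpoint_version.is_some() {
            // (5a) a new checkpoint landed: build a snapshot from it the normal way
            return Self::try_new(existing.table_root().clone(), engine, version).map(Arc::new);
        }
        // (5b) no new checkpoint: lightweight P&M replay over just the new commits
        Self::new_from_incremental_segment(existing, engine, new_segment).map(Arc::new)
    }
}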

In addition to the 'main' Snapshot::try_new_from() API, the following incremental APIs were introduced to support the above implementation:

  1. TableConfiguration::try_new_from(...)
  2. splitting LogSegment::read_metadata() into LogSegment::read_metadata() and LogSegment::protocol_and_metadata()
  3. new LogSegment.checkpoint_version field

resolves #489

@zachschuermann
Member Author

zachschuermann commented Nov 27, 2024

curious if anyone has naming thoughts! EDIT: landed on try_new_from()

@zachschuermann zachschuermann requested review from OussamaSaoudi-db, nicklan and scovich and removed request for nicklan and scovich November 27, 2024 20:52
@codecov

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 92.07459% with 34 lines in your changes missing coverage. Please review.

Project coverage is 84.80%. Comparing base (9d36e25) to head (cb06927).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
kernel/src/snapshot.rs 89.96% 4 Missing and 26 partials ⚠️
kernel/src/table_changes/mod.rs 66.66% 0 Missing and 2 partials ⚠️
kernel/src/log_segment.rs 96.29% 1 Missing ⚠️
kernel/src/table_configuration.rs 98.95% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #549      +/-   ##
==========================================
+ Coverage   84.65%   84.80%   +0.15%     
==========================================
  Files          83       83              
  Lines       19839    20247     +408     
  Branches    19839    20247     +408     
==========================================
+ Hits        16795    17171     +376     
- Misses       2221     2225       +4     
- Partials      823      851      +28     


Collaborator

@scovich scovich left a comment

A general question I have: What should a listing optimization even look like for a snapshot refresh? If the snapshot is not very old, then we should just LIST to find new commit .json after the end of the current segment, and not even try to find new checkpoints. Quick, easy.

Also, the "append new deltas" approach is friendly to the "partial P&M query" optimization, which is only applicable if we have a contiguous chain of commits back to the previous snapshot version -- a newer checkpoint would actually force us to do the full P&M query all over, which for a large checkpoint could be annoying.

On the other hand, if there is a newer checkpoint available, then data skipping will be more efficient if we use it (fewer jsons to replay serially and keep track of). This is especially true if a lot of versions have landed since the original snapshot was taken.

Problem is, there's no way to know in advance whether the snapshot is "stale" because it's by number of versions that land, not elapsed time.

Complicated stuff...

existing_snapshot: &Snapshot,
engine: &dyn Engine,
version: Option<Version>,
) -> DeltaResult<Self> {
Collaborator

Seems like the method should take+return Arc<Snapshot> so we have the option to return the same snapshot if we determine it is still fresh?

Collaborator

Maybe even do

pub fn refresh(self: &Arc<Self>, ...) -> DeltaResult<Arc<Self>>

(this would have slightly different intuition than new_from -- refresh specifically assumes I want a newer snapshot, if available, and attempting to request an older version may not even be legal; I'm not sure if it would even make sense to pass an upper bound version for a refresh operation)

Member Author

I've modified it to take + return Arc<Snapshot>, but I've avoided calling it refresh since that feels to me like it implies mutability. I'm in favor of new_from since that's saying you get a new snapshot, just 'from' an older one. Let me know if you agree with that thinking!

let start_snapshot =
Snapshot::try_new(table_root.as_url().clone(), engine, Some(start_version))?;
let end_snapshot = Snapshot::try_new(table_root.as_url().clone(), engine, end_version)?;
let end_snapshot = Snapshot::new_from(&start_snapshot, engine, end_version)?;
Collaborator

This opens an interesting question... if we knew that new_from would reuse the log checkpoint and just "append" any new commit .json files to the log segment, then we could almost (**) reuse that log segment for the CDF replay by just stripping out its checkpoint files? But that's pretty CDF specific; in the normal case we want a refresh to use the newest checkpoint available because it makes data skipping log replay cheaper. Maybe the CDF case needs a completely different way of creating the end_snapshot, unrelated to this optimization here.

(**) Almost, because the start version might have a checkpoint, in which case stripping the checkpoint out of the log segment would also remove the start version. But then again, do we actually want the older snapshot to be the start version? Or the previous version which the start version is making changes to? Or, maybe we should just restrict the checkpoint search to versions before the start version, specifically so that this optimization can work.

Collaborator

do we actually want the older snapshot to be the start version?

It would be sufficient to have the older snapshot be start_version-1 as long as we also have access to the commit at start_version. With these, we would start P&M at start_version then continue it on the older snapshot if we don't find anything.

I guess this would look like: snapshot(start_version-1).refresh_with_commits(end_version)

After all, the goal of the start_snapshot is just to ensure that CDF is enabled.


/// Create a new [`Snapshot`] instance from an existing [`Snapshot`]. This is useful when you
/// already have a [`Snapshot`] lying around and want to do the minimal work to 'update' the
/// snapshot to a later version.
Collaborator

Just to clarify, is this api only for versions later than the existing snapshot?

Member Author

Yep, for now I'm proposing that we allow an older snapshot but just return a new snapshot (no incrementalization); maybe warn! in that case? Or I suppose we could disallow that..?

Collaborator

Is there any valid scenario where a caller could legitimately pass a newer snapshot than the one they're asking for? I guess time travel? But if they know they're time traveling why would they pass a newer snapshot in the first place?

Either way, we should publicly document whether a too-new starting snapshot is an error or merely a useless hint, so callers don't have to wonder.

Member Author

I can't think of any optimization that's available (let alone useful) if the caller passes in a new snapshot as the hint.

If that's true, then the question is: do we prohibit this behavior or just let it degenerate to the usual try_new the client should have done anyways?

Collaborator

I would vote for returning an error in that case. It's unlikely the engine meant to get into that situation, so let's let them know they are doing something wrong

Member Author

updated to be an error now! i agree :)

@zachschuermann zachschuermann changed the title feat: new Snapshot::new_from() API feat: Snapshot::new_from() API Mar 10, 2025
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Mar 17, 2025
@zachschuermann zachschuermann force-pushed the snapshot-from-snapshot branch from 309f1ad to 5480711 Compare March 18, 2025 16:41
@zachschuermann
Member Author

A general question I have: What should a listing optimization even look like for a snapshot refresh?

@scovich for now (after a brief chat with @roeap) I propose doing a simple heuristic based on the presence of a checkpoint, and we can take on further optimization in the future. The heuristic is (step 3b is sketched after the list):

  1. if the new version < existing version, just return an entirely new snapshot
  2. if the new version == existing version, just return the existing snapshot
  3. list from existing snapshot version
    a. if new checkpoint is found: just create a new snapshot from that checkpoint (and commits after it)
    b. if no new checkpoint is found: do lightweight P+M replay on the latest commits to incrementally update the Snapshot
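To make step 3b concrete, here is a rough sketch of the lightweight P&M replay. The helper names (LogSegment::protocol_and_metadata, TableConfiguration::try_new_from) are the ones this PR introduces, but the argument lists and return shape shown here are simplified guesses, not the merged signatures.

// Replay only the commits newer than the existing snapshot, looking for
// Protocol/Metadata actions; anything not found there is inherited from the
// existing snapshot's TableConfiguration.
let (new_metadata, new_protocol) = incremental_segment.protocol_and_metadata(engine)?;
let table_configuration = TableConfiguration::try_new_from(
    existing_snapshot.table_configuration(), // reuse the old P&M where nothing changed
    new_metadata,                            // Some(_) only if a newer Metadata action landed
    new_protocol,                            // Some(_) only if a newer Protocol action landed
    new_version,
)?;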

@zachschuermann zachschuermann requested a review from roeap March 18, 2025 16:49
Collaborator

@roeap roeap left a comment

I believe we may need to merge LogSegments in the case when no checkpoint is contained in the incremental log slice.



Some(v) if v < existing_snapshot.version() => {
Self::try_new(existing_snapshot.table_root().clone(), engine, version).map(Arc::new)
}
Some(v) if v == existing_snapshot.version() => Ok(existing_snapshot.clone()),
Collaborator

Tiny nit: I'd put this one first?

Collaborator

Actually, I wonder if a match is really that helpful here, especially given that LogSegment::for_versions needs to handle the no-change case?

let old_version = existing_snapshot.version();
if let Some(new_version) = version {
    if new_version == old_version {
        // Re-requesting the same version
        return Ok(existing_snapshot.clone());
    }
    if new_version < old_version {
        // Hint is too new, just create a new snapshot the normal way
        return Self::try_new(...).map(Arc::new);
    }
}
    
// Check for new commits
let (mut new_ascending_commit_files, checkpoint_parts) =
    list_log_files_with_version(fs_client, &log_root, Some(start_version), end_version)?;

if new_ascending_commit_files.is_empty() {
    // No new commits, just return the same snapshot
    return Ok(existing_snapshot.clone());
}

if !checkpoint_parts.is_empty() {
    // We found a checkpoint, so just create a new snapshot the normal way
    return Self::try_new(...).map(Arc::new);
}    

// Append the new commits to the existing LogSegment
let checkpoint_parts = existing_snapshot.log_segment.checkpoint_parts.clone();
let mut ascending_commit_files = existing_snapshot.log_segment.ascending_commit_files.clone();
ascending_commit_files.extend(new_ascending_commit_files);
let new_log_segment = LogSegment::try_new(
    ascending_commit_files, 
    checkpoint_parts, 
    log_root,
    version,
);

Avoids the indirection and complexity of building a suffix log segment... but then we don't have an easy way to do the incremental P&M :(

Member Author

I played around with this some and refactored. Critically, I'm still leveraging a LogSegment, but we have a new Error::EmptyLogSegment that we can specifically check for. I like the idea of (1) still using LogSegment and (2) having this error capture the empty case without having to modify the semantics of LogSegment. BUT I dislike having to introduce a new pub Error variant. I didn't do the legwork to have a private error here; wanted to gather some feedback on the overall approach first (rough sketch below).
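The rough shape at this point in the review, purely for illustration: Error::EmptyLogSegment is the variant described above, while the surrounding call and control flow are assumptions (later iterations of the PR replaced this with an end_version comparison).

// Try to build the incremental segment; an empty listing is not a failure,
// it just means the existing snapshot is still current.
match LogSegment::for_versions(fs_client.as_ref(), log_root, start_version, version) {
    Ok(new_log_segment) => {
        // proceed with the refresh using the incremental segment
    }
    Err(Error::EmptyLogSegment) => return Ok(existing_snapshot.clone()),
    Err(e) => return Err(e),
}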

}
}
let (mut ascending_commit_files, checkpoint_parts) =
list_log_files_with_version(fs_client, &log_root, Some(start_version), end_version)?;
Collaborator

What happens if the table has not changed? I somehow doubt LogSegment::try_new would like the empty file listing that results?

Member Author

New EmptyLogSegment error that we can explicitly leverage (instead of having to change LogSegment semantics). I just dislike it being pub... see other comment above.

let new_log_segment = LogSegment::for_versions(
fs_client.as_ref(),
log_root,
existing_snapshot.version() + 1,
Collaborator

If we incrementally update frequently, we would never see any new checkpoints because those always land with some delay after the commit. Should we LIST from the existing snapshot's checkpoint version + 1 in order to detect and take advantage of new checkpoints that landed in the existing log segment? It would be pretty easy to construct the new log segment in that case, just take the existing log segment's checkpoint and the listing's commit files.

We could also get fancy and only look for checkpoints in some recent window, so that we only pick up a new checkpoint if the one we know about is too many commits behind?

We could still do the same incremental P&M either way, it just takes a bit more care handling the commit lists.

Member Author

Yea, this definitely sounds reasonable, though I'm inclined to do something 'simple'(ish) here (as long as we aren't blocking such an optimization) and track this as a follow-up?

@zachschuermann zachschuermann changed the title feat: Snapshot::new_from() API feat: Snapshot::try_new_from() API Mar 21, 2025
Collaborator

@nicklan nicklan left a comment

lgtm! thanks

/// We implement a simple heuristic:
/// 1. if the new version == existing version, just return the existing snapshot
/// 2. if the new version < existing version, error: there is no optimization to do here
/// 3. list from (existing snapshot version + 1) onward
Collaborator

Rescuing #549 (comment) from github oblivion...

If we incrementally update frequently, we would never see any new checkpoints because those always land with some delay after the commit.

With the current code arrangement, I believe it would be quite simple to reliably pick up new checkpoints:

The difference in file counts returned by the LIST should be small enough that the overall cost is still dominated by the network round trip.

NOTE: In its DeltaLog::update method, delta-spark has always listed from the existing checkpoint version, and it lacks the incremental P&M optimization.

Member Author

Yea sounds good - this ended up being a reasonable change. In the case where we don't have a checkpoint, I list from the end of the existing snapshot's commits (like we used to). Note that this means:

  1. we check for commit files being the same as well as being empty (same = case of list from checkpoint, empty = list from end of commits and there are no new commits)
  2. I introduced a simple checkpoint_version API for LogSegment (could just leverage checkpoint_parts.first().version etc., but this seemed cleaner/generally useful; see the small sketch below)
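For context, the checkpoint_version value falls directly out of the checkpoint parts when the segment is assembled, along the lines of the filtering code quoted later in this thread (a sketch, not the literal field initialization in the PR):

// version of the (complete) checkpoint this segment starts from, if any
let checkpoint_version = checkpoint_parts
    .first()
    .map(|checkpoint_file| checkpoint_file.version);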

Member Author

(need to add some more tests though)

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Mar 28, 2025
@zachschuermann zachschuermann requested a review from scovich March 28, 2025 19:35
Collaborator

@scovich scovich left a comment

LGTM. We should probably just address https://github.com/delta-io/delta-kernel-rs/pull/549/files#r2018928167 before merging, since it should be very few LoC change?

Comment on lines +130 to +132
// NB: we need to check both checkpoints and commits since we filter commits at and below
// the checkpoint version. Example: if we have a checkpoint + commit at version 1, the log
// listing above will only return the checkpoint and not the commit.
Collaborator

Is this (still) true? What code does the filtering? I thought the filtering happens in the call to LogSegment::try_new, which didn't happen yet? Or does the log listing also filter and the log segment filtering is just a safety check?

Member Author

@zachschuermann zachschuermann Mar 28, 2025

yup it looks like in list_log_files_with_version we have:

// [snip]
checkpoint_parts = complete_checkpoint;
commit_files.clear(); // Log replay only uses commits after a complete checkpoint
// [snip]

and again in LogSegment::try_new() (which I moved around a little but was there in the control flow anyways):

// Commit file versions must be greater than the most recent checkpoint version if it exists
let checkpoint_version = checkpoint_parts.first().map(|checkpoint_file| {
    ascending_commit_files.retain(|log_path| checkpoint_file.version < log_path.version);
    checkpoint_file.version
});

Member Author

@zachschuermann zachschuermann Mar 28, 2025

I made an issue for my TODO comment, but this seems related; adding a bit there: #778

@zachschuermann
Member Author

LGTM. We should probably just address https://github.com/delta-io/delta-kernel-rs/pull/549/files#r2018928167 before merging, since it should be very few LoC change?

My GitHub wasn't properly resolving that link, but I think this refers to the comment above on list_from(last checkpoint version)? That ended up being a reasonable change and may have just snuck in before you refreshed for your last review!

Collaborator

@scovich scovich left a comment

This last review uncovered a few non-trivial things I missed before... consider carefully whether they are best addressed before or after merging.

if let Some(checkpoint_version) = existing_snapshot.log_segment.checkpoint_version {
checkpoint_version + 1
} else {
old_version + 1
Collaborator

If there was no checkpoint before, then that means the log segment goes all the way back to commit 0; we should list from 1 in that case, no?

Collaborator

Actually, we should probably just list from 0, to eliminate the corner case at L147-150 below?
(then we'd only have to check whether old file list equals new file list)

Update: Superseded by other comments below


/// Return whether or not the LogSegment contains a checkpoint.
pub(crate) fn has_checkpoint(&self) -> bool {
!self.checkpoint_parts.is_empty()
Collaborator

nit

Suggested change
!self.checkpoint_parts.is_empty()
self.checkpoint_version.is_some()

Collaborator

@scovich scovich Mar 29, 2025

(or, just do that at the one call site, instead of defining a helper at all)

Member Author

Yea, realized it's probably not necessary, though I've kept the `!is_empty()` since it's a vec.

Comment on lines 147 to 150
let has_new_commits = new_ascending_commit_files
== existing_snapshot.log_segment.ascending_commit_files
|| new_ascending_commit_files.is_empty();
if has_new_commits && checkpoint_parts.is_empty() {
Collaborator

I just realized there's a corner case here. For a slow-moving table, a metadata cleanup operation could remove a bunch of older commits (and checkpoints), without any new commits being added.

There would be at least one new checkpoint in that case, though. I think the existing code would handle it correctly, by creating a new snapshot at L183 below? But it wouldn't be optimal, because if there were no new commits we shouldn't need to create a new snapshot in the first place.

Collaborator

Actually... It doesn't cost any I/O to create the new log segment, right? Can we just create it and then bail out unless new_log_segment.end_version > old_version? That should handle both commits and checkpoints gracefully, and would also handle the case where Some requested new_version mismatches what was actually found.
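A sketch of the bail-out being proposed here (names follow the surrounding discussion; this is not the literal merged code):

// the segment is built from the listing we already performed, so this check
// costs no additional I/O
if new_log_segment.end_version <= old_version {
    // nothing newer than the existing snapshot: covers both "no new commits"
    // and "only a late-arriving checkpoint for an already-known version"
    return Ok(existing_snapshot.clone());
}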

Member Author

I think with this approach we get back to the same issue as before with needing to handle the "no log files" case: a LogSegment is a non-zero-length chunk of the log. If we create a zero-length one, then we get an error and need to handle that specifically (which we can do, like I had with an EmptyLogSegment error for the constructor that we can specifically handle), but the current approach was meant to alleviate that complexity.

// if there's new protocol: have to ensure read supported
let protocol = match new_protocol {
Some(protocol) => {
protocol.ensure_read_supported()?;
Collaborator

Given that we're "supporting" some table features (like check constraints) only if not actually used... I think we also need to ensure_read_supported if metadata changes?

Member Author

Actually, our ensure_read_supported only depends on the protocol right now, it looks like. We have TableConfiguration.ensure_write_supported, which is a function of Protocol and Metadata, but we don't yet have that for 'read supported'. I wonder if we should go ahead and introduce that abstraction and for now just pass through to protocol.ensure_read_supported? (Rough sketch of that shape below.)
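Purely to illustrate the abstraction being floated here; nothing below is claimed to exist in the crate today, and the protocol() accessor is an assumption.

impl TableConfiguration {
    /// Hypothetical read-side counterpart to ensure_write_supported: today it
    /// would only consult the protocol, but metadata-dependent checks could be
    /// added later without changing callers.
    pub(crate) fn ensure_read_supported(&self) -> DeltaResult<()> {
        self.protocol().ensure_read_supported()
    }
}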

Collaborator

Oh, because column constraints are only a writer feature... do we not have any reader-writer features whose validity checks depend on metadata? Seems like column mapping needs to validate the schema annotations for example?

Member Author

Not that I can tell: even for col mapping we just check that the feature is enabled to say that 'reads are supported'; I think if there are incorrect schema annotations it would fail downstream.

Collaborator

I'm pretty sure that validation happens in the TableConfiguration constructor? At least, that's where we originally planned to put it.

Member Author

Yep, you're right - my mistake, lol, I already included that in the new code (just forgot about it, oops).

In both TableConfiguration::try_new and try_new_from (new code) we do a protocol.ensure_read_supported and a validate_schema_column_mapping. I wonder if we could do better here by modifying the constructor so that we can (1) have someone upstream do the parsing legwork and (2) leverage the constructor directly in try_new_from?

std::cmp::Ordering::Greater => (), // expected
}

if !new_log_segment.checkpoint_parts.is_empty() {
Collaborator

@scovich scovich Apr 7, 2025

tiny nit: easier to read like this?

Suggested change
if !new_log_segment.checkpoint_parts.is_empty() {
if new_log_segment.checkpoint_version.is_some() {

Member Author

ah yea!

Collaborator

@scovich scovich left a comment

Few last nits for code readability, but looks great!

Comment on lines 175 to 178
return Err(Error::Generic(format!(
"Unexpected state: The newest version in the log {} is older than the old version {}",
new_end_version, old_version
)));
Collaborator

weird (lack of) indentation?

Member Author

just moved up a line now

Member Author

(interesting cargo fmt doesn't seem to have an opinion?)

@zachschuermann zachschuermann removed the breaking-change Change that require a major version bump label Apr 7, 2025
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))]
pub(crate) struct LogSegment {
pub end_version: Version,
pub checkpoint_version: Option<Version>,
Member Author

This is causing the semver check to fail; I've removed the flag since this is pub only if developer-visibility is on.

Member Author

also note #810

@zachschuermann zachschuermann merged commit 0186dc4 into delta-io:main Apr 7, 2025
21 checks passed
zachschuermann added a commit to hntd187/delta-kernel-rs that referenced this pull request Apr 8, 2025
