[Kernel] Add support for ICT-based time travel (part 1) #4581

Merged
merged 39 commits into from
May 23, 2025

Conversation

dhruvarya-db
Collaborator

@dhruvarya-db dhruvarya-db commented May 19, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Adds initial support for performing ICT-based time travel. More specifically, this PR:

  1. Updates all code paths that rely on timestamp-based time travel such that they all create a snapshot of the latest table version first. This allows us to determine whether ICT is enabled.
  2. Adds an initial binary-search-based greatest lower bound implementation for time traveling based on ICT. A follow-up PR will make this more efficient by using file modification timestamps to narrow the search window.
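
For readers unfamiliar with the term, a greatest-lower-bound binary search could be sketched roughly as below. This is an illustration only, not the code in this PR: the class `GreatestLowerBoundSketch`, the `Match` pair, and the `timestampOf` callback are hypothetical names, and the sketch assumes timestamps never decrease as the version increases, which is what makes binary search applicable.

```java
import java.util.Optional;
import java.util.function.LongUnaryOperator;

/** Illustrative sketch only; class and method shapes are hypothetical, not the Kernel API. */
final class GreatestLowerBoundSketch {

  /** Hypothetical (version, timestamp) pair standing in for a tuple type. */
  static final class Match {
    final long version;
    final long timestamp;

    Match(long version, long timestamp) {
      this.version = version;
      this.timestamp = timestamp;
    }
  }

  /**
   * Returns the highest version in [lo, hi] whose timestamp is <= target, or empty if every
   * version in the range is newer than target. Assumes timestamps never decrease as the
   * version increases.
   */
  static Optional<Match> greatestLowerBound(
      long target, long lo, long hi, LongUnaryOperator timestampOf) {
    Optional<Match> best = Optional.empty();
    while (lo <= hi) {
      long mid = lo + (hi - lo) / 2; // midpoint without overflowing on large versions
      long ts = timestampOf.applyAsLong(mid);
      if (ts <= target) {
        best = Optional.of(new Match(mid, ts)); // candidate; a later version may also qualify
        lo = mid + 1;
      } else {
        hi = mid - 1; // mid is too new, so the answer lies strictly before it
      }
    }
    return best;
  }
}
```

In a real table each call to `timestampOf` would presumably mean reading one commit to obtain its ICT, so the number of reads grows logarithmically with the size of the version range.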

How was this patch tested?

Added test cases to DeltaHistoryManagerSuite

Does this PR introduce any user-facing changes?

No

Resolves: #4294

@dhruvarya-db dhruvarya-db changed the title [Kernel][WIP] Add support for ICT-based time travel (part 1) [Kernel] Add support for ICT-based time travel (part 1) May 20, 2025
@dhruvarya-db dhruvarya-db requested a review from scovich May 21, 2025 22:10
@dhruvarya-db
Collaborator Author

@scovich I have copied over your comments from https://github.com/delta-io/delta/pull/4483/files#diff-07b47ed7d50294a001f579299ebd0e7d2328a991babcf6ba23af3c9cacdf8cdcR182 to this PR. This PR just focuses on the binary search part. I will add the exponential search incrementally to make the reviews simpler.

if (ictEnablementCommit.getTimestamp() <= timestamp) {
  // The target commit is in the ICT range.
  long latestSnapshotTimestamp = latestSnapshot.getTimestamp(engine);
  if (latestSnapshotTimestamp <= timestamp) {
Collaborator Author

image https://github.com//pull/4483/files#r2101187752

searchResult =
    lastCommitBeforeOrAtTimestamp(commits, timestamp)
        .orElse(
            commits.get(0)); // This is only returned if canReturnEarliestCommit (see below)
Collaborator Author

image https://github.com//pull/4483/files#r2101190894

Collaborator Author

@dhruvarya-db dhruvarya-db May 21, 2025

This is due to the formatter. I can't find a way to undo this.

* @return An optional which contains a tuple containing the index and the value of the greatest
* lower bound when found, or an empty optional if not found.
*/
public static Optional<Tuple2<Long, Long>> greatestLowerBound(
Collaborator Author

image https://github.com//pull/4483/files#r2101200771

Collaborator Author

@dhruvarya-db dhruvarya-db May 21, 2025

@scovich I have updated the implementation so that it is more explicit about when None is returned. I have also added explicit test cases for greatestLowerBound

Collaborator

@scovich scovich left a comment

Very nice, thanks

Comment on lines 114 to 116
+ "based lookup. This can happen when the commit log is corrupted or when "
+ "there is a parallel operation like metadata cleanup that is deleting "
+ "commits. Please retry the query.",
Collaborator

Maybe skip the corruption mention? Metadata cleanup is by far the more likely scenario, because Delta clients don't honor the configured data retention window when resolving time travel.

Collaborator Author

Makes sense, updated.

if (enablementTimestampOpt.isPresent() && enablementVersionOpt.isPresent()) {
  return new Commit(enablementVersionOpt.get(), enablementTimestampOpt.get());
} else if (!enablementTimestampOpt.isPresent() && !enablementVersionOpt.isPresent()) {
  // This means that ICT has been enabled for the entire history.
Collaborator

@scottsand-db scottsand-db May 23, 2025

Can you help me understand why this is the case? If my table doesn't support ICT, won't these fields be empty?

Collaborator Author

In this case we have already established that ICT is enabled on the table (L209). When ICT is enabled, these fields being empty indicates that ICT has been enabled for all of the available history.
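
The case analysis in this thread can be sketched as follows. This is a hedged illustration, not the PR's code: `IctEnablementSketch`, `resolveEnablementCommit`, and the `earliestAvailableCommit` parameter are hypothetical, the precondition that ICT is already known to be enabled is taken from the reply above, and both the value returned in the "entire history" branch and the mixed-presence error branch are assumptions.

```java
import java.util.Optional;

/** Illustrative sketch only; names and the fallback behaviour are assumptions. */
final class IctEnablementSketch {

  /** Minimal stand-in for a (version, timestamp) commit record. */
  static final class Commit {
    final long version;
    final long timestamp;

    Commit(long version, long timestamp) {
      this.version = version;
      this.timestamp = timestamp;
    }
  }

  /**
   * Precondition: the caller has already checked that ICT is enabled on the table.
   * Under that precondition, two empty fields mean "ICT was on for all available history",
   * not "ICT is unsupported".
   */
  static Commit resolveEnablementCommit(
      Optional<Long> enablementVersion,
      Optional<Long> enablementTimestamp,
      Commit earliestAvailableCommit) {
    if (enablementVersion.isPresent() && enablementTimestamp.isPresent()) {
      // ICT was switched on partway through the table's history.
      return new Commit(enablementVersion.get(), enablementTimestamp.get());
    } else if (!enablementVersion.isPresent() && !enablementTimestamp.isPresent()) {
      // Assumption: treat the earliest available commit as the start of the ICT range.
      return earliestAvailableCommit;
    } else {
      // Assumption: exactly one field being set is treated as invalid table state.
      throw new IllegalStateException(
          "ICT enablement version and timestamp must be both present or both absent");
    }
  }
}
```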

* Get the version of the checkpoint, checksum or delta file. Returns an empty optional if the
* file is not a checkpoint, checksum or delta file.
*/
public static Optional<Long> getFileVersionOpt(Path path) {
Collaborator

Why this change? I don't see any usages of getFileVersionOpt. Maybe I'm just missing it?

Collaborator Author

My bad, this is not needed for part 1. Removed.

Collaborator

@scottsand-db scottsand-db left a comment

Looks great, thanks!

@vkorukanti vkorukanti merged commit fce3d84 into delta-io:master May 23, 2025
22 checks passed
Development

Successfully merging this pull request may close these issues.

[BUG] [Kernel] Time travel by timestamp does not take into account ICT
4 participants