[RFC] feat!: kernel based log replay - take 2 #3474
Conversation
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #3474      +/-   ##
==========================================
- Coverage   74.33%   74.13%   -0.21%
==========================================
  Files         150      157       +7
  Lines       45033    45765     +732
  Branches    45033    45765     +732
==========================================
+ Hits        33476    33928     +452
- Misses       9401     9616     +215
- Partials     2156     2221      +65
```

☔ View full report in Codecov by Sentry.
```rust
fn logical_files_stream(&self, predicate: Option<PredicateRef>) -> SendableRBStream {
    let scan = match self
        .inner
        .clone()
        .scan_builder()
        .with_predicate(predicate)
        .build()
    {
        Ok(scan) => scan,
        Err(err) => {
            return Box::pin(futures::stream::once(async {
                Err(DeltaTableError::KernelError(err))
            }))
        }
    };

    // TODO: which capacity to choose?
    let mut builder = RecordBatchReceiverStreamBuilder::new(100);
    let tx = builder.tx();

    let engine = self.engine.clone();
    builder.spawn_blocking(move || {
        let mut scan_iter = scan.scan_metadata_arrow(engine.as_ref())?;
        for res in scan_iter {
            let batch = res?.scan_files;
            if tx.blocking_send(Ok(batch)).is_err() {
                break;
            }
        }
        Ok(())
    });

    builder.build()
}
```
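For context, consuming the returned stream would look roughly like this (a minimal sketch using futures::TryStreamExt; `MySnapshot` and the surrounding function are hypothetical, not part of the PR):

```rust
use futures::TryStreamExt;

// Hypothetical consumer: drain the stream produced by `logical_files_stream`
// into a Vec of RecordBatches, propagating any DeltaTableError.
async fn collect_logical_files(snapshot: &MySnapshot) -> DeltaResult<Vec<RecordBatch>> {
    snapshot
        .logical_files_stream(None)
        .try_collect()
        .await
}
```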
@zachschuermann - any input / opinion on this?
```rust
/// - `Some(expr)`: Apply this expression to transform the data to match [`Scan::schema()`].
/// - `None`: No transformation is needed; the data is already in the correct logical form.
///
/// Note: This vector can be indexed by row number.
```
Is the order of this vec guaranteed?
Yes, it's an invariant that rows in the batch and entries in the vector are aligned.
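For illustration, the invariant means row indices can be used to pair the two (a sketch; `metadata` follows the field names used elsewhere in this diff, `apply_transform` is a hypothetical helper):

```rust
for (i, transform) in metadata.scan_file_transforms.iter().enumerate() {
    // Row `i` of the scan-files batch is described by `transform`
    // (Some(expr) or None) and is kept iff selection_vector[i] is true.
    if metadata.scan_files.selection_vector[i] {
        apply_transform(i, transform); // hypothetical helper
    }
}
```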
```rust
    let scan_file_transforms = metadata
        .scan_file_transforms
        .into_iter()
        .enumerate()
        .filter_map(|(i, v)| metadata.scan_files.selection_vector[i].then_some(v))
        .collect();
    let batch = ArrowEngineData::try_from_engine_data(metadata.scan_files.data)?.into();
    let scan_files = filter_record_batch(
        &batch,
        &BooleanArray::from(metadata.scan_files.selection_vector),
    )?;
    Ok(ScanMetadataArrow {
        scan_files,
        scan_file_transforms,
    })
}
```
This kernel-to-arrow conversion seems to filter down to the active add actions, right?
Exactly.
A small note would help; I had to look up the delta-kernel-rs docs to understand it ^^
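Something along these lines might be enough (a hypothetical comment, not part of the PR):

```rust
// The kernel selection vector marks which rows of the scan-files batch are
// still active after log replay (i.e. add actions without a matching remove),
// so we drop both the deselected rows and their transforms here.
```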
```rust
fn logical_files(&self, predicate: Option<PredicateRef>) -> SendableRBStream {
    if let Some(predicate) = predicate {
        self.snapshot.logical_files(Some(predicate))
    } else {
        let batch = self.files.clone();
        return Box::pin(futures::stream::once(async move { Ok(batch) }));
    }
}
```
I don't quite follow this: why, if you pass a predicate, do you filter via the lazy snapshot?
That's probably because it is not finished :).
What needs to happen here: the snapshot can be created with a predicate. If there is an existing predicate, we need to validate that the new predicate would skip all the files the current predicate skipped. This is non-trivial logic, so for now we'll likely just allow no existing predicate; this check will eventually be integrated in kernel.
So if there is one, we will be able to replay using the existing data, but only if the predicate is valid.
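A rough sketch of that interim rule (hypothetical names; the real subsumption check would eventually live in kernel):

```rust
/// Decide whether cached, predicate-filtered files can be reused for a new
/// predicate. Until kernel can prove that the new predicate skips at least
/// the files the existing predicate skipped, only an unfiltered cache is safe.
fn can_reuse_cached_files(
    existing_predicate: Option<&PredicateRef>,
    _new_predicate: Option<&PredicateRef>,
) -> bool {
    // Interim behaviour: only reuse when no predicate was applied at caching
    // time, i.e. the cached data is the complete file list.
    existing_predicate.is_none()
}
```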
```rust
let scan = snapshot.inner.clone().scan_builder().build()?;
let engine = snapshot.engine_ref().clone();
// TODO: process blocking iterator
let files: Vec<_> = scan
    .scan_metadata_from_arrow(
        engine.as_ref(),
        current,
        Box::new(std::iter::once(self.files.clone())),
        self.predicate.clone(),
    )?
    .map_ok(|s| s.scan_files)
    .try_collect()?;

self.files = concat_batches(&files[0].schema(), &files)?;
```
Does scan_builder re-use the existing state it has?
scan_metadata_from does exactly that. It will treat the existing data as if it were a snapshot as of the current version. Depending on what it finds internally, it will re-use that data. (If there is a new snapshot, it will currently do a new log replay, but this can be improved in the future.)
```rust
/// Size of the file in bytes.
pub fn size(&self) -> i64 {
    self.files
        .column(1)
```
My preference would be to use column_by_name; I find column ordering riskier than incorrect naming :P, and it's also more explicit.
Agreed, locally we are already computing the indices for the published (and validated) schema of the scan row.
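For illustration, the by-name lookup might look like this (a sketch; the "size" column name is assumed from the kernel scan-row schema and would need to be checked against the published schema mentioned above):

```rust
use arrow_array::{Int64Array, RecordBatch};

// Sum the `size` column, resolving it by name instead of by position.
fn total_size(files: &RecordBatch) -> i64 {
    files
        .column_by_name("size")
        .expect("scan row schema should expose a `size` column")
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("`size` should be an Int64 column")
        .iter()
        .flatten()
        .sum()
}
```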
```rust
// TODO: which capacity to choose?
let mut builder = RecordBatchReceiverStreamBuilder::new(100);
let tx = builder.tx();
```
Why do we use these channels and not async iterators?
I guess it's very difficult to define the channel size.
What do we mean by async iterators? Streams?
Would a stream by any other name be as fast? Streams are basically channels, which are basically async iterators. Functionally it is something that has poll_next on it; channels are typically what we call something that allows cross-task/thread streaming, which is usually why these things have to be bounded.
My recommendation would be to define an environment variable and a default, e.g.
```diff
- // TODO: which capacity to choose?
- let mut builder = RecordBatchReceiverStreamBuilder::new(100);
- let tx = builder.tx();
+ let capacity = std::env::var("RECORD_BATCH_STREAM")
+     .ok()
+     .and_then(|v| v.parse::<usize>().ok())
+     .unwrap_or(1024);
+ let mut builder = RecordBatchReceiverStreamBuilder::new(capacity);
+ let tx = builder.tx();
```
```rust
let table_url = if let Some(op_id) = operation_id {
    #[allow(deprecated)]
    log_store.transaction_url(op_id, &log_store.table_root_url())?
} else {
    log_store.table_root_url()
```
We don't really need the transaction URL for reading, to be fair.
```rust
/// Trait for types that stream [RecordBatch]
///
/// See [`SendableRecordBatchStream`] for more details.
pub trait RecordBatchStream: Stream<Item = DeltaResult<RecordBatch>> {
```
We already have a similar trait in arrow-rs; why didn't you use that?
Description
This is a redo of #3137.
Since the first attempt, kernel has evolved quite a bit, as has our codebase. We also learned a lot, particularly around some of the complexities of "nested" async evaluation that come with kernel's engine concepts.
A few things that could guide designs:
- Engine: Anything we can do using kernel's Engine abstractions, we can do without the need to have datafusion enabled, potentially significantly extending what we can offer to arrow-only users.
- async: Kernel exposes blocking iterators that only "appear to be sync". We need to figure out how to best handle these. Kernel may one day also expose APIs or utilities to more seamlessly integrate with Rust async, so we should hide these complexities from end users.

Since this will (and should) have a significant impact on large parts of the codebase, we should not even try to do a once-off switch. Rather, we propose the following migration strategy.
- Introduce a Snapshot abstraction that exposes a minimal API we require in operations planning etc.

This should allow us to incrementally "kernelize" our codebase and give us the opportunity to properly test the new snapshots. Opt-in to using new snapshots should for now be done via either feature flags or (maybe even better) runtime configuration.
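To make that step a bit more concrete, the minimal API could be a trait roughly along these lines (purely illustrative; the logical_files signature mirrors what appears in this diff, everything else is an assumption):

```rust
/// Hypothetical sketch of the minimal snapshot API used during migration.
pub trait Snapshot: Send + Sync {
    /// Table version this snapshot corresponds to.
    fn version(&self) -> u64;

    /// Stream the logical files (active add actions), optionally pruned
    /// by a predicate.
    fn logical_files(&self, predicate: Option<PredicateRef>) -> SendableRBStream;
}
```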
Still figuring out some basics, but putting it up anyway for feedback.
Major breaking changes
A lot of APIs will eventually break; below we list some of the more fundamental changes.
No more tombstones
Currently we expose tombstones as part of our state / snapshots. Since checkpoint writing now uses kernel, we really only need tombstones to plan vacuums, which is a "special", or at least table-format-specific, maintenance operation. As such, we internalise the logic for how a vacuum is planned.