fix: avoid overflow for large table state #3801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main    #3801      +/-   ##
==========================================
+ Coverage   74.31%   74.37%   +0.06%
==========================================
  Files         145      145
  Lines       39441    39482      +41
  Branches    39441    39482      +41
==========================================
+ Hits        29309    29365      +56
+ Misses       8729     8719      -10
+ Partials     1403     1398       -5
```
I can confirm that this does not exhaust memory on said table.

@rtyler - the PR grew a bit in size, but hopefully for good cause. We should now see that we are no longer using as much memory, since we are no longer tracking the serialised stats as part of the file data. Would you mind confirming? 😄
```rust
let mut pruned_batches = Vec::new();
let mut mask_offset = 0;

for batch in &self.snapshot.files {
    let batch_size = batch.num_rows();
    let batch_mask = &mask[mask_offset..mask_offset + batch_size];
    let batch_mask_array = BooleanArray::from(batch_mask.to_vec());
    let pruned_batch = filter_record_batch(batch, &batch_mask_array)?;
    if pruned_batch.num_rows() > 0 {
        pruned_batches.push(pruned_batch);
    }
    mask_offset += batch_size;
}

LogDataHandler::new(&pruned_batches, es.table_configuration()).statistics()
```
This was not really nice before and unfortunately got a bit less nice.
Longer term, I think we may have to decide whether we need additional skipping from DataFusion, or rely on the file skipping in delta-kernel to be selective (we have no reason to believe it would not be :)).
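The invariant the pruning loop above relies on is that the flat mask lines up with the concatenated rows of all batches. A minimal sketch of the same slicing scheme, using plain vectors in place of `RecordBatch`es (names and types here are illustrative, not from the PR):

```rust
// Apply one flat boolean mask across a sequence of "batches", slicing the
// mask by each batch's length and keeping only non-empty results.
fn prune(batches: &[Vec<i64>], mask: &[bool]) -> Vec<Vec<i64>> {
    // The mask must cover exactly the total number of rows.
    assert_eq!(mask.len(), batches.iter().map(|b| b.len()).sum::<usize>());
    let mut out = Vec::new();
    let mut offset = 0;
    for batch in batches {
        let m = &mask[offset..offset + batch.len()];
        let kept: Vec<i64> = batch
            .iter()
            .zip(m)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        if !kept.is_empty() {
            out.push(kept);
        }
        offset += batch.len();
    }
    out
}

fn main() {
    let batches = vec![vec![1, 2, 3], vec![4, 5]];
    let mask = vec![true, false, true, true, false];
    assert_eq!(prune(&batches, &mask), vec![vec![1, 3], vec![4]]);
}
```

Keeping the batches separate (rather than concatenating them first) is what avoids building one oversized array.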
```diff
  let stats_schema = self.stats_schema()?;
  let stats_schema: ArrowSchema = stats_schema.as_ref().try_into_arrow()?;
- fields.push(Arc::new(Field::new(
+ fields[stats_idx] = Arc::new(Field::new(
```
We are now replacing the existing stats field with the parsed stats, rather than appending the parsed ones as an additional field.
While not entirely clean yet, we aim to isolate processing of the data we get from kernel's log replay in this module. Essentially we need to revert what we do when receiving data before we feed it back into a scan / replay.
```rust
fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
    let this = self.project();
    match this.stream.poll_next(cx) {
        Poll::Ready(Some(Ok(batch))) => match parse_stats_column(&this.snapshot, &batch) {
            Ok(batch) => Poll::Ready(Some(Ok(batch))),
            Err(err) => Poll::Ready(Some(Err(err))),
        },
        other => other,
    }
}
```
It seems work has started to support async/streams directly from kernel. As such, we start to move some processing onto streams rather than doing it in iterator world.
This should also align well when we work on the DataFusion integrations, since there we find the same model.
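The `poll_next` adapter above is the stream-world version of mapping a fallible transformation over each item while passing errors through unchanged. In iterator world the same shape looks like this (the `parse` closure here is a stand-in for `parse_stats_column`, not code from the PR):

```rust
// Map a fallible parse over a sequence of fallible items: Ok values get
// transformed (possibly into an Err), existing Errs are propagated as-is.
fn main() {
    let raw: Vec<Result<&str, String>> =
        vec![Ok("42"), Ok("7"), Err("io error".to_string())];
    let parsed: Vec<Result<i64, String>> = raw
        .into_iter()
        .map(|item| item.and_then(|s| s.parse::<i64>().map_err(|e| e.to_string())))
        .collect();
    assert_eq!(parsed[0], Ok(42));
    assert_eq!(parsed[1], Ok(7));
    assert!(parsed[2].is_err());
}
```

The stream version only adds the `Poll` wrapping; the error-propagation logic is identical.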
[citation needed] 😆
There's been a lot of talk that I have heard, but I haven't seen any concrete changes. Do you have some to link?
```rust
pub fn stats(&self) -> Option<String> {
    let stats = self.stats_parsed()?.slice(self.index, 1);
    let value = to_json(&stats)
        .ok()
        .map(|arr| arr.as_string::<i32>().value(0).to_string());
    value.and_then(|v| (!v.is_empty()).then_some(v))
}
```
We now need to serialise individual fields to get the JSON stats (for add actions), since we are no longer carrying the serialised stats column.
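To illustrate what "serialising individual fields" means here: the parsed stats columns have to be assembled back into the JSON string an Add action carries. A hand-rolled sketch (field names `numRecords`/`minValues`/`maxValues` follow the Delta protocol; the values and the single-column shape are made up for illustration):

```rust
// Re-serialise per-field stats values into one JSON stats string.
fn main() {
    let num_records = 1024i64; // illustrative values, not real table stats
    let min_v = 3i64;
    let max_v = 99i64;
    let stats = format!(
        "{{\"numRecords\":{num_records},\"minValues\":{{\"v\":{min_v}}},\"maxValues\":{{\"v\":{max_v}}}}}"
    );
    assert_eq!(
        stats,
        "{\"numRecords\":1024,\"minValues\":{\"v\":3},\"maxValues\":{\"v\":99}}"
    );
}
```

In the actual PR this is done via Arrow's JSON conversion over a one-row slice, as shown in the `stats()` snippet above.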
```diff
  ///
  /// A stream of [`LogicalFileView`] objects.
- pub fn files(
+ pub fn file_views(
```
Renamed this for consistency, since it is returning file views after all.
```rust
    .map(|file| evaluator.evaluate_arrow(file.clone()))
    .collect::<Result<Vec<_>, _>>()?;

let result = concat_batches(results[0].schema_ref(), &results)?;
```
We still concatenate the add actions table. In a follow-up we should also move this to a stream and expose that via record batch readers in Python.
Opened #3811 to track this.
Signed-off-by: Robert Pack <[email protected]>
```toml
itertools = "0.14"
parking_lot = "0.12"
percent-encoding = "2"
pin-project-lite = "^0.2.7"
```
I'm not sure why this dependency crept back in, I'll just have to remove it again 😆
Description
When loading the active add files into memory, we concatenate the batches read from the log. For very large logs, we may exceed the admissible size of an individual Arrow array; this is particularly likely with large stats fields.
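The limit in question comes from Arrow's default `Utf8` layout, which indexes its value buffer with `i32` offsets, capping a single string column at `i32::MAX` (~2 GiB) of total bytes. Some back-of-the-envelope arithmetic, assuming an illustrative ~1 MiB of serialized stats per file:

```rust
// Estimate how many files fit before a concatenated Utf8 stats column
// overflows its i32 offsets. The per-file size is an assumed figure.
fn main() {
    let max_bytes = i32::MAX as u64; // Utf8 offset limit: 2_147_483_647 bytes
    let stats_bytes_per_file: u64 = 1 << 20; // assumed 1 MiB of stats JSON per file
    let files_until_overflow = max_bytes / stats_bytes_per_file;
    assert_eq!(files_until_overflow, 2047);
    println!("overflow after ~{files_until_overflow} files");
}
```

So a table with only a couple thousand large add actions can already hit the cap once everything is concatenated into one batch, which is why keeping the batches separate matters.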
@rtyler - mind checking out if this fixes the issue we see on large tables? And, do we have an issue for this?
closes: #3767