feat: Add row tracking support #1375

nicklan · 2025-10-07T17:08:30Z

What changes are proposed in this pull request?

Add support for row id columns
Includes arguments for scanning examples to request them
Removes StaticReplace as a transform. The original intention was to use it for row ids, but we can't know the expression statically because the base row id changes for each file.
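
To illustrate why a static transform cannot work here, the following is a minimal sketch, assuming the row id expression has the shape "materialized column if present, otherwise base row id plus row index". The `Expr` type and the `row_id_expr` function are hypothetical stand-ins for illustration, not the kernel's actual types.

```rust
/// Hypothetical, simplified expression tree (an assumption for illustration;
/// the kernel's real expression type is richer).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Coalesce(Vec<Expr>),
}

/// Builds the row id expression for ONE data file.
/// `base_row_id` is a per-file value, so the resulting expression differs
/// from file to file and cannot be computed once for the whole scan.
fn row_id_expr(materialized_col: &str, row_index_col: &str, base_row_id: i64) -> Expr {
    Expr::Coalesce(vec![
        // Prefer the materialized row id when the row has one.
        Expr::Column(materialized_col.to_string()),
        // Otherwise fall back to base row id + position of the row in the file.
        Expr::Add(
            Box::new(Expr::Literal(base_row_id)),
            Box::new(Expr::Column(row_index_col.to_string())),
        ),
    ])
}
```

Two files with different base row ids yield two different expressions, which is the reason the transform has to be resolved per file rather than once up front.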

How was this change tested?

Have run on tables with row ids
unit tests

To reviewers: I can add a new table with rowIds enabled to do an e2e test, but I'm not sure we want to keep bloating the tables we check into the repo.

had to comment out some tests

codecov · 2025-10-07T17:10:28Z

Codecov Report

❌ Patch coverage is 96.47651% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.78%. Comparing base (9a9f28a) to head (dafe973).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/scan/state_info.rs	95.59%	14 Missing and 3 partials ⚠️
kernel/src/scan/log_replay.rs	96.29%	1 Missing and 2 partials ⚠️
kernel/src/transforms.rs	99.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1375      +/-   ##
==========================================
+ Coverage   84.61%   84.78%   +0.17%     
==========================================
  Files         117      118       +1     
  Lines       29936    30286     +350     
  Branches    29936    30286     +350     
==========================================
+ Hits        25330    25678     +348     
  Misses       3382     3382              
- Partials     1224     1226       +2


kernel/src/transforms.rs

OussamaSaoudi · 2025-10-10T00:01:59Z

kernel/src/scan/mod.rs

+                                .metadata()
+                                .configuration()
+                                .get("delta.rowTracking.materializedRowIdColumnName")
+                                .ok_or(Error::generic("No delta.rowTracking.materializedRowIdColumnName key found in metadata configuration"))?;


Are we not supporting generated row ids on purpose? From the protocol:

delta.rowTracking.materializedRowIdColumnName key in the configuration of the table's metaData action. This column may contain null values meaning that the corresponding row has no materialized Row ID.

see my comment below, we handle the null case
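
The null case mentioned here can be sketched as follows, assuming a null materialized value falls back to the file's base row id plus the row's position in the file. The function name `resolve_row_ids` is hypothetical, and plain slices stand in for engine data.

```rust
/// Resolves row ids for the rows of a single file.
/// `materialized[i]` is the value of the materialized row id column for row `i`,
/// or `None` when that row has no materialized row id.
/// (Hypothetical helper for illustration; not part of the kernel API.)
fn resolve_row_ids(materialized: &[Option<i64>], base_row_id: i64) -> Vec<i64> {
    materialized
        .iter()
        .enumerate()
        // A null falls back to the default id: base row id + row index.
        .map(|(row_index, m)| m.unwrap_or(base_row_id + row_index as i64))
        .collect()
}
```

For example, `resolve_row_ids(&[None, Some(7), None], 100)` returns `[100, 7, 102]`.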

kernel/src/transforms.rs

kernel/src/scan/mod.rs

Broken, checkpoint before re-arch

nicklan · 2025-10-11T00:01:34Z

kernel/src/scan/mod.rs

    log_replay::SCAN_ROW_SCHEMA.clone()
 }

-/// All the state needed to process a scan.


all moved into state_info.rs

DrakeLin

looking good, small nits

DrakeLin · 2025-10-14T21:07:37Z

kernel/src/scan/field_classifiers.rs

-pub(crate) struct ScanTransformFieldClassifier;
-impl TransformFieldClassifier for ScanTransformFieldClassifier {
+// Empty classifier, always returns None
+impl TransformFieldClassifier for () {


We probably want to update the TransformFieldClassifier description to describe its optional nature now.

Not quite sure where you mean to update? I think the docs already say it's optional?

/// Trait for classifying fields during StateInfo construction.
/// Allows different scan types (regular, CDF) to customize field handling.
pub(crate) trait TransformFieldClassifier {

^ Update it to clarify that general field handling happens in StateInfo?
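
A minimal sketch of the pattern under discussion, assuming a simplified trait with a single method: the unit type `()` acts as the empty classifier, and the caller falls back to general handling whenever the classifier returns `None`. `FieldClassifier`, `FieldTransform`, `ChangeTypeClassifier`, and `handle_field` are hypothetical names, not the kernel's definitions.

```rust
/// Hypothetical stand-in for a scan-specific transform decision.
#[derive(Debug, PartialEq)]
enum FieldTransform {
    /// A column the scan type generates itself (e.g. a CDF column).
    Generated(String),
}

/// Simplified version of the trait: `None` means "no special handling,
/// let the caller apply the general logic".
trait FieldClassifier {
    fn classify(&self, field_name: &str) -> Option<FieldTransform>;
}

/// Empty classifier: never overrides the general handling.
impl FieldClassifier for () {
    fn classify(&self, _field_name: &str) -> Option<FieldTransform> {
        None
    }
}

/// Hypothetical CDF-like classifier that claims one special column.
struct ChangeTypeClassifier;

impl FieldClassifier for ChangeTypeClassifier {
    fn classify(&self, field_name: &str) -> Option<FieldTransform> {
        (field_name == "_change_type")
            .then(|| FieldTransform::Generated(field_name.to_string()))
    }
}

/// The caller owns the general handling; the classifier only customizes it.
fn handle_field<C: FieldClassifier>(classifier: &C, field_name: &str) -> String {
    match classifier.classify(field_name) {
        Some(FieldTransform::Generated(name)) => format!("generated:{name}"),
        None => format!("default:{field_name}"),
    }
}
```

With this shape, a regular scan can pass `()` and a CDF scan can pass its own classifier, without the regular path needing a dedicated struct.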

kernel/src/scan/state_info.rs

DrakeLin · 2025-10-14T21:17:20Z

kernel/src/scan/state_info.rs

+        predicate: Option<PredicateRef>,
+        classifier: C,
+    ) -> DeltaResult<Self> {
+        let partition_columns = table_configuration.metadata().partition_columns();


Is it necessary to pull these out into individual variables?

Yeah, I removed partition_columns since we only use that once. But we use column_mapping_mode twice and it's net more lines of code to not pull it out since it's more characters to get to it.

Actually with a refactor I put it back in because it ends up being fewer overall lines

DrakeLin · 2025-10-14T21:33:36Z

kernel/src/scan/state_info.rs

+                        if table_configuration.table_properties().enable_row_tracking != Some(true)
+                        {
+                            return Err(Error::unsupported(
+                                "Row ids are not enabled on this table",
+                            ));
+                        }


Should we do this check on line 59 above where we run through the metadata columns and set
if let Some(MetadataColumnSpec::RowIndex) = metadata_column.get_metadata_column_spec() ?

Also to clarify, can we have row index without row tracking support?

We can have row-index without row-tracking yeah. Row index is just a parquet reader feature to return the index of the row in the file

DrakeLin · 2025-10-14T21:34:26Z

kernel/src/scan/state_info.rs

+                            .get("delta.rowTracking.materializedRowIdColumnName")
+                            .ok_or(Error::generic("No delta.rowTracking.materializedRowIdColumnName key found in metadata configuration"))?;
+
+                        // we can `take` as we should only have one RowId col


Should we validate in the loop in line 58?

Validate which? The config key? We need to get it out here no matter what so seems fine to validate here or we're looking it up twice (or storing it in a variable I guess)

Yep, I think we should have a function validate_metadata_columns that does the following:

Ensures materializedRowIdColumnName is present and extracts it.

Ensures that row tracking is enabled if metadata columns are present.

Ensures partition columns don't conflict with metadata columns.

EDIT: the goal is to keep this inner loop as simple as possible.

Ah also extract any state that's needed during the loop^

Mmm, I see what you're saying. But as we add more metadata cols that might need to extract more information and use it (like row_id_col), the return from that function is going to be a mess of optionals, which I don't love. That's why I chose to keep the validation and state extraction here.
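
The trade-off in this thread can be sketched roughly as below. This is an illustration under stated assumptions, not the function that was eventually factored out: `MetadataKind`, `MetadataInfo`, the parameter list, and the `String` error type are all hypothetical. Only the config key and the two error messages are taken from the diff.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical subset of the metadata column kinds discussed in the review.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MetadataKind {
    RowIndex,
    RowId,
}

/// State extracted up front so the per-field loop stays simple.
/// Note how each new metadata column tends to add another `Option` here.
#[derive(Debug, Default, PartialEq)]
struct MetadataInfo {
    row_index_col: Option<String>,
    materialized_row_id_col: Option<String>,
}

fn validate_metadata_columns(
    requested: &[(&str, MetadataKind)],
    partition_columns: &HashSet<&str>,
    table_config: &HashMap<&str, &str>,
    row_tracking_enabled: bool,
) -> Result<MetadataInfo, String> {
    let mut info = MetadataInfo::default();
    for (name, kind) in requested {
        // Metadata column names must not collide with partition columns.
        if partition_columns.contains(name) {
            return Err(format!(
                "Metadata column names must not match partition columns: {name}"
            ));
        }
        match kind {
            // Row index is a parquet reader feature; no table feature required.
            MetadataKind::RowIndex => info.row_index_col = Some(name.to_string()),
            MetadataKind::RowId => {
                if !row_tracking_enabled {
                    return Err("Row ids are not enabled on this table".to_string());
                }
                let key = "delta.rowTracking.materializedRowIdColumnName";
                let col = table_config
                    .get(key)
                    .ok_or_else(|| format!("No {key} key found in metadata configuration"))?;
                info.materialized_row_id_col = Some(col.to_string());
            }
        }
    }
    Ok(info)
}
```

The sketch keeps all checks in one place, at the cost of a return struct that grows one optional field per metadata column, which is the concern raised above.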

OussamaSaoudi

flushing comments.

OussamaSaoudi · 2025-10-16T16:32:02Z

kernel/examples/common/src/lib.rs

                    )))
            });
-            Schema::try_from_results(selected_fields).map(Arc::new)
+            let schema = Schema::try_from_results(selected_fields);


This all seems like it could be simplified like so:

// Use table schema by default
let mut table_schema = snapshot.schema();

// Project columns
if let Some(columns) = args.columns.as_ref() {
    let cols: Vec<&str> = columns.split(",").map(str::trim).collect();
    table_schema = table_schema.project_as_struct(&cols);
}

// Add row index column
if args.with_row_index {
    table_schema.add_metadata_column("_metadata.row_index", MetadataColumnSpec::RowIndex)
}

// Add row id column
if args.with_row_id {
    table_schema.add_metadata_column("_metadata.row_id", MetadataColumnSpec::RowId)
}

Yeah good call. I should have just given up on keeping the schema Optional :)

OussamaSaoudi · 2025-10-16T16:36:43Z

kernel/src/scan/log_replay.rs

+    let file_constant_values = StructType::new_unchecked([
+        StructField::nullable("partitionValues", partition_values),
+        StructField::nullable("baseRowId", DataType::LONG),
+    ]);


Note: my async prototype would also let us avoid a lot of this static schema munging, since the schema can be inferred from a plan.

OussamaSaoudi · 2025-10-16T16:38:08Z

kernel/src/table_changes/physical_to_logical.rs

+        transform_spec,
+        partition_values,
+        physical_schema,
+        None, /* base_row_id */


Can you add a TODO issue to get the base_row_id for CDF?

@DrakeLin this also indicates that we may want a is_cdf_supported(TableFeature::RowTracking)

Does CDF support row tracking? what are the semantics in that case?

ah you're right:
https://docs.databricks.com/aws/en/delta/row-tracking#limitations

I think this is a bit odd tho. It's like not being able to access your primary key during a CDC. In any case, let's block now. Good callout 👍

Can you add a note to get_cdf_transform_expr that makes it clear that we do not support reading row tracking info during CDF and that this is a known delta limitation? I want to avoid losing this context :)

kernel/src/scan/log_replay.rs

kernel/src/transforms.rs

OussamaSaoudi · 2025-10-16T17:01:26Z

kernel/src/transforms.rs

+        assert!(get_transform_expr(
+            &transform_spec,
+            metadata_values,
+            &physical_schema,
+            None, /* base_row_id */
+        )
+        .is_err());


pls use assert_result_error_with_message so we ensure we get the expected error
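
The point of this request is that a bare `is_err()` check passes for any failure, including an unrelated one. A minimal sketch of a message-checking assertion follows. `assert_err_contains` is a hypothetical stand-in (the real `assert_result_error_with_message` helper may have a different signature), and `row_id_transform` with its error text is invented purely to have something to test.

```rust
use std::fmt::{Debug, Display};

/// Hypothetical stand-in for a helper like `assert_result_error_with_message`:
/// fails unless `res` is an error whose message contains `expected`.
fn assert_err_contains<T: Debug, E: Display>(res: Result<T, E>, expected: &str) {
    match res {
        Ok(value) => panic!(
            "Expected error containing '{}', but got Ok({:?})",
            expected, value
        ),
        Err(e) => {
            let msg = e.to_string();
            assert!(
                msg.contains(expected),
                "Expected '{}' to contain '{}'",
                msg,
                expected
            );
        }
    }
}

/// Hypothetical function under test: a row id transform needs a base row id.
/// The error text is made up for this sketch.
fn row_id_transform(base_row_id: Option<i64>) -> Result<i64, String> {
    base_row_id.ok_or_else(|| "row id requested but no base row id available".to_string())
}
```

A call such as `assert_err_contains(row_id_transform(None), "no base row id")` then fails loudly if the function starts erroring for a different reason.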

OussamaSaoudi · 2025-10-16T18:32:34Z

kernel/src/scan/state_info.rs

+                if logical_field.is_metadata_column() {
+                    return Err(Error::Schema(format!(
+                        "Metadata column names must not match partition columns: {}",
+                        logical_field.name()
+                    )));
+                }
+                // push the transform for this partition column


This can be done as a check above in for metadata_column in logical_schema.metadata_columns()

OussamaSaoudi · 2025-10-16T18:34:19Z

kernel/src/scan/state_info.rs

+                        if table_configuration.table_properties().enable_row_tracking != Some(true)
+                        {
+                            return Err(Error::unsupported(
+                                "Row ids are not enabled on this table",
+                            ));
+                        }


This check can be done above. We do the following:

for metadata_column in logical_schema.metadata_columns() { if let Some(MetadataColumnSpec::RowIndex) = metadata_column.get_metadata_column_spec() { selected_row_index_col_name = Some(metadata_column.name().to_string()); } metadata_field_names.insert(metadata_column.name()); }

Let's factor that out and do all our row tracking checks (including partition columns).

I want to make this inner loop very simple and clear.

OussamaSaoudi · 2025-10-16T18:38:51Z

kernel/src/scan/state_info.rs

+                            .get("delta.rowTracking.materializedRowIdColumnName")
+                            .ok_or(Error::generic("No delta.rowTracking.materializedRowIdColumnName key found in metadata configuration"))?;
+
+                        // we can `take` as we should only have one RowId col


Yep, I think we should have a function validate_metadata_columns that does the following:

Ensures materializedRowIdColumnName is present and extracts it.

Ensures that row tracking is enabled if metadata columns are present.

Ensures partition columns don't conflict with metadata columns.

EDIT: the goal is to keep this inner loop as simple as possible.

kernel/src/scan/state_info.rs

DrakeLin

lgtm

DrakeLin · 2025-10-17T22:26:56Z

kernel/src/scan/field_classifiers.rs

-pub(crate) struct ScanTransformFieldClassifier;
-impl TransformFieldClassifier for ScanTransformFieldClassifier {
+// Empty classifier, always returns None
+impl TransformFieldClassifier for () {


/// Trait for classifying fields during StateInfo construction.
/// Allows different scan types (regular, CDF) to customize field handling.
pub(crate) trait TransformFieldClassifier {

^ Update it to clarify that general field handling happens in StateInfo?

OussamaSaoudi

Looks good, just various cleanups and small comments.

While the unit tests are good, I would advocate that we don't publish any kernel releases until we add some integration tests that validate row_index/row_id.

OussamaSaoudi · 2025-10-22T00:31:33Z

kernel/tests/read.rs

+        let scan = snapshot.scan_builder().with_schema(schema).build();
+        match scan {
+            Err(e) => {
+                let error_msg = e.to_string();
+                assert!(
+                    error_msg.contains(error_text),
+                    "Expected {error_msg} to contain {error_text}"
+                );
+            }
+            Ok(_) => {
+                panic!(
+                    "Expected error for {} metadata column, but scan succeeded",
+                    error_text
+                );
            }
        }


Nit: I think there's an unwrap_err that would shorten this

nice, that's useful
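
For reference, `Result::unwrap_err` from the standard library returns the error and panics (printing the `Ok` value) if the call unexpectedly succeeded, which replaces the explicit `match` with a panic arm. A minimal sketch, where `build_scan` is a hypothetical stand-in for the scan builder and not the kernel API:

```rust
/// Hypothetical stand-in for building a scan that rejects an unsupported request.
fn build_scan(row_tracking_enabled: bool) -> Result<&'static str, String> {
    if row_tracking_enabled {
        Ok("scan")
    } else {
        Err("Unsupported: Row ids are not enabled on this table".to_string())
    }
}

/// Returns the error message of a scan that is expected to fail.
/// `unwrap_err` panics if `build_scan` returned `Ok`, so no manual
/// `Ok(_) => panic!(...)` arm is needed.
fn scan_error_message(row_tracking_enabled: bool) -> String {
    build_scan(row_tracking_enabled).unwrap_err()
}
```

A test can then assert on the returned message directly, for example with `assert!(msg.contains(error_text))`.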

OussamaSaoudi · 2025-10-22T01:04:22Z

kernel/src/table_changes/physical_to_logical.rs

+        transform_spec,
+        partition_values,
+        physical_schema,
+        None, /* base_row_id */


ah you're right:
https://docs.databricks.com/aws/en/delta/row-tracking#limitations

I think this is a bit odd tho. It's like not being able to access your primary key during a CDC. In any case, let's block now. Good callout 👍

OussamaSaoudi · 2025-10-22T02:39:50Z

kernel/src/scan/state_info.rs

+        match get_state_info(
+            schema.clone(),
+            vec!["part_col".to_string()],
+            None,
+            HashMap::new(),
+            vec![("part_col", MetadataColumnSpec::RowId)],
+        ) {
+            Ok(_) => {
+                panic!("Should not have succeeded generating state info with invalid config")
+            }
+            Err(e) => {
+                assert_eq!(e.to_string(),
+                           "Schema error: Metadata column names must not match partition columns: part_col")
+            }
+        }


Suggested change

-        match get_state_info(
-            schema.clone(),
-            vec!["part_col".to_string()],
-            None,
-            HashMap::new(),
-            vec![("part_col", MetadataColumnSpec::RowId)],
-        ) {
-            Ok(_) => {
-                panic!("Should not have succeeded generating state info with invalid config")
-            }
-            Err(e) => {
-                assert_eq!(e.to_string(),
-                           "Schema error: Metadata column names must not match partition columns: part_col")
-            }
-        }
+        let res = get_state_info(
+            schema.clone(),
+            vec!["part_col".to_string()],
+            None,
+            HashMap::new(),
+            vec![("part_col", MetadataColumnSpec::RowId)],
+        );
+        assert_result_error_with_message(
+            res,
+            "Schema error: Metadata column names must not match partition columns: part_col"
+        );

OussamaSaoudi · 2025-10-22T02:40:47Z

kernel/src/scan/state_info.rs

+        match get_state_info(
+            schema.clone(),
+            vec![],
+            None,
+            get_string_map(&[("delta.columnMapping.mode", "name")]),
+            vec![("other", MetadataColumnSpec::RowIndex)],
+        ) {
+            Ok(_) => {
+                panic!("Should not have succeeded generating state info with invalid config")
+            }
+            Err(e) => {
+                assert_eq!(e.to_string(),
+                           "Schema error: Metadata column names must not match physical columns, but logical column 'id' has physical name 'other'");
+            }
+        }
+    }


Suggested change

-        match get_state_info(
-            schema.clone(),
-            vec![],
-            None,
-            get_string_map(&[("delta.columnMapping.mode", "name")]),
-            vec![("other", MetadataColumnSpec::RowIndex)],
-        ) {
-            Ok(_) => {
-                panic!("Should not have succeeded generating state info with invalid config")
-            }
-            Err(e) => {
-                assert_eq!(e.to_string(),
-                           "Schema error: Metadata column names must not match physical columns, but logical column 'id' has physical name 'other'");
-            }
-        }
-    }
+        let res = get_state_info(
+            schema.clone(),
+            vec![],
+            None,
+            get_string_map(&[("delta.columnMapping.mode", "name")]),
+            vec![("other", MetadataColumnSpec::RowIndex)],
+        );
+        assert_result_error_with_message(
+            res,
+            "Schema error: Metadata column names must not match physical columns, but logical column 'id' has physical name 'other'"
+        );
+    }

OussamaSaoudi · 2025-10-22T02:42:17Z

kernel/src/scan/state_info.rs

+            match get_state_info(schema.clone(), vec![], None, metadata_config, metadata_cols) {
+                Ok(_) => {
+                    panic!("Should not have succeeded generating state info with invalid config")
+                }
+                Err(e) => {
+                    assert_eq!(
+                        e.to_string(),
+                        expected_error,
+                    )
+                }
+            }
+        }


Suggested change

-            match get_state_info(schema.clone(), vec![], None, metadata_config, metadata_cols) {
-                Ok(_) => {
-                    panic!("Should not have succeeded generating state info with invalid config")
-                }
-                Err(e) => {
-                    assert_eq!(
-                        e.to_string(),
-                        expected_error,
-                    )
-                }
-            }
-        }
+            let res = get_state_info(schema.clone(), vec![], None, metadata_config, metadata_cols);
+            assert_result_error_with_message(res, expected_error);
+        }

OussamaSaoudi · 2025-10-22T02:50:49Z

kernel/src/scan/state_info.rs

+    /// What are the names of the requested metadata fields
+    metadata_field_names: HashSet<&'a String>,
+    /// The name of the column that's selecting row indexes if that's been requested or None if they
+    /// are not requested .  We remember this if it's been requested explicitly. this is so we can


Suggested change

-    /// are not requested . We remember this if it's been requested explicitly. this is so we can
+    /// are not requested. We remember this if it's been requested explicitly. This is so we can

OussamaSaoudi · 2025-10-22T02:55:54Z

kernel/src/table_changes/physical_to_logical.rs

+        transform_spec,
+        partition_values,
+        physical_schema,
+        None, /* base_row_id */


Can you add a note to get_cdf_transform_expr that makes it clear that we do not support reading row tracking info during CDF and that this is a known delta limitation? I want to avoid losing this context :)

nicklan · 2025-10-22T18:42:02Z

While the unit tests are good, I would advocate that we don't publish any kernel releases until we add some integration tests that validate row_index/row_id.

Thanks, added #1417 to track

nicklan added 5 commits October 3, 2025 15:18

working! just need to add back a test

7aa1d7f

Merge branch 'main' into row-tracking-take-1

d263f2a

had to comment out some tests

cleanup, add back tests

397af57

handle selecting row index and row id

aa68b80

initial tests

a0f11a6

github-actions bot assigned nicklan Oct 7, 2025

nicklan added 7 commits October 7, 2025 11:23

fix clippy

4d27d6d

Merge branch 'main' into row-tracking-take-1

e1931dd

finish up state info tests

723ab1a

clippy

bffa4e8

add one transform test

515ee7c

Add log_replay transform test

be7d60d

Merge branch 'main' into row-tracking-take-1

975d2b6

nicklan marked this pull request as ready for review October 8, 2025 23:37

fmt

6e96dc2

nicklan requested review from DrakeLin, OussamaSaoudi and scovich October 8, 2025 23:38

OussamaSaoudi reviewed Oct 10, 2025

nicklan added 9 commits October 10, 2025 14:37

Merge branch 'main' into row-tracking-take-1

e46cb15

Broken, checkpoint before re-arch

move StateInfo into its own module

b91e50d

working, needs simplification

798abbb

cleanup

2f1cba7

cleanup

693ab5e

add some more tests

8ef9487

Merge branch 'main' into row-tracking-take-1

07b84af

unneeded mod path

d707a5f

address comment

fa9c6b1

nicklan requested a review from OussamaSaoudi October 11, 2025 00:06

remove unneeded

0d57b88

nicklan commented Oct 11, 2025

more coverage

6100e6f

DrakeLin reviewed Oct 14, 2025

nicklan requested review from DrakeLin and removed request for scovich October 14, 2025 23:20

OussamaSaoudi reviewed Oct 16, 2025

nicklan added 8 commits October 16, 2025 17:38

address comment

cae7eaf

Merge branch 'main' into row-tracking-take-1

3a43190

simplify schema logic

1e36f00

better name

2822bb8

move partition col check out of loop

a4cbedc

consolidate assertions on the row id transform

8f25b5e

comments

194cb98

factor out validate_metadata_columns

8f0ff20

nicklan requested a review from OussamaSaoudi October 17, 2025 21:18

DrakeLin approved these changes Oct 20, 2025

nicklan added 2 commits October 21, 2025 16:58

Merge branch 'main' into row-tracking-take-1

e451940

update comment

502a967

nicklan force-pushed the row-tracking-take-1 branch from a1d2a98 to 502a967 Compare October 22, 2025 00:10

OussamaSaoudi approved these changes Oct 22, 2025

nicklan added 2 commits October 22, 2025 11:38

address final comments

6c744bc

Merge branch 'main' into row-tracking-take-1

dafe973

nicklan changed the title ~~Add row tracking support~~ feat: Add row tracking support Oct 22, 2025

nicklan merged commit 3efdae7 into delta-io:main Oct 22, 2025
22 checks passed
