@KaiqiJinWow KaiqiJinWow commented May 20, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR unblocks the RowTracking feature in Kernel. It includes:

  1. Support the client passing rowIdHighWatermark when creating the transaction. Kernel checks that no row ID assigned in the transaction exceeds the provided high watermark, and then sets it as the row ID high watermark for this txn.
  2. Allow setting the table properties delta.rowTracking.materializedRowIdColumnName and delta.rowTracking.materializedRowCommitVersionColumnName.
  3. Allow passing baseRowId and defaultRowCommitVersion when generating IcebergCompatWriterV1AddAction and IcebergCompatWriterV1RemoveAction.
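
To illustrate item 1, here is a minimal sketch of the kind of check described. It is not the code in this PR: the class and method names are hypothetical, and it assumes row IDs are assigned contiguously per file, starting right after the previous high watermark.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names come from Kernel.
final class RowIdHighWatermarkSketch {

  /**
   * Assigns base row IDs to the added files in order and returns the high
   * watermark to record in the transaction.
   *
   * @param prevHighWatermark highest row ID assigned before this transaction
   * @param recordCounts number of records in each added file
   * @param providedHighWatermark optional watermark supplied by the caller
   */
  static long resolveHighWatermark(
      long prevHighWatermark,
      List<Long> recordCounts,
      Optional<Long> providedHighWatermark) {
    long computed = prevHighWatermark;
    for (long numRecords : recordCounts) {
      long baseRowId = computed + 1; // first row ID of this file
      computed = baseRowId + numRecords - 1; // last row ID of this file
    }
    if (providedHighWatermark.isPresent()) {
      long provided = providedHighWatermark.get();
      // Assumption: a provided watermark must cover every row ID assigned above.
      if (provided < computed) {
        throw new IllegalArgumentException(
            "Provided row ID high watermark " + provided
                + " is lower than the highest assigned row ID " + computed);
      }
      return provided;
    }
    return computed;
  }
}
```

With a previous watermark of 9 and two files of 5 and 3 records, the files get base row IDs 10 and 15, so any provided watermark below 17 is rejected in this sketch.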

How was this patch tested?

Existing and newly added unit tests.

Does this PR introduce any user-facing changes?

@KaiqiJinWow KaiqiJinWow changed the title [WIP][Kernel] Support enabling RowTracking [Kernel] Support enabling RowTracking May 20, 2025
@KaiqiJinWow KaiqiJinWow self-assigned this May 20, 2025
@KaiqiJinWow KaiqiJinWow requested review from allisonport-db and vkorukanti and removed request for allisonport-db and vkorukanti May 20, 2025 22:45
@KaiqiJinWow KaiqiJinWow changed the title [Kernel] Support enabling RowTracking [WIP][Kernel] Support enabling RowTracking May 21, 2025
  boolean dataChange,
- Map<String, String> tags) {
+ Map<String, String> tags,
+ Optional<Long> baseRowId,
Collaborator Author

This needs to be updated once IcebergCompatWriterV3 is introduced. We should only allow passing this row tracking information in generateIcebergCompatWriterV3AddAction.

@KaiqiJinWow KaiqiJinWow force-pushed the update_kernel_row_id branch from ed1fe1a to d8dc4e4 Compare June 10, 2025 17:56
@KaiqiJinWow KaiqiJinWow force-pushed the update_kernel_row_id branch from d8dc4e4 to dc802e7 Compare June 10, 2025 18:14
Contributor

@lzlfred lzlfred left a comment

this PR only

  public static final MaterializedRowTrackingColumn ROW_ID =
  new MaterializedRowTrackingColumn(
- TableConfig.MATERIALIZED_ROW_ID_COLUMN_NAME, "_row-id-col-");
+ TableConfig.MATERIALIZED_ROW_ID_COLUMN_NAME, "_row-id-col-", 2147483540L);
Contributor

Please add a reference to the Iceberg spec here.

stat ->
fieldMap.put(
RemoveFile.FULL_SCHEMA.indexOf("stats"), stat.serializeAsJson(physicalSchema)));
baseRowId.ifPresent(id -> fieldMap.put(RemoveFile.FULL_SCHEMA.indexOf("baseRowId"), id));
Contributor

Do we have a counterpart of this for creating an AddFile?

- TableConfig.MATERIALIZED_ROW_COMMIT_VERSION_COLUMN_NAME, "_row-commit-version-col-");
+ TableConfig.MATERIALIZED_ROW_COMMIT_VERSION_COLUMN_NAME,
+ "_row-commit-version-col-",
+ 2147483539L);
Contributor

Do we always apply this ID regardless of Iceberg compat? I think we should only do this for v3.

});

if (currRowIdHighWatermark.get() != prevRowIdHighWatermark) {
// If the client has explicitly provided a row ID high watermark, we should use that value
Contributor

nit: "client" is ambiguous here. Should it be "transaction builder"?

@KaiqiJinWow KaiqiJinWow changed the title [WIP][Kernel] Support enabling RowTracking [Kernel] Support enabling RowTracking Jun 25, 2025
});

if (currRowIdHighWatermark.get() != prevRowIdHighWatermark) {
// If the client has explicitly provided a row ID high watermark, we should use that value
Collaborator

@johanl-db johanl-db Jul 31, 2025

@KaiqiJinWow
Can you provide more context on this change? I'm not sure I see why a client would need to provide an explicit high-water mark instead of relying on Kernel automatically raising it on write.

In particular:

  • How is this going to be used? I assume this has to do with Iceberg compat; that would be useful to cover in the PR description / title.
  • Where is the provided high-water mark coming from?

Collaborator Author

Hi, this PR is stale. Please check the merged one: #4856

@KaiqiJinWow KaiqiJinWow closed this Aug 8, 2025