Skip to content

Fix read option name for change data feed for Delta [databricks]#13532

Merged
jihoonson merged 4 commits into
NVIDIA:branch-25.10from
jihoonson:fix-cdf-read-tests
Oct 2, 2025
Merged

Fix read option name for change data feed for Delta [databricks]#13532
jihoonson merged 4 commits into
NVIDIA:branch-25.10from
jihoonson:fix-cdf-read-tests

Conversation

@jihoonson

@jihoonson jihoonson commented Sep 29, 2025

Copy link
Copy Markdown
Collaborator

Fixes #12796

Description

We are currently using "readChangeDataFeed" as a read option name to read a table with change data feed in our tests. However, this option name is invalid. We should use "readChangeFeed" instead. This PR fixes the option name and adds asserts to validate the schema of the data read.

I manually ran delta_lake_delete_test.py, delta_lake_update_test.py, and delta_lake_merge_test.py, which are impacted by this change. delta_lake_low_shuffle_merge_test.py is also impacted as well, but it does not run with Delta 3.3. I used Spark 3.5.5 and Delta 3.3.0 for my testing.

Checklists

  • This PR has added documentation for new or modified features or behaviors.
  • This PR has added new tests or modified existing tests to cover new code paths.
    (Please explain in the PR description how the new code paths are tested, such as names of the new/existing tests that cover them.)
  • Performance testing has been performed and its results are added in the PR description. Or, an issue has been filed with a link in the PR description.

Copilot AI review requested due to automatic review settings September 29, 2025 18:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an incorrect option name used when reading Delta tables with change data feed (CDF) functionality. The option name was changed from the invalid "readChangeDataFeed" to the correct "readChangeFeed", and schema validation assertions were added to verify the presence of required CDF columns.

  • Fixed incorrect option name from "readChangeDataFeed" to "readChangeFeed" for Delta CDF reads
  • Added schema validation assertions to ensure CDF columns are present in the returned DataFrame
  • Improved code formatting and added explanatory comment about dropping the commit timestamp column

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Signed-off-by: Jihoon Son <ghoonson@gmail.com>
@jihoonson jihoonson added bug Something isn't working test Only impacts tests labels Sep 29, 2025
@jihoonson

Copy link
Copy Markdown
Collaborator Author

build

@gerashegalov

Copy link
Copy Markdown
Collaborator

Does not delta_lake_write_test.py
require a change?

https://github.com/search?q=repo%3ANVIDIA%2Fspark-rapids%20readChangeDataFeed&type=code

@gerashegalov

Copy link
Copy Markdown
Collaborator

We should run CI with [databricks] to see if this was related to DBR

@jihoonson

Copy link
Copy Markdown
Collaborator Author

Does not delta_lake_write_test.py require a change?

https://github.com/search?q=repo%3ANVIDIA%2Fspark-rapids%20readChangeDataFeed&type=code

Good catch! Modified them to use the util function as well. Thanks.

@jihoonson

Copy link
Copy Markdown
Collaborator Author

We should run CI with [databricks] to see if this was related to DBR

I'm confused. Does pre-merge CI run delta tests if [databricks] is tagged?

@jihoonson jihoonson changed the title Fix read option name for change data feed for Delta Fix read option name for change data feed for Delta [databricks] Sep 29, 2025
@jihoonson

Copy link
Copy Markdown
Collaborator Author

build

gerashegalov
gerashegalov previously approved these changes Sep 29, 2025

@gerashegalov gerashegalov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jihoonson

Copy link
Copy Markdown
Collaborator Author
2025-09-29T22:19:45.1931386Z [2025-09-29T22:16:13.789Z] [2025-09-29T22:13:22.679Z] 2025-09-29 22:13:22,465 [Thread-5] ERROR org.apache.spark.sql.connect.service.SparkConnectService - Could not start **** Connect GRPC service
2025-09-29T22:19:45.1931684Z [2025-09-29T22:16:13.789Z] [2025-09-29T22:13:22.679Z] java.io.IOException: Failed to bind to address ****/****:15002
...
2025-09-29T22:19:45.1952170Z [2025-09-29T22:16:13.790Z] [2025-09-29T22:13:22.680Z] Caused by: grpc_shaded.io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use

The blossom-ci failure looks like some intermittent issue. Re-triggering it.

@jihoonson

Copy link
Copy Markdown
Collaborator Author

build

@gerashegalov

Copy link
Copy Markdown
Collaborator

We should run CI with [databricks] to see if this was related to DBR

I'm confused. Does pre-merge CI run delta tests if [databricks] is tagged?

I assumed yes based on
https://github.com/NVIDIA/spark-rapids/blob/a1e36ec956299220a2c2a001b42dc65e894ade47/jenkins/databricks/test.sh#L137-L141

but now that I see runit.sh I am not sure which script is run when. cc @pxLi ?

@jihoonson

Copy link
Copy Markdown
Collaborator Author

build

@jihoonson

Copy link
Copy Markdown
Collaborator Author

I xfailed the failing low shuffle merge tests and filed #13552 to fix them. @gerashegalov could you please have another look?

@gerashegalov gerashegalov left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jihoonson jihoonson merged commit 0fa20b1 into NVIDIA:branch-25.10 Oct 2, 2025
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working test Only impacts tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Delta CDF tests potentially use an incorrect read option

4 participants