
[Spark] Delta Connect python client implementation + tests (ported from the branch-4.0-preview) #4514

Conversation


@allisonport-db allisonport-db commented May 8, 2025

Which Delta project/connector is this regarding?

- [x] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

Description

Based off of #4513

Adds the Delta Connect Python client from branch-4.0-preview1 to master, since we are now able to run our Python tests using the latest Spark 4.0 RC.

How was this patch tested?

Does this PR introduce any user-facing changes?

@allisonport-db force-pushed the run-python-tests-spark-master-2025-delta-connect branch from c584603 to a6e07bc on May 8, 2025 17:57
@allisonport-db force-pushed the run-python-tests-spark-master-2025-delta-connect branch from a6e07bc to 770cce1 on May 8, 2025 20:29
# TODO: In the future, find a way to get these
# packages locally instead of downloading from Maven.
delta_connect_packages = ["com.google.protobuf:protobuf-java:3.25.1",
"org.apache.spark:spark-connect_2.13:4.0.0-preview1",
@longvu-db longvu-db (Contributor) commented May 9, 2025:

Suggested change
"org.apache.spark:spark-connect_2.13:4.0.0-preview1",

Could we change this in all places to latest RC4, like build.sbt?

@allisonport-db (Collaborator, Author) replied:

yeah also good catch

@allisonport-db (Collaborator, Author) replied:

Did not realize we manually add deps in build.sbt for specific releases 🐙; updated them now
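For context on the reviewed snippet, Maven coordinates like the `delta_connect_packages` list are typically joined into a single comma-separated string for Spark's `--packages` option or the `spark.jars.packages` config. A minimal sketch, using the versions from the snippet above (which the review then bumped to the latest RC, so treat them as examples only):

```python
# Hedged sketch: turning a list of Maven coordinates into the form Spark
# accepts. The coordinates mirror the reviewed snippet; the real PR moved
# to a newer Spark RC, so these versions are illustrative.
delta_connect_packages = [
    "com.google.protobuf:protobuf-java:3.25.1",
    "org.apache.spark:spark-connect_2.13:4.0.0-preview1",
]

# Spark resolves these from Maven at session start when passed as a
# single comma-separated string (e.g. spark-submit --packages ...).
packages_arg = ",".join(delta_connect_packages)
print(packages_arg)
```

The same string can be set on a session builder via `.config("spark.jars.packages", packages_arg)`.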

setup.py Outdated
else: # MAJOR_VERSION >= 4
# Delta 4.0+ contains Delta Connect code and uses Spark 4.0+
packages_arg = ['delta', 'delta.connect', 'delta.connect.proto']
install_requires_arg = ['pyspark>=4.0.0.dev1', 'importlib_metadata>=1.0.0']
Contributor commented:

Why only dev1? Do we have the latest RC or dev2?

https://pypi.org/project/pyspark/#history

@allisonport-db (Collaborator, Author) replied:

Good catch, this should be 4.0.0. The CI installs 4.0.0 and not the preview, so this would be the wrong requirement.
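To make the setup.py thread above concrete, here is a minimal sketch of how those two variables would feed setuptools. The `setup()` call itself is not shown in the diff, and `MAJOR_VERSION` plus the else-branch contents are assumptions for illustration:

```python
# Minimal sketch, not the actual setup.py. MAJOR_VERSION is assumed to be
# parsed from the Delta version string elsewhere in the real file; the
# else-branch values are illustrative placeholders.
MAJOR_VERSION = 4

if MAJOR_VERSION >= 4:
    # Delta 4.0+ contains Delta Connect code and uses Spark 4.0+, so the
    # Connect subpackages ship and pyspark>=4.0.0 is required (the review
    # above corrected the requirement from 4.0.0.dev1 to 4.0.0).
    packages_arg = ["delta", "delta.connect", "delta.connect.proto"]
    install_requires_arg = ["pyspark>=4.0.0", "importlib_metadata>=1.0.0"]
else:
    packages_arg = ["delta"]
    install_requires_arg = ["pyspark>=3.0.0", "importlib_metadata>=1.0.0"]

# These would then be passed to setuptools, roughly:
# setuptools.setup(name="delta-spark", packages=packages_arg,
#                  install_requires=install_requires_arg, ...)
print(packages_arg, install_requires_arg)
```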

@allisonport-db force-pushed the run-python-tests-spark-master-2025-delta-connect branch from 75638dd to 0bb8355 on May 9, 2025 18:07
allisonport-db added a commit that referenced this pull request May 9, 2025


#### Which Delta project/connector is this regarding?
<!--
Please add the component selected below to the beginning of the pull
request title
For example: [Spark] Title of my pull request
-->

- [X] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [ ] Other (fill in here)

## Description

Currently we don't run our Python tests for our Spark master build, since there is no nightly PySpark snapshot. This PR runs those Python tests using the latest RC for the Spark 4.0 release; when future RCs (or the final release) are published, we will use those instead.

This unblocks Delta Connect Python client development in master, which previously was only merged to the `branch-4.0-preview1` branch since we could not run the Python tests in master.

#4514 is based off of this one and adds the Python Delta Connect client.

## How was this patch tested?

CI tests.
pipenv run pip install protobuf==5.29.1
pipenv run pip install googleapis-common-protos-stubs==2.2.0
pipenv run pip install grpc-stubs==1.24.11
pipenv run pip install https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc6-bin/pyspark-4.0.0.tar.gz
Contributor commented:

I changed this from rc4 to rc6 if that's okay, feel free to put it back if it's not

def setUpClass(cls) -> None:
    # Spark Connect will set SPARK_CONNECT_TESTING_REMOTE, and it does not allow MASTER
    # to be set simultaneously, so we need to clear it.
    if "MASTER" in os.environ:
        del os.environ["MASTER"]

def test_history(self):
    pass

@unittest.skip("cdc has not been implemented yet")
Contributor commented:

Add test_cdc, test_addFeatureSupport, and test_dropFeatureSupport; they are the latest additions to https://github.com/delta-io/delta/blob/master/python/delta/tests/test_deltatable.py

@@ -74,8 +75,15 @@ jobs:
pipenv run pip install pydocstyle==3.0.0
pipenv run pip install pandas==2.2.0
pipenv run pip install pyarrow==11.0.0
pipenv run pip install numpy==1.21
pipenv run pip install https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc4-bin/pyspark-4.0.0.tar.gz
pipenv run pip install pypandoc==1.3.3
@longvu-db longvu-db (Contributor) commented May 15, 2025:

Not sure why we pip install two different versions of wheel (0.43.0 and 0.33.4) right above.

# Spark Connect (required)
grpcio>=1.67.0
grpcio-status>=1.67.0
googleapis-common-protos>=1.65.0
Contributor commented:

Suggested change (add a protobuf pin after the existing line):
  googleapis-common-protos>=1.65.0
  protobuf==5.29.1

https://github.com/apache/spark/blob/6f8fa6b08ff30c41c108fdc1e7af69befcc6915c/dev/requirements.txt#L64C1-L64C17

Contributor replied:

This can be done later
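Minimum-version requirements like the ones discussed above can be sanity-checked at runtime. A hedged sketch, assuming only the package names from the requirements snippet; the helper below is illustrative and not part of the PR:

```python
# Hedged sketch: report installed versions of the Spark Connect
# dependencies against the minimums from the requirements snippet above.
from importlib.metadata import version, PackageNotFoundError

requirements = {
    "grpcio": "1.67.0",
    "grpcio-status": "1.67.0",
    "googleapis-common-protos": "1.65.0",
}

def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

for name, minimum in requirements.items():
    print(f"{name}: installed={installed_version(name)}, required>={minimum}")
```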

@longvu-db longvu-db (Contributor) left a comment:

LGTM

@allisonport-db allisonport-db changed the title [DO_NOT_MERGE] Test adding Delta Connect python client [Spark] Delta Connect python client implementation + tests (ported from the branch-4.0-preview) May 15, 2025
@allisonport-db allisonport-db merged commit b7ff92a into delta-io:master May 15, 2025
22 checks passed