
[GH-1918] Spark 4 support #1919


Draft · wants to merge 21 commits into master

Conversation

@Kimahriman (Contributor) commented Apr 13, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject.

What changes were proposed in this PR?

Add support for Spark 4.

This required several updates:

  • A new Maven profile is added for Spark 4.
  • The spark/common module has separate source directories for Spark 3 and Spark 4. This has to happen in the common module because code there depends on the version-specific shims. The main breaking changes that required this are (see the sketches after this list):
    • Column objects are no longer wrappers around Expression objects; they now wrap a new ColumnNode construct introduced for Spark Connect support. Keeping the expression wrapping working requires a different setup per version. I initially tried to handle this through reflection, but that got pretty messy, and separate artifacts will be required anyway, so I added the conditional source directories.
    • Creating a DataFrame from an RDD has to use the new location of the "classic" DataFrame class.
    • The NullIntolerant trait no longer exists; instead, it's an overridable function on an expression.
  • jt-jiffle-language and its antlr dependency have to be shaded into the common module for Spark 4 to work. This is because antlr 4.10 made an internal version bump such that dependencies compiled with antlr < 4.10 can't run at runtime against antlr >= 4.10. I think jt-jiffle-language has an Apache license, so I think this is OK? Currently it's a provided dependency that comes with the external geotools-wrapper. This needs some verification, though, or thoughts on any alternative approach.
  • I copied the spark-3.5 module as-is to spark-4.0. The only changes I had to make were to the new Arrow UDF support that was added recently. Could these also just be moved into conditional source directories in spark/common?
  • DBSCAN tests are ignored on Spark 4 because the current graphframes dependency does not support Spark 4. I've been messing around with getting graphframes updated as well.
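
For illustration, here is a minimal sketch of the Column/Expression shim split. The ExpressionShims object and its method names are illustrative rather than the exact code in this PR, and the Spark 4 ExpressionUtils helper reflects my reading of the new internals, so treat the exact calls as assumptions:

// spark/common, Spark 3 source directory: Column still wraps an Expression.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression

object ExpressionShims {
  def column(e: Expression): Column = new Column(e)
  def expression(c: Column): Expression = c.expr
}

// spark/common, Spark 4 source directory: Column wraps a ColumnNode, so the
// conversion goes through Spark's internal helpers (assumed here to be
// org.apache.spark.sql.classic.ExpressionUtils).
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.classic.ExpressionUtils

object ExpressionShims {
  def column(e: Expression): Column = ExpressionUtils.column(e)
  def expression(c: Column): Expression = ExpressionUtils.expression(c)
}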
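
Similarly, creating a DataFrame from an RDD: as I understand it, the RDD-based createDataFrame overloads now live only on the relocated "classic" session in Spark 4. A rough sketch (DataFrameShims is an illustrative name):

// spark/common, Spark 4 source directory: the non-Connect implementation
// classes moved to org.apache.spark.sql.classic in Spark 4.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.classic.{SparkSession => ClassicSparkSession}
import org.apache.spark.sql.types.StructType

object DataFrameShims {
  // Downcast to the classic session, which still exposes the RDD overloads.
  def createDataFrame(spark: SparkSession, rdd: RDD[Row], schema: StructType): DataFrame =
    spark.asInstanceOf[ClassicSparkSession].createDataFrame(rdd, schema)
}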
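
And the NullIntolerant change, shown on a hypothetical expression (ST_ExampleLength is made up for illustration, and the nullIntolerant member name follows the Spark 4 change as I understand it):

// spark/common, Spark 3 source directory: null-intolerance is a marker trait.
import org.apache.spark.sql.catalyst.expressions.{Expression, NullIntolerant, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, IntegerType}
import org.apache.spark.unsafe.types.UTF8String

case class ST_ExampleLength(child: Expression)
    extends UnaryExpression with CodegenFallback with NullIntolerant {
  override def dataType: DataType = IntegerType
  // Only invoked on non-null input; returns the string length.
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[UTF8String].numChars()
  override protected def withNewChildInternal(newChild: Expression): ST_ExampleLength =
    copy(child = newChild)
}

// spark/common, Spark 4 source directory: the trait is gone; the same intent is
// expressed by overriding a member on Expression instead.
// (Same imports as above, minus NullIntolerant.)
case class ST_ExampleLength(child: Expression)
    extends UnaryExpression with CodegenFallback {
  override def nullIntolerant: Boolean = true
  override def dataType: DataType = IntegerType
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[UTF8String].numChars()
  override protected def withNewChildInternal(newChild: Expression): ST_ExampleLength =
    copy(child = newChild)
}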

How was this patch tested?

Existing UTs.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

Maybe the supported Spark versions listed in the docs need to change? I haven't looked at the docs yet.

@jiayuasu requested a review from Kontinuation, April 14, 2025 21:34
@@ -44,7 +44,7 @@ jobs:
       - name: Compile JavaDoc
         run: mvn -q clean install -DskipTests && mkdir -p docs/api/javadoc/spark && cp -r spark/common/target/apidocs/* docs/api/javadoc/spark/
       - name: Compile ScalaDoc
-        run: mvn scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
+        run: mvn generate-sources scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
@Kimahriman (Contributor, Author) commented:

This was the only way I could figure out to get the Scala docs to be aware of the additional source directories.

/**
* A physical plan that evaluates a [[PythonUDF]].
*/
case class SedonaArrowEvalPythonExec(

@Kimahriman (Contributor, Author) commented:

This Arrow eval is the only thing I had to update from the spark-3.5 module to the spark-4.0 module, due to some API changes. It looks like starting in Spark 4.1, they added support for UDTs in Arrow UDFs.

Comment on lines +147 to +148
<!-- We need to shade jiffle and its antlr dependency because Spark 4 uses an
incompatible version of antlr at runtime. -->

A Member commented:

Can we shade it in geotools-wrapper so that no dependency-reduced POM will be generated when building sedona-common? @jiayuasu

@Kimahriman (Contributor, Author) replied:

It definitely needs to be shaded locally for the tests to work. I'm not 100% sure whether the release could just be shaded into geotools-wrapper or not. My concern is that if you somehow have jiffle as a separate dependency, those classes would be used with the provided antlr and not the relocated antlr dependency.
