[GH-1918] Spark 4 support #1919

Merged · 34 commits · Jun 25, 2025
Conversation

@Kimahriman (Contributor) commented Apr 13, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject.

Resolves #1918

What changes were proposed in this PR?

Add support for Spark 4.

This required several updates:

  • A new profile is added for Spark 4
  • The spark/common module has conditional source directories for Spark 3 and Spark 4 respectively. This had to happen in the common module because things there depend on the version-specific shims. The main breaking changes that required this are:
    • Column objects are no longer wrappers around Expression objects; they now wrap a new ColumnNode construct added for Spark Connect support. Supporting the expression wrapping requires a different setup per version. I initially started working on this through reflection, but that got pretty messy, and this will require different artifacts anyway, so I added the conditional source directories (see the hedged shim sketch after this list).
    • Creating a DataFrame from an RDD has to use the new location of the "classic" DataFrame class.
    • The NullIntolerant trait no longer exists; instead it's an overridable method on an expression (see the second sketch after this list).
  • jt-jiffle-language and its antlr dependency have to be shaded into the common module for Spark 4 to work. antlr 4.10 bumped an internal version, so dependencies compiled with antlr < 4.10 can't run at runtime against antlr >= 4.10. I think jt-jiffle-language has an Apache license, so I think this is OK? Currently it's a provided dependency that comes with the external geotools-wrapper, but this needs some verification or thoughts on an alternative approach.
  • I copied the spark-3.5 module as-is to spark-4.0. The only changes I had to make were to the new Arrow UDF support that was added recently. Could these also just be moved into conditional source directories in spark/common?
  • DBSCAN tests are ignored on Spark 4 because the current graphframes dependency does not support Spark 4. I've been working on getting graphframes updated as well.
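
For context on the first breaking change, here is a minimal, hypothetical sketch of the shim idea; ColumnShim, wrap, and unwrap are illustrative names rather than the actual code in this PR. The Spark 3 source directory can compile against the Expression-backed Column API, while the Spark 4 directory would provide an object with the same signatures built on the new ColumnNode-based internals.

// Hypothetical shim placed in the Spark 3 conditional source directory.
// The Spark 4 source directory would provide an object with the same
// signatures implemented against the ColumnNode-based Column internals.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression

object ColumnShim {
  // Wrap a Catalyst Expression in a user-facing Column (works on Spark 3.x,
  // where Column still exposes an Expression-based constructor).
  def wrap(expr: Expression): Column = new Column(expr)

  // Unwrap the underlying Expression from a Column (works on Spark 3.x,
  // where Column.expr is still available).
  def unwrap(col: Column): Expression = col.expr
}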
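
And a similarly hedged sketch of the NullIntolerant change, using a made-up PlusOne expression for illustration; the Spark 4 method name is assumed here based on the description above, not taken from this PR.

// Spark 3.x flavour: null-intolerance is declared by mixing in the trait.
import org.apache.spark.sql.catalyst.expressions.{Expression, NullIntolerant, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, IntegerType}

case class PlusOne(child: Expression) extends UnaryExpression with NullIntolerant {
  override def dataType: DataType = IntegerType
  override protected def nullSafeEval(input: Any): Any = input.asInstanceOf[Int] + 1
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"$c + 1")
  override protected def withNewChildInternal(newChild: Expression): Expression =
    copy(child = newChild)
}

// Spark 4.x flavour (in the Spark 4 source directory): the trait is gone, so the
// same expression would instead override a method on Expression, assumed here to
// be named nullIntolerant:
//   case class PlusOne(child: Expression) extends UnaryExpression {
//     override def nullIntolerant: Boolean = true
//     // ...same members as above
//   }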

How was this patch tested?

Existing UTs.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

Maybe the supported versions listed in the docs need to change? I haven't looked at the docs yet.

@jiayuasu jiayuasu requested a review from Kontinuation April 14, 2025 21:34
@@ -44,7 +44,7 @@ jobs:
       - name: Compile JavaDoc
         run: mvn -q clean install -DskipTests && mkdir -p docs/api/javadoc/spark && cp -r spark/common/target/apidocs/* docs/api/javadoc/spark/
       - name: Compile ScalaDoc
-        run: mvn scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
+        run: mvn generate-sources scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
Contributor Author

This was the only way I could figure out to get the scala docs to be aware of the additional source directory

/**
 * A physical plan that evaluates a [[PythonUDF]].
 */
case class SedonaArrowEvalPythonExec(
Contributor Author

This arrow eval is the only thing I had to update from the spark-3.5 module to the spark-4.0 module due to some API changes. It looks like starting in 4.1 they added support for UDTs in Arrow UDFs.

common/pom.xml Outdated
Comment on lines 147 to 148
<!-- We need to shade jiffle and its antlr dependency because Spark 4 uses an
incompatible version of antlr at runtime. -->
Member

Can we shade it in geotools-wrapper so that no dependency reduced pom will be generated when building sedona-common? @jiayuasu

Contributor Author

It definitely needs to be shaded locally for the tests to work. I'm not 100% sure whether the release could just be shaded into geotools-wrapper or not. My concern is that if you somehow have jiffle as a separate dependency, those classes would be used with the provided antlr and not the relocated antlr dependency.

@Kimahriman (Contributor Author)

Working on getting the Python CI to work; I realized I never updated that.

In the meantime, do we want to raise minimum versions for other things to reduce the testing matrix? Specifically, maybe dropping Spark 3.3 and Python 3.7 support?

@jiayuasu (Member) commented May 23, 2025

@Kimahriman

Fine by me. Feel free to create a PR to drop Spark 3.3 support (and code) and remove Python 3.7 from the test matrix.

@Kimahriman (Contributor Author) commented May 24, 2025

OK, got the Python tests working; it just required a few more updates:

  • Spark 4 requires Pandas 2, so had to upgrade the Pipfile dependency.
  • Also had to disable the DBSCAN Python tests.
  • Had to fix jiffle being double-shaded into the spark-shaded module.

Now that Spark 4 has been officially released, I think this is ready. The jiffle/antlr shading is the main outstanding question. I think the cleanest approach is to just shade it directly into sedona-common, but I'm not a shading or license expert by any means.

@Kimahriman Kimahriman marked this pull request as ready for review May 24, 2025 11:36
@Kimahriman Kimahriman requested a review from jiayuasu as a code owner May 24, 2025 11:36
@Kimahriman (Contributor Author)

I'm on leave right now, so I won't be able to update off latest master for a couple of weeks at least, in case anyone else wants to get this finished up. The only outstanding thing before a new release with Spark 4 support is the graphframes DBSCAN issue. I started working on updating that as well, but I haven't had time to finish and I'm not sure how much progress the other people working on graphframes have made.

@james-willis (Collaborator)

I'm taking a look at graphframes for Spark 4.

@james-willis (Collaborator)

Should we merge this? @Kimahriman is getting close on graphframes 4.0 support as well, so DBSCAN will be unblocked soon too.

@Kimahriman (Contributor Author)

Yeah, I think we should merge this so it doesn't fall behind again; we can undo the DBSCAN test changes once a new graphframes release comes out, hopefully soonish 🤞

@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Jun 25, 2025
@jiayuasu jiayuasu merged commit 0cc1521 into apache:master Jun 25, 2025
35 checks passed
@jiayuasu (Member)

Thank you for the hard work, @Kimahriman!

@Kimahriman (Contributor Author)

@james-willis I started testing a local build of the graphframes updates and I'm actually getting some failing tests for DBSCAN. It looks like it has to do with graphframes/graphframes#320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that with how the physical functions and such work. It would be great if you could take a look, since you set most of that up.

@james-willis (Collaborator)

OK, I'm talking to Sem about this now. I would like to avoid making a breaking change to our API if possible.

@SemyonSinchenko (Member)

> @james-willis I started testing a local build of the graphframes updates and I'm actually getting some failing tests for DBSCAN. It looks like it has to do with graphframes/graphframes#320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that with how the physical functions and such work. It would be great if you could take a look, since you set most of that up.

@Kimahriman would such a solution (graphframes/graphframes#620) be OK for you? TL;DR: setting conf("spark.graphframes.useLabelsAsComponents", "false") will preserve the generated long ID.

@Kimahriman (Contributor Author)

That sounds fine by me; I think it should be doable to set a temporary SQL conf around the call to connected components, roughly as in the sketch below.
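
For example, something like this (illustrative only; connectedComponentsWithLongIds is a hypothetical helper, and the conf key assumes the graphframes/graphframes#620 behavior described above):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

// Temporarily force generated long component IDs around the connected-components
// call, then restore whatever the caller had configured.
def connectedComponentsWithLongIds(spark: SparkSession, graph: GraphFrame): DataFrame = {
  val key = "spark.graphframes.useLabelsAsComponents"
  val previous = spark.conf.getOption(key)   // remember any explicit user setting
  spark.conf.set(key, "false")               // "false" keeps the generated long IDs
  try graph.connectedComponents.run()
  finally previous match {
    case Some(value) => spark.conf.set(key, value)
    case None        => spark.conf.unset(key)
  }
}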

@james-willis (Collaborator)

I will probably use the Sedona session configurator to default this to true in Sedona applications as long as it isn't explicitly set (a hedged sketch of that idea follows below). I'll also add support for returning strings in case this conf is set to true.
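
A minimal sketch of that defaulting idea, assuming only standard Spark conf APIs; the actual Sedona session configurator hook may look different:

import org.apache.spark.sql.SparkSession

// Only apply the default when the user has not set the conf explicitly.
def applyGraphFramesDefaults(spark: SparkSession): Unit = {
  val key = "spark.graphframes.useLabelsAsComponents"
  if (spark.conf.getOption(key).isEmpty) {
    spark.conf.set(key, "true")   // default to label-preserving behavior in Sedona apps
  }
}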

Development

Successfully merging this pull request may close these issues: Spark 4 support (#1918)

5 participants