[GH-1918] Spark 4 support #1919

Merged · 34 commits · Jun 25, 2025
Conversation

@Kimahriman (Contributor) commented Apr 13, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [GH-XXX] my subject.

Resolves #1918

What changes were proposed in this PR?

Add support for Spark 4.

This required several updates:

  • A new profile is added for Spark 4
  • The spark/common module has conditional source directories for Spark 3 and Spark 4 respectively. This had to happen in the common module because things there depend on the version-specific shims. The main breaking changes that required this are:
    • Column objects are no longer wrappers around Expression objects; they now wrap a new ColumnNode construct added for Spark Connect support. Supporting the expression wrapping requires a different setup per version. I initially started working on this through reflection, but that got pretty messy, and this will require different artifacts anyway, so I added the conditional source directories (see the hedged shim sketch after this list).
    • Creating a DataFrame from an RDD has to use the new location of the "classic" DataFrame class.
    • The NullIntolerant trait no longer exists; instead it's an overridable method on an expression (see the second sketch after this list).
  • jt-jiffle-language and its antlr dependency have to be shaded into the common module for Spark 4 to work. antlr 4.10 bumped an internal version, so dependencies compiled with antlr < 4.10 can't run at runtime against antlr >= 4.10. I think jt-jiffle-language has an Apache license, so I think this is OK? Currently it's a provided dependency that comes with the external geotools-wrapper, but this needs some verification or thoughts on an alternative approach.
  • I copied the spark-3.5 module as-is to spark-4.0. The only changes I had to make were to the new Arrow UDF support that was added recently. Could these also just be moved into conditional source directories in spark/common?
  • DBSCAN tests are ignored on Spark 4 because the current graphframes dependency does not support Spark 4. I've been working on getting graphframes updated as well.
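
For context on the first breaking change, here is a minimal, hypothetical sketch of the shim idea; ColumnShim, wrap, and unwrap are illustrative names rather than the actual code in this PR. The Spark 3 source directory can compile against the Expression-backed Column API, while the Spark 4 directory would provide an object with the same signatures built on the new ColumnNode-based internals.

// Hypothetical shim placed in the Spark 3 conditional source directory.
// The Spark 4 source directory would provide an object with the same
// signatures implemented against the ColumnNode-based Column internals.
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression

object ColumnShim {
  // Wrap a Catalyst Expression in a user-facing Column (works on Spark 3.x,
  // where Column still exposes an Expression-based constructor).
  def wrap(expr: Expression): Column = new Column(expr)

  // Unwrap the underlying Expression from a Column (works on Spark 3.x,
  // where Column.expr is still available).
  def unwrap(col: Column): Expression = col.expr
}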
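
And a similarly hedged sketch of the NullIntolerant change, using a made-up PlusOne expression for illustration; the Spark 4 method name is assumed here based on the description above, not taken from this PR.

// Spark 3.x flavour: null-intolerance is declared by mixing in the trait.
import org.apache.spark.sql.catalyst.expressions.{Expression, NullIntolerant, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, IntegerType}

case class PlusOne(child: Expression) extends UnaryExpression with NullIntolerant {
  override def dataType: DataType = IntegerType
  override protected def nullSafeEval(input: Any): Any = input.asInstanceOf[Int] + 1
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"$c + 1")
  override protected def withNewChildInternal(newChild: Expression): Expression =
    copy(child = newChild)
}

// Spark 4.x flavour (in the Spark 4 source directory): the trait is gone, so the
// same expression would instead override a method on Expression, assumed here to
// be named nullIntolerant:
//   case class PlusOne(child: Expression) extends UnaryExpression {
//     override def nullIntolerant: Boolean = true
//     // ...same members as above
//   }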

How was this patch tested?

Existing UTs.

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

Maybe the supported versions listed in the docs need to change? I haven't looked at the docs yet.

@jiayuasu jiayuasu requested a review from Kontinuation April 14, 2025 21:34
@@ -44,7 +44,7 @@ jobs:
       - name: Compile JavaDoc
         run: mvn -q clean install -DskipTests && mkdir -p docs/api/javadoc/spark && cp -r spark/common/target/apidocs/* docs/api/javadoc/spark/
       - name: Compile ScalaDoc
-        run: mvn scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
+        run: mvn generate-sources scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
Contributor Author

This was the only way I could figure out to get the scala docs to be aware of the additional source directory

/**
 * A physical plan that evaluates a [[PythonUDF]].
 */
case class SedonaArrowEvalPythonExec(
Contributor Author

This arrow eval is the only thing I had to update from the spark-3.5 module to the spark-4.0 module due to some API changes. It looks like starting in 4.1 they added support for UDTs in Arrow UDFs.

common/pom.xml Outdated
Comment on lines 147 to 148
<!-- We need to shade jiffle and its antlr dependency because Spark 4 uses an
incompatible version of antlr at runtime. -->
Member

Can we shade it in geotools-wrapper so that no dependency reduced pom will be generated when building sedona-common? @jiayuasu

Contributor Author

It definitely needs to be shaded locally for the tests to work. I'm not 100% sure whether the release could just be shaded into geotools-wrapper or not. My concern is that if you somehow have jiffle as a separate dependency, those classes would be used with the provided antlr and not the relocated antlr dependency.

@Kimahriman (Contributor Author)

Working on getting the Python CI to work; I realized I never updated that.

In the meantime, do we want to raise minimum versions for other things to reduce the testing matrix? Specifically, maybe dropping Spark 3.3 and Python 3.7 support?

@jiayuasu (Member) commented May 23, 2025

@Kimahriman

Fine by me. Feel free to create a PR to drop Spark 3.3 support (and code) and remove Python 3.7 from the test matrix.

@Kimahriman (Contributor Author) commented May 24, 2025

OK, got the Python tests working; it just required a few more updates:

  • Spark 4 requires Pandas 2, so had to upgrade the Pipfile dependency.
  • Also had to disable the DBSCAN Python tests.
  • Had to fix jiffle being double-shaded into the spark-shaded module.

Now that Spark 4 has been officially released, I think this is ready. The jiffle/antlr shading is the main outstanding question. I think the cleanest approach is to just shade it directly into sedona-common, but I'm not a shading or license expert by any means.

@Kimahriman Kimahriman marked this pull request as ready for review May 24, 2025 11:36
@Kimahriman Kimahriman requested a review from jiayuasu as a code owner May 24, 2025 11:36
@Kimahriman (Contributor Author)

I'm on leave right now, so I won't be able to update off latest master for a couple of weeks at least, in case anyone else wants to get this finished up. The only outstanding thing before a new release with Spark 4 support is the graphframes DBSCAN issue. I started working on updating that as well, but I haven't had time to finish and I'm not sure how much progress the other people working on graphframes have made.

@james-willis (Collaborator)

I'm taking a look at graphframes for Spark 4.

@james-willis (Collaborator)

Should we merge this? @Kimahriman is getting close on graphframes 4.0 support as well, so DBSCAN will be unblocked soon too.

@Kimahriman (Contributor Author)

Yeah, I think we should merge this so it doesn't fall behind again; we can undo the DBSCAN test changes once a new graphframes release comes out, hopefully soonish 🤞

@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Jun 25, 2025
@jiayuasu jiayuasu merged commit 0cc1521 into apache:master Jun 25, 2025
35 checks passed
@jiayuasu (Member)

Thank you for the hard work, @Kimahriman!

@Kimahriman (Contributor Author)

@james-willis I started testing a local build of the graphframes updates and I'm actually getting some failing tests for DBSCAN. It looks like it has to do with graphframes/graphframes#320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that with how the physical functions and such work. It would be great if you could take a look, since you set most of that up.

@james-willis (Collaborator)

OK, I'm talking to Sem about this now. I would like to avoid making a breaking change to our API if possible.

@SemyonSinchenko (Member)

> @james-willis I started testing a local build of the graphframes updates and I'm actually getting some failing tests for DBSCAN. It looks like it has to do with graphframes/graphframes#320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that with how the physical functions and such work. It would be great if you could take a look, since you set most of that up.

@Kimahriman would such a solution (graphframes/graphframes#620) be OK for you? TL;DR: setting conf("spark.graphframes.useLabelsAsComponents", "false") will preserve the generated long ID.

@Kimahriman (Contributor Author)

That sounds fine by me; I think it should be doable to set a temporary SQL conf around the call to connected components, roughly as in the sketch below.
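
For example, something like this (illustrative only; connectedComponentsWithLongIds is a hypothetical helper, and the conf key assumes the graphframes/graphframes#620 behavior described above):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.graphframes.GraphFrame

// Temporarily force generated long component IDs around the connected-components
// call, then restore whatever the caller had configured.
def connectedComponentsWithLongIds(spark: SparkSession, graph: GraphFrame): DataFrame = {
  val key = "spark.graphframes.useLabelsAsComponents"
  val previous = spark.conf.getOption(key)   // remember any explicit user setting
  spark.conf.set(key, "false")               // "false" keeps the generated long IDs
  try graph.connectedComponents.run()
  finally previous match {
    case Some(value) => spark.conf.set(key, value)
    case None        => spark.conf.unset(key)
  }
}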

@james-willis (Collaborator)

I will probably use the Sedona session configurator to default this to true in Sedona applications as long as it isn't explicitly set (a hedged sketch of that idea follows below). I'll also add support for returning strings in case this conf is set to true.
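
A minimal sketch of that defaulting idea, assuming only standard Spark conf APIs; the actual Sedona session configurator hook may look different:

import org.apache.spark.sql.SparkSession

// Only apply the default when the user has not set the conf explicitly.
def applyGraphFramesDefaults(spark: SparkSession): Unit = {
  val key = "spark.graphframes.useLabelsAsComponents"
  if (spark.conf.getOption(key).isEmpty) {
    spark.conf.set(key, "true")   // default to label-preserving behavior in Sedona apps
  }
}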

Development

Successfully merging this pull request may close these issues: Spark 4 support (#1918)

5 participants