[GH-1918] Spark 4 support #1919
Conversation
.github/workflows/docs.yml
@@ -44,7 +44,7 @@ jobs:
      - name: Compile JavaDoc
        run: mvn -q clean install -DskipTests && mkdir -p docs/api/javadoc/spark && cp -r spark/common/target/apidocs/* docs/api/javadoc/spark/
      - name: Compile ScalaDoc
-       run: mvn scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
+       run: mvn generate-sources scala:doc && mkdir -p docs/api/scaladoc/spark && cp -r spark/common/target/site/scaladocs/* docs/api/scaladoc/spark
This was the only way I could figure out to make the ScalaDoc build aware of the additional source directory.
/**
 * A physical plan that evaluates a [[PythonUDF]].
 */
case class SedonaArrowEvalPythonExec(
This arrow eval is the only thing I had to update from the spark-3.5 module to the spark-4.0 module due to some API changes. It looks like starting in 4.1 they added support for UDTs in Arrow UDFs.
common/pom.xml
<!-- We need to shade jiffle and its antlr dependency because Spark 4 uses an
     incompatible version of antlr at runtime. -->
Can we shade it in geotools-wrapper so that no dependency-reduced POM is generated when building sedona-common? @jiayuasu
It definitely needs to be shaded locally for the tests to work. I'm not 100% sure whether the release could just be shaded into geotools-wrapper or not. My concern is that if you somehow have jiffle as a separate dependency, those classes would be used with the provided antlr and not the relocated antlr dependency.
Working on getting the Python CI to work, I realized I never updated that. In the meantime, do we want to increase minimum versions for other things to reduce the testing matrix? Specifically, maybe dropping Spark 3.3 and Python 3.7 support?
Fine by me. Feel free to create a PR to drop 3.3 support (and code) and remove Python 3.7 from the test matrix.
OK, got the Python tests working; it just required a few more updates:
Now that Spark 4 has been officially released, I think this is ready. The jiffle/antlr shading is the main outstanding question. I think the cleanest approach is to just shade it directly into sedona-common, but I'm not a shading or license expert by any means.
I'm on leave right now, so I won't be able to update off the latest master for a couple of weeks at least, if anyone else wants to get this finished up. The only outstanding thing before a new release for Spark 4 support would be the graphframes DBSCAN issue. I started trying to work on updating that as well, but I haven't had time to finish it yet and am not sure how much progress any of the other people working on graphframes have made.
I'm taking a look at graphframes for Spark 4.
Should we merge this? @Kimahriman is getting close on graphframes 4.0 support, so we will have DBSCAN unblocked soon as well.
Yeah, I think we should merge this so it doesn't fall behind again; we can undo the DBSCAN test changes once a new graphframes release comes out, hopefully soonish 🤞
Thank you for the hard work, @Kimahriman!
@james-willis I started testing out a local build of the graphframes updates and am actually getting some failing tests for DBSCAN. It looks like it has to do with graphframes/graphframes#320, which preserves the original ID instead of always using the generated long ID, so the component ID is not always a long anymore. I'm not sure of the best way to address that with how the physical functions and such work. It would be great if you could take a look since you set most of that up.
OK, I'm talking to Sem about this now. I would like to avoid making a breaking change to our API if possible.
@Kimahriman would such a solution (graphframes/graphframes#620) be OK for you? tl;dr: setting
That sounds fine by me; I think it should be doable to set a temporary SQL conf around the call to connected components.
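A minimal sketch of what wrapping the connected components call in a temporary SQL conf could look like, assuming the standard SparkSession runtime conf API. The helper name is hypothetical, and the actual conf key from the graphframes change is not shown in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper: set a conf for the duration of `body`, then restore
// whatever value (or absence of one) was there before.
def withTempConf[T](spark: SparkSession, key: String, value: String)(body: => T): T = {
  val previous = spark.conf.getOption(key)
  spark.conf.set(key, value)
  try body
  finally previous match {
    case Some(v) => spark.conf.set(key, v)
    case None    => spark.conf.unset(key)
  }
}
```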
I will probably use the Sedona session configurator to default this to true in Sedona applications as long as it isn't explicitly set. I'll also add support for returning strings in case this conf is set to true.
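A rough sketch of the "default it unless it was explicitly set" idea, again with a placeholder conf key since the real graphframes option name is not given here:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder key; the real graphframes option name is not named in this thread.
def applyDefaultIfUnset(spark: SparkSession): Unit = {
  val key = "graphframes.someOption"
  if (spark.conf.getOption(key).isEmpty) {
    spark.conf.set(key, "true")
  }
}
```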
Did you read the Contributor Guide?
Is this PR related to a ticket?
Yes, and the PR name follows the format [GH-XXX] my subject. Resolves #1918.
What changes were proposed in this PR?
Add support for Spark 4.
This required several updates:
- The spark/common module has source directories for Spark 3 and 4 respectively. I had to do it in the common module because things in the common module depend on the version-specific shims. The main breaking changes that required this are:
  - Column objects are no longer wrappers around Expression objects, but a new ColumnNode construct for Spark Connect support. Supporting the expression wrapping requires a different setup. Initially I started working on this through reflection, but that got pretty messy, and this will require different artifacts anyway, so I added the conditional source directories. (A minimal sketch of this shim pattern follows this list.)
  - The NullIntolerant trait no longer exists; instead it's an overridable function on an expression.
- jt-jiffle-language and its antlr dependency have to be shaded into the common module for Spark 4 to work. This is because antlr 4.10 included an internal version bump such that dependencies compiled with antlr < 4.10 can't run at runtime with >= 4.10. I think jt-jiffle-language has an Apache license, so I think this is ok? Currently it's a provided dependency that comes with the external geotools-wrapper. But this needs some verification, or thoughts on any alternative approach.
- The spark-3.5 module was copied as is to spark-4.0. The only changes I had to make were to the new Arrow UDF stuff that was added recently. Could these also just be moved as conditional source directories in spark/common?
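As referenced in the first bullet above, here is a minimal sketch of the conditional-source-directory shim idea for the Column/Expression change. The object, method names, and directory path are illustrative, not Sedona's actual shim API; only the Spark 3.x side is written out, using the public Column constructor and expr field that exist there.

```scala
// Illustrative file in a Spark 3.x-only source directory, e.g. src/main/scala-spark-3/
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical shim object; Sedona's real shims may be named differently.
object ColumnShims {
  // In Spark 3.x a Column directly wraps a catalyst Expression.
  def wrap(e: Expression): Column = new Column(e)
  def unwrap(c: Column): Expression = c.expr
}

// A Spark 4.x source directory would provide an object with the same name and
// signatures, implemented against the new ColumnNode-based Column instead.
```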
How was this patch tested?
Existing UTs.
Did this PR include necessary documentation updates?
Maybe supported versions need to change? Haven't looked at the docs yet.