Conversation
Force-pushed from 21be2c9 to 1f04b26
rexminnis left a comment
Thanks for putting this together — the CommandLineUtilsBridge pattern and the SparkSubmit rework are clean solutions to the cross-version API drift. A few things I noticed:
- Bug: `spark340/SparkSqlUtils.toArrowRDD` has infinite recursion (see inline comment)
- Java target: `maven.compiler.source` is still `1.8` — worth bumping to 17?
- Spark version: `spark410.version` targets 4.1.0 — consider 4.1.1 (current release)
Happy to help with testing or any of the shim work. I have a working Spark 4.1.1 setup locally and have been validating the Arrow conversion paths end-to-end.
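For concreteness, the kind of end-to-end check I've been running looks roughly like this (a minimal sketch: the local-mode session setup and the import path for the shim's `SparkSqlUtils` are placeholders, not code from this PR):

```scala
import org.apache.spark.sql.SparkSession
// Hypothetical import path for the shim under review; adjust to the real package.
import org.apache.spark.sql.raydp.SparkSqlUtils

object ArrowPathSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("arrow-path-smoke-test")
      .getOrCreate()
    try {
      val df = spark.range(0, 10000).toDF("id")
      // Round the DataFrame through the shim's Arrow conversion and verify
      // that we get non-empty serialized record batches back.
      val batches = SparkSqlUtils.toArrowRDD(df, spark).collect()
      assert(batches.nonEmpty, "expected at least one Arrow batch")
      assert(batches.forall(_.nonEmpty), "expected non-empty batch payloads")
    } finally {
      spark.stop()
    }
  }
}
```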
```scala
    ArrowUtils.toArrowSchema(schema = schema, timeZoneId = timeZoneId)
  }

  def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
```
Bug — this is infinitely recursive. `SparkSqlUtils.toArrowRDD` calls itself:

```scala
def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
  SparkSqlUtils.toArrowRDD(dataFrame, dataFrame.sparkSession)
}
```

This will `StackOverflowError` at runtime. Should be `dataFrame.toArrowBatchRdd` like the other shims (spark322, spark330, spark350).
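For reference, a sketch of the fix mirroring the spark350 shim (assuming this shim, like the others, lives in a package that can reach Spark's `private[sql]` `toArrowBatchRdd`):

```scala
def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
  // Delegate to Spark's built-in Arrow batch conversion instead of recursing.
  dataFrame.toArrowBatchRdd
}
```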
good catch, fixed.
```
@@ -29,9 +31,9 @@
     <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
```
The Maven compiler source/target is still 1.8. Since Spark 4.x requires Java 17 at runtime and CI now uses JDK 17, should we bump the compile target to 17 as well? This would catch any bytecode-level incompatibilities at compile time rather than runtime.
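Concretely, that would be something like this in the parent pom (a sketch of the suggested change; the property names match the existing fragment, only the values move from `1.8` to `17`):

```xml
<properties>
  <!-- Bump from 1.8: Spark 4.x requires Java 17, so catch bytecode issues at compile time. -->
  <maven.compiler.source>17</maven.compiler.source>
  <maven.compiler.target>17</maven.compiler.target>
</properties>
```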
```xml
    <spark340.version>3.4.0</spark340.version>
    <spark350.version>3.5.0</spark350.version>
    <spark400.version>4.0.0</spark400.version>
    <spark410.version>4.1.0</spark410.version>
```
Minor: `spark410.version` is 4.1.0 — worth bumping to 4.1.1 (current release)? The `SparkShimProvider` already covers it at runtime, but compiling against the latest patch would catch any API changes at build time.
I would keep it 4.1.0 -- the idea is that we should support the minimum API from the initial version; otherwise the lib might introduce breaking changes between Spark's patch versions (Spark is supposed to be backward compatible across patch releases).
Force-pushed from 7acc670 to c40d89d
Force-pushed from 26d576d to ac217b9
This PR adapts raydp to Spark 4.x but leaves the following work for future improvement:
To make the tests pass, this PR is based on #458. Once #458 is merged, this PR should be rebased.