
Support Spark 4.x #450

Open

pang-wu wants to merge 34 commits into ray-project:master from pang-wu:pang/spark4

Conversation


@pang-wu pang-wu commented Dec 8, 2025

This PR adapts RayDP to Spark 4.x but leaves the following work for future improvement:

  1. Support tensorflow 2.16+ (see https://keras.io/getting_started/#tensorflow--keras-2-backwards-compatibility) and numpy 2.x
  2. Support Python 3.12. We dropped Python 3.9 because it is no longer supported by Spark. The Python build system also needs modernizing.
  3. Deprecate Ray AIR.

To make the tests pass, this PR is based on #458. Once #458 is merged, this PR should be rebased again.

@pang-wu pang-wu changed the title Support SPark 4.0.0 Support Spark 4.0.0 Dec 8, 2025
@pang-wu pang-wu force-pushed the pang/spark4 branch 7 times, most recently from 21be2c9 to 1f04b26 Compare February 16, 2026 17:47
@pang-wu pang-wu changed the title Support Spark 4.0.0 Support Spark 4.x Feb 16, 2026
@rexminnis (Contributor) left a comment
Thanks for putting this together — the CommandLineUtilsBridge pattern and the SparkSubmit rework are clean solutions to the cross-version API drift. A few things I noticed:

  1. Bug: spark340/SparkSqlUtils.toArrowRDD has infinite recursion (see inline comment)
  2. Java target: maven.compiler.source is still 1.8 — worth bumping to 17?
  3. Spark version: spark410.version targets 4.1.0 — consider 4.1.1 (current release)

Happy to help with testing or any of the shim work. I have a working Spark 4.1.1 setup locally and have been validating the Arrow conversion paths end-to-end.

ArrowUtils.toArrowSchema(schema = schema, timeZoneId = timeZoneId)
}

def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
Contributor

Bug — this is infinitely recursive. SparkSqlUtils.toArrowRDD calls itself:

def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
    SparkSqlUtils.toArrowRDD(dataFrame, dataFrame.sparkSession)
}

This will StackOverflowError at runtime. Should be dataFrame.toArrowBatchRdd like the other shims (spark322, spark330, spark350).
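For reference, the suggested fix is a one-line delegation. This is a sketch mirroring the other shims; it assumes `toArrowBatchRdd` is the package-private `Dataset` method Spark exposes within `org.apache.spark.sql`, which is why the shim must live under that package:

```scala
// Sketch of the fix, mirroring the spark322/spark330/spark350 shims:
// delegate to Spark's built-in Arrow conversion instead of recursing.
def toArrowRDD(dataFrame: DataFrame, sparkSession: SparkSession): RDD[Array[Byte]] = {
  // toArrowBatchRdd converts the DataFrame into an RDD of serialized Arrow record batches
  dataFrame.toArrowBatchRdd
}
```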

@pang-wu (Collaborator, Author) commented Feb 17, 2026
good catch, fixed.

@@ -29,9 +31,9 @@
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
Contributor
The Maven compiler source/target is still 1.8. Since Spark 4.x requires Java 17 at runtime and CI now uses JDK 17, should we bump the compile target to 17 as well? This would catch any bytecode-level incompatibilities at compile time rather than runtime.
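A sketch of what that bump could look like in the parent pom.xml properties (assumes the standard Maven compiler conventions; with maven-compiler-plugin 3.6+, the single `release` property is usually preferable to separate `source`/`target`):

```xml
<properties>
  <!-- Spark 4.x requires Java 17 at runtime; compile against 17 as well
       so bytecode-level incompatibilities surface at build time -->
  <maven.compiler.release>17</maven.compiler.release>
</properties>
```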

<spark340.version>3.4.0</spark340.version>
<spark350.version>3.5.0</spark350.version>
<spark400.version>4.0.0</spark400.version>
<spark410.version>4.1.0</spark410.version>
Contributor

Minor: spark410.version is 4.1.0 — worth bumping to 4.1.1 (current release)? The SparkShimProvider already covers it at runtime, but compiling against the latest patch would catch any API changes at build time.

Collaborator (Author)

I would keep it at 4.1.0. The idea is that we should support the minimum API from the initial version; otherwise the lib might introduce breaking changes between Spark's patch versions (Spark is supposed to be backward compatible across patch versions).
