Conversation

@huan233usc (Collaborator) commented on Oct 9, 2025:

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR refactors the Delta Spark build system to support a modular architecture that separates the V1 and V2 implementations while still publishing a single delta-spark jar that serves both the V1 and V2 connectors, with the public entry points DeltaCatalog and DeltaSparkSessionExtension unchanged.

Architecture Changes

Module Structure (a rough build.sbt wiring sketch follows this list):

  • sparkV1 (internal): Delta Spark V1 implementation (production code only, no tests)
  • sparkV1Shaded (internal): V1 without the DeltaLog/Snapshot/OptimisticTransaction classes, used to prevent the V2 connector from accidentally depending on the legacy V1 representation of the Delta log
  • sparkV2 (internal): Kernel-based Delta Spark implementation (formerly kernelSpark)
  • spark (combined): Final published module that merges V1 + V2 + storage into delta-spark.jar
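For orientation, a rough build.sbt sketch of how these modules could wire together (module names follow the list above; the file paths and the exact definitions in this PR differ in detail):

// Sketch only: illustrative wiring, not the exact definitions in this PR.
lazy val sparkV1 = (project in file("spark"))                  // V1 production code, no tests
lazy val sparkV1Shaded = (project in file("spark-v1-shaded"))  // V1 minus DeltaLog/Snapshot/OptimisticTransaction
lazy val sparkV2 = (project in file("kernel-spark"))
  .dependsOn(sparkV1Shaded)                                    // Kernel-based V2; cannot see the legacy DeltaLog
lazy val spark = (project in file("spark-combined"))
  .dependsOn(sparkV1, sparkV2)                                 // published delta-spark jar merges V1 + V2 + storage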

Detailed Changes

Renamed the old catalog plugin (see the sketch after this list):

  • Introduced Abstract* base classes in sparkV1 and Legacy* subclasses for backward compatibility
  • New unified classes in spark-combined:
    • DeltaCatalog extends AbstractDeltaCatalog
    • DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
      This keeps the public entry points DeltaCatalog and DeltaSparkSessionExtension unchanged.
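A minimal sketch of the entry-point layering (base-class bodies are stubs here, not the real registration and delegation logic):

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension

// In sparkV1 (internal): the core logic lives in Abstract* base classes.
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    // ... register Delta rules (V1 here; the combined module layers in V2) ...
  }
}
class AbstractDeltaCatalog extends DelegatingCatalogExtension

// In spark (combined): the public entry points keep their original names.
class DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
class DeltaCatalog extends AbstractDeltaCatalog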

Single Jar rules all connectors:

  • Internal modules (sparkV1, sparkV2, sparkV1Shaded) are marked with skipReleaseSettings and are not published to Maven
  • spark (combined) uses packageBin / mappings to merge classes from the internal modules (a rough sketch follows this list)
  • Duplicate class detection is added to prevent overlapping code
  • POM post-processing excludes the internal module dependencies
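A hedged sketch of the jar-merge step (only packageBin / mappings is taken from the description above; the duplicate check shown here is illustrative, and the real build also post-processes the POM):

// Sketch only: merge the internal modules' class files into the published delta-spark jar.
Compile / packageBin / mappings ++= {
  val v1 = (sparkV1 / Compile / packageBin / mappings).value
  val v2 = (sparkV2 / Compile / packageBin / mappings).value
  val merged = v1 ++ v2
  // Duplicate class detection: fail the build if two modules ship the same jar entry.
  val dupes = merged.groupBy(_._2).filter(_._2.size > 1).keys
  require(dupes.isEmpty, s"Duplicate classes across modules: ${dupes.mkString(", ")}")
  merged
}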

Test Configuration:

  • Tests run in the spark (combined) module, with Test / baseDirectory pointing to the spark/ directory (a small sketch follows this list)
  • Tests for both the V1 and V2 connectors can be run from there
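A small sketch of the relevant settings (paths are illustrative; the actual build.sbt excerpt under review further down shows the full test source-directory remapping):

// Sketch only: run tests from the combined module against the original spark/ layout.
Test / baseDirectory := (ThisBuild / baseDirectory).value / "spark",
Test / unmanagedSourceDirectories +=
  (ThisBuild / baseDirectory).value / "spark" / "src" / "test" / "scala",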

How was this patch tested?

All existing unit tests pass.

python3 run-integration-tests.py --unity-catalog-commit-coordinator-integration-tests produces results matching the master branch.

Does this PR introduce any user-facing changes?

No. This is an internal build system refactoring with no user-facing changes.

build.sbt Outdated
// Module 3: delta-spark-v2 (kernel-spark based, depends on v1-shaded)
// ============================================================
lazy val `delta-spark-v2` = (project in file("kernel-spark"))
  .dependsOn(`delta-spark-v1-shaded`) // Only depends on shaded v1 (no DeltaLog)
Contributor:
exactly what i had in mind.

build.sbt Outdated
Comment on lines 680 to 698
// Test sources point to original spark/src/test/ (no file movement)
Test / unmanagedSourceDirectories ++= Seq(
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala",
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "java"
),
Test / unmanagedResourceDirectories +=
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "resources",

// Include spark-version-specific test sources
Test / unmanagedSourceDirectories ++= {
  val sparkVer = sparkVersion.value
  if (sparkVer.startsWith("3.5")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-3.5")
  } else if (sparkVer.startsWith("4.0")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-master")
  } else {
    Seq.empty
  }
},
Contributor:
it might be simpler to actually move the files, but this is okay for the first cut.

build.sbt Outdated
// ============================================================
// Module 1: delta-spark-v1 (prod code only, no tests)
// ============================================================
lazy val `delta-spark-v1` = (project in file("spark"))
Contributor (@tdas, Oct 10, 2025):
why are you using "`" (backticks)?
why not just name it sparkV1 (it's all already Delta)?

Contributor:
and fundamentally... this is Scala, and these are Scala variables; nobody uses "-" in variable names :)

Collaborator Author:
Ah I see, will rename
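For illustration, the rename being discussed might look roughly like this (a sketch only, not the final build definitions):

// Before: hyphenated name needs backticks because '-' is not a legal Scala identifier character.
lazy val `delta-spark-v1` = (project in file("spark"))
// After: plain camelCase name, no backticks (the delta- prefix dropped since it's all Delta).
lazy val sparkV1 = (project in file("spark"))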

build.sbt Outdated
// ============================================================
lazy val spark = (project in file("spark-combined"))
  .dependsOn(`delta-spark-shaded`) // Direct dependency on shaded (for delegation classes)
  .dependsOn(`delta-spark-v1` % "test->test") // Test utilities from v1
Contributor:
why do you need this? v1 has no tests.

build.sbt Outdated
Comment on lines 579 to 581
// This module contains delegation code like:
// - DeltaCatalog (delegates to V1 or V2)
// - DeltaSparkSessionExtension (registers both)
Contributor:
do we really have to shade anything?

Collaborator Author:
These comments, and many others, are stale; I will fix them (after fixing the tests) and let you know.

@huan233usc changed the title from [POC]New spark structure to [WIP][POC]New spark structure on Oct 10, 2025
@huan233usc requested a review from tdas on October 24, 2025 at 06:56
@huan233usc self-assigned this on Oct 24, 2025
@huan233usc (Collaborator Author) commented on Oct 25, 2025:

Hi @tdas @gengliangwang @scottsand-db -- please take another look when you have a chance, thanks

*/
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
// V1 entry point class for backwards compatibility with Spark V1 APIs.
class V1DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
Contributor:
why V1ClassName, and not ClassNameV1?
we have been following the pattern of putting the version at the end, haven't we? e.g. DeltaTableV2

@@ -80,7 +80,11 @@ import org.apache.spark.sql.internal.SQLConf
*
* @since 0.4.0
*/
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
// V1 entry point class for backwards compatibility with Spark V1 APIs.
Contributor:
the docs above all refer to DeltaSparkSessionExtension, with very nice CUJs, whereas this class has a different name.

I think the right thing to do is to move this entire doc to the new DeltaSparkSessionExtension class, along with the explanation of the delegation logic.

And this V1DeltaSparkSessionExtension class should have docs that ask users to see the docs on DeltaSparkSessionExtension.

With that, there is one central place that documents this whole thing, and the other places point to that single source of truth.

class V1DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension

// Abstract base class that contains the core Delta Spark Session extension logic.
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
Contributor:
also .. the file name should be changed to match the class name

*/
class DeltaCatalog extends DelegatingCatalogExtension
// V1 entry point class for backwards compatibility with Spark V1 APIs.
class V1DeltaCatalog extends AbstractDeltaCatalog
Contributor (@tdas, Oct 27, 2025):
same comment about ClassNameVx as i did for the extension. https://github.com/delta-io/delta/pull/5320/files#r2466822707

Contributor:
also same comment about docs and file name as i did here -
https://github.com/delta-io/delta/pull/5320/files#r2466825844

# Delta Spark Unified Module

This module contains the final, published `delta-spark` JAR that unifies both:
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
Contributor:
Suggested change
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
- **V1 (hybrid DSv1 and DSv2)**: The traditional Delta Lake connector with full `DeltaLog` support

Contributor:
isn't that true? since the catalog is already V2... right?


This module contains the final, published `delta-spark` JAR that unifies both:
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
- **V2 (DSv2)**: The new Kernel-backed connector for improved performance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **V2 (DSv2)**: The new Kernel-backed connector for improved performance
- **V2 (pure DSv2)**: The new Kernel-backed connector for improved performance

**Published module**:
- `delta-spark` (this module) - contains merged classes from all internal modules

## Build
Contributor (@tdas, Oct 27, 2025):
this page is great.

in a follow-up PR, as I said offline, we should move all the Spark stuff into one top-level dir:

spark/
    - v1/
    - v2/
    - unified/
    - README.md (move this README to this top-level dir)

catalogImplConfig, classOf[DeltaCatalog].getName,
classOf[DeltaSparkSessionExtension].getName,
catalogImplConfig, classOf[DeltaCatalog].getName),
messageParameters = Array(classOf[AbstractDeltaSparkSessionExtension].getName,
Contributor:
What will this produce? Earlier it produced the names DeltaSparkSessionExtension and DeltaCatalog, but now won't this produce Abstract***?
If so, we don't want to expose that in the error message, since the user CUJ must not change... right?


checkError(e, "DELTA_CONFIGURE_SPARK_SESSION_WITH_EXTENSION_AND_CATALOG", "56038", Map(
"sparkSessionExtensionName" -> classOf[DeltaSparkSessionExtension].getName,
"sparkSessionExtensionName" -> classOf[AbstractDeltaSparkSessionExtension].getName,
Contributor:
actually, I think this test is wrong. If the goal of this test is to verify that the error message is as expected, it should hardcode the exact class name string that is expected, so that the test catches any unintended change.

please find out from delta master what this string should be (ideally the same as the documented full class name in the CUJ) and hard-code that here. Only then will we be sure that we are never breaking any error message tied to the CUJ.
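For illustration, the hard-coded form being asked for might look roughly like this (a sketch only; the exact expected string must be confirmed against delta master and the documented CUJ):

// Sketch only: hard-code the documented, user-facing class name instead of classOf[...].getName,
// so that any rename of the entry point breaks this test.
checkError(e, "DELTA_CONFIGURE_SPARK_SESSION_WITH_EXTENSION_AND_CATALOG", "56038", Map(
  "sparkSessionExtensionName" -> "io.delta.sql.DeltaSparkSessionExtension",
  // ... remaining expected message parameters unchanged ...
))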

.master("local[*]")
.appName("SparkKernelDsv2Tests")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
Contributor:
why does this need to change? This is in kernel-spark, so it should have access to the new DeltaSparkSessionExtension... isn't it?

same for the DeltaCatalog below

Collaborator Author:
kernel-spark/ is the location for sparkV2, while DeltaSparkSessionExtension lives in spark-unified/ (so it can access both the V1 and V2 code), so kernel-spark/ cannot access DeltaSparkSessionExtension.

We will need to use DeltaSparkSessionExtensionV1 here for the test.

.set("spark.sql.catalog.dsv2", "io.delta.kernel.spark.catalog.TestCatalog")
.set("spark.sql.catalog.dsv2.base_path", tempDir.getAbsolutePath())
.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.set("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
Contributor:
same comment as below.

.set("spark.sql.catalog.dsv2", "io.delta.kernel.spark.catalog.TestCatalog")
.set("spark.sql.catalog.dsv2.base_path", tempDir.getAbsolutePath())
.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.set("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
Contributor:
same comment as below.
