[Spark] Refactor Spark project structure to combine both the DSv1 connector and the Kernel-backed DSv2 connector #5320
base: master
Conversation
build.sbt
Outdated
// Module 3: delta-spark-v2 (kernel-spark based, depends on v1-shaded)
// ============================================================
lazy val `delta-spark-v2` = (project in file("kernel-spark"))
  .dependsOn(`delta-spark-v1-shaded`) // Only depends on shaded v1 (no DeltaLog)
Exactly what I had in mind.
build.sbt
Outdated
// Test sources point to original spark/src/test/ (no file movement)
Test / unmanagedSourceDirectories ++= Seq(
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala",
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "java"
),
Test / unmanagedResourceDirectories +=
  baseDirectory.value.getParentFile / "spark" / "src" / "test" / "resources",

// Include spark-version-specific test sources
Test / unmanagedSourceDirectories ++= {
  val sparkVer = sparkVersion.value
  if (sparkVer.startsWith("3.5")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-3.5")
  } else if (sparkVer.startsWith("4.0")) {
    Seq(baseDirectory.value.getParentFile / "spark" / "src" / "test" / "scala-spark-master")
  } else {
    Seq.empty
  }
},
It might be simpler to actually move the files, but this is okay for the first cut.
build.sbt
Outdated
// ============================================================
// Module 1: delta-spark-v1 (prod code only, no tests)
// ============================================================
lazy val `delta-spark-v1` = (project in file("spark"))
Why are you using "`" (backticks)?
Why not just name it sparkV1 (it's all already Delta)?
And fundamentally, this is Scala and these are Scala variables; nobody uses - in variable names :)
Ah I see, will rename
build.sbt
Outdated
// ============================================================
lazy val spark = (project in file("spark-combined"))
  .dependsOn(`delta-spark-shaded`) // Direct dependency on shaded (for delegation classes)
  .dependsOn(`delta-spark-v1` % "test->test") // Test utilities from v1
Why do you need this? v1 has no tests.
build.sbt
Outdated
// This module contains delegation code like:
// - DeltaCatalog (delegates to V1 or V2)
// - DeltaSparkSessionExtension (registers both)
Do we really have to shade anything?
These comments and many others are stale; I will fix them (after fixing the tests) and let you know.
kernel-spark/src/main/java/io/delta/kernel/spark/read/SparkBatch.java
Outdated
Hi @tdas @gengliangwang @scottsand-db -- please take another look when you have a chance, thanks
 */
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
// V1 entry point class for backwards compatibility with Spark V1 APIs.
class V1DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
Why V1ClassName and not ClassNameV1?
We have been following the pattern of putting the version at the end, haven't we? E.g., DeltaTableV2.
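Following that convention, the renamed entry points would presumably look something like this (names are illustrative, mirroring the author's later note about DeltaSparkSessionExtensionV1):

```scala
// Hypothetical names following the existing <ClassName>V<n> convention.
class DeltaSparkSessionExtensionV1 extends AbstractDeltaSparkSessionExtension
class DeltaCatalogV1 extends AbstractDeltaCatalog
```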
@@ -80,7 +80,11 @@ import org.apache.spark.sql.internal.SQLConf
 *
 * @since 0.4.0
 */
class DeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
// V1 entry point class for backwards compatibility with Spark V1 APIs.
The docs above all refer to DeltaSparkSessionExtension, with very nice CUJs, whereas this class has a different name.
I think the right thing to do is to move this entire doc to the new DeltaSparkSessionExtension class, along with the explanation of the delegation logic.
Then this V1DeltaSparkSessionExtension class should have docs that point users to the docs on DeltaSparkSessionExtension.
With that, there is one central place that documents the whole thing, and other places point to that single source of truth.
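A rough sketch of the layout being suggested (scaladoc abbreviated; the delegation wording and file placement are illustrative):

```scala
// DeltaSparkSessionExtension.scala (unified module): single source of truth for docs.
/**
 * An extension for Spark SQL that activates Delta's parsing, analysis, and planning
 * rules. (The full user-facing docs and CUJs move here, together with an explanation
 * of how the extension delegates between the V1 and V2 code paths.)
 */
class DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension

// V1DeltaSparkSessionExtension.scala (V1 module): kept for backwards compatibility.
/** @see [[io.delta.sql.DeltaSparkSessionExtension]] for documentation. */
class V1DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
```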
class V1DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension

// Abstract base class that contains the core Delta Spark Session extension logic.
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
Also, the file name should be changed to match the class name.
 */
class DeltaCatalog extends DelegatingCatalogExtension
// V1 entry point class for backwards compatibility with Spark V1 APIs.
class V1DeltaCatalog extends AbstractDeltaCatalog
Same comment about ClassNameVx as I made for the extension: https://github.com/delta-io/delta/pull/5320/files#r2466822707
Also the same comment about docs and the file name as I made here:
https://github.com/delta-io/delta/pull/5320/files#r2466825844
spark-unified/README.md
Outdated
# Delta Spark Unified Module

This module contains the final, published `delta-spark` JAR that unifies both:
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
Suggested change: replace
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
with
- **V1 (hybrid DSv1 and DSv2)**: The traditional Delta Lake connector with full `DeltaLog` support
Isn't that true, since the catalog is already V2, right?
spark-unified/README.md
Outdated
This module contains the final, published `delta-spark` JAR that unifies both:
- **V1 (DSv1)**: The traditional Delta Lake connector with full `DeltaLog` support
- **V2 (DSv2)**: The new Kernel-backed connector for improved performance
Suggested change: replace
- **V2 (DSv2)**: The new Kernel-backed connector for improved performance
with
- **V2 (pure DSv2)**: The new Kernel-backed connector for improved performance
**Published module**:
- `delta-spark` (this module) - contains merged classes from all internal modules

## Build
This page is great.
In a follow-up PR, as I said offline, we should put all the Spark stuff under one top-level dir:
spark/
- v1/
- v2/
- unified/
- README.md (move this README to that top-level dir)
  catalogImplConfig, classOf[DeltaCatalog].getName,
  classOf[DeltaSparkSessionExtension].getName,
  catalogImplConfig, classOf[DeltaCatalog].getName),
messageParameters = Array(classOf[AbstractDeltaSparkSessionExtension].getName,
What will this produce? Earlier it produced the names DeltaSparkSessionExtension and DeltaCatalog, but now won't this produce Abstract***?
If so, we don't want to expose that in the error message, since the user CUJ must not change, right?
checkError(e, "DELTA_CONFIGURE_SPARK_SESSION_WITH_EXTENSION_AND_CATALOG", "56038", Map(
  "sparkSessionExtensionName" -> classOf[DeltaSparkSessionExtension].getName,
  "sparkSessionExtensionName" -> classOf[AbstractDeltaSparkSessionExtension].getName,
Actually, I think this test is wrong. If the goal of this test is to verify that the error message is as expected, it should hardcode the exact class name string that is expected, so that the test catches regressions.
Please find out from delta master what this string should be (ideally the same as the documented full class name in the CUJ) and hardcode it here. Only then will we be sure that we never break an error message tied to a CUJ.
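For instance, the expectation could pin the documented entry-point name directly (only the extension parameter is shown; the class name string is taken from the configs used elsewhere in this PR):

```scala
checkError(e, "DELTA_CONFIGURE_SPARK_SESSION_WITH_EXTENSION_AND_CATALOG", "56038", Map(
  // Hardcode the documented class name instead of classOf[...].getName, so the test
  // fails if the user-facing error message ever changes.
  // (Other expected message parameters omitted for brevity.)
  "sparkSessionExtensionName" -> "io.delta.sql.DeltaSparkSessionExtension"))
```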
  .master("local[*]")
  .appName("SparkKernelDsv2Tests")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
Why does this need to change? This is in kernel-spark, so it should have access to the new DeltaSparkSessionExtension, shouldn't it?
Same for the DeltaCatalog below.
kernel-spark/ is the location for SparkV2; DeltaSparkSessionExtension is in spark-unified/ (which has access to both the V1 and V2 code), so kernel-spark/ cannot access DeltaSparkSessionExtension.
We will need to use DeltaSparkSessionExtensionV1 here for the test.
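So the kernel-spark test setup ends up registering the V1 entry points directly, roughly like this (the catalog class's package and exact V1 name are assumptions based on the rename discussed above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkKernelDsv2Tests")
  // kernel-spark cannot depend on the unified module, so its tests register the V1
  // entry points instead of DeltaSparkSessionExtension / DeltaCatalog.
  .config("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.V1DeltaCatalog") // assumed package and name
  .getOrCreate()
```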
  .set("spark.sql.catalog.dsv2", "io.delta.kernel.spark.catalog.TestCatalog")
  .set("spark.sql.catalog.dsv2.base_path", tempDir.getAbsolutePath())
  .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .set("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
same comment as below.
  .set("spark.sql.catalog.dsv2", "io.delta.kernel.spark.catalog.TestCatalog")
  .set("spark.sql.catalog.dsv2.base_path", tempDir.getAbsolutePath())
  .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .set("spark.sql.extensions", "io.delta.sql.V1DeltaSparkSessionExtension")
same comment as below.
Which Delta project/connector is this regarding?
Description
This PR refactors the Delta Spark build system to support a modular architecture that separates the V1 and V2 implementations while still publishing a single `delta-spark` jar that can use both the V1 and V2 connectors, with the public entry points `DeltaCatalog` and `DeltaSparkSessionExtension` unchanged.

Architecture Changes
Module Structure:
- `sparkV1` (internal): Delta Spark V1 implementation (production code only, no tests)
- `sparkV1Shaded` (internal): V1 without the `DeltaLog`/`Snapshot`/`OptimisticTransaction` classes, used to keep the V2 connector from accidentally depending on the legacy V1 representation of the Delta log
- `sparkV2` (internal): Kernel-based Delta Spark implementation (formerly `kernelSpark`)
- `spark` (combined): final published module that merges V1 + V2 + storage into `delta-spark.jar` (see the sketch after this list)
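For orientation, a simplified sketch of how these modules could be wired in build.sbt (module names from the list above; directory paths and all other settings are abbreviated and illustrative, not the actual build definition):

```scala
// build.sbt (sketch only): the module graph described above.
lazy val sparkV1 = (project in file("spark"))                  // prod code only, no tests
lazy val sparkV1Shaded = (project in file("spark-v1-shaded"))  // path is illustrative
  .dependsOn(sparkV1)          // V1 minus DeltaLog / Snapshot / OptimisticTransaction
lazy val sparkV2 = (project in file("kernel-spark"))
  .dependsOn(sparkV1Shaded)    // V2 cannot reach the legacy V1 log classes
lazy val spark = (project in file("spark-combined"))           // published as delta-spark
  .dependsOn(sparkV1 % "test->test")                           // reuses the V1 test suites
```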
Detailed Changes

Rename old catalog plugin:
- `Abstract*` base classes live in `sparkV1`, with `Legacy*` subclasses kept for backward compatibility
- In `spark-combined`: `DeltaCatalog` extends `AbstractDeltaCatalog`, and `DeltaSparkSessionExtension` extends `AbstractDeltaSparkSessionExtension`

This keeps the public entry points `DeltaCatalog` and `DeltaSparkSessionExtension` unchanged.
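A condensed sketch of the resulting hierarchy (imports shown for context; bodies abbreviated; the catalog's parent class is inferred from the V1DeltaCatalog diff above):

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension

// In the V1 module: the core logic lives in the Abstract* base classes.
class AbstractDeltaCatalog extends DelegatingCatalogExtension { /* ... */ }
class AbstractDeltaSparkSessionExtension extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = { /* register rules */ }
}

// In the combined module: the public entry points keep their existing names.
class DeltaCatalog extends AbstractDeltaCatalog
class DeltaSparkSessionExtension extends AbstractDeltaSparkSessionExtension
```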
Single Jar rules all connectors:
- All internal modules (`sparkV1`, `sparkV2`, `sparkV1Shaded`) are marked with `skipReleaseSettings` and are not published to Maven
- `spark` (combined) uses `packageBin / mappings` to merge classes from the internal modules (a sketch follows this list)
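Roughly, that merge can be expressed with a mappings override along these lines (a sketch only; the real build will likely need de-duplication and filtering of overlapping entries):

```scala
// In the combined `spark` module's settings: fold the class files of the internal
// modules into the single published delta-spark jar.
Compile / packageBin / mappings ++=
  (sparkV1 / Compile / packageBin / mappings).value ++
  (sparkV2 / Compile / packageBin / mappings).value
```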
Test Configuration:
- Tests run in the `spark` (combined) module, with `Test / baseDirectory` pointing to the `spark/` directory

How was this patch tested?
All existing unit tests pass
`python3 run-integration-tests.py --unity-catalog-commit-coordinator-integration-tests` produces results matching the master branch
Does this PR introduce any user-facing changes?
No user-facing changes. This is an internal build system refactoring:
No