
feat(spark): implement batch job support via SparkApplication CRD (KEP-107 Phase 1)#386

Open
ghazariann wants to merge 13 commits into kubeflow:main from ghazariann:feature/spark-batch-jobs

Conversation

@ghazariann
Contributor

@ghazariann ghazariann commented Mar 15, 2026

feat(spark): implement batch job support via SparkApplication CRD (KEP-107 Phase 1)

Summary

Implements the batch job portion of Phase 1 from the Spark Client KEP. The connect() API (SparkConnect sessions) was already merged in #225. This PR
adds the batch job layer on top of the same SparkClient and KubernetesBackend.

What's added

  • Batch job types: SparkJob dataclass and SparkJobStatus enum
  • ImageLoader abstraction with KindImageLoader for Kind clusters
  • KubernetesBackend batch methods: submit_job, get_job, list_jobs, delete_job, get_job_logs, wait_for_job
  • SparkClient exposes all batch methods as public API
  • Examples: spark_job_simple.py + wordcount.py
  • E2E test for the batch job example
  • RBAC: extended e2e role to cover sparkapplications

Intentionally deferred (Phase 2)

  • Extended submit_job parameters: driver/executor resources, spark conf, image, service account, etc. (consistent with connect() API)
  • Loaders beyond KindImageLoader
  • Remote URI main files (s3://, gs://, etc.)
  • submit_job(func=...): function mode

How it works

client = SparkClient()
job_name = client.submit_job(main_file="etl.py", arguments=["--date", "2024-01-15"])
job = client.wait_for_job_status(job_name, status={SparkJobStatus.COMPLETED, SparkJobStatus.FAILED}, timeout=300)
print(job.status)

Under the hood submit_job does:

  1. Dynamically builds a Docker image on top of the base Spark image using the Docker
    client API (discussed in #107)
  2. Loads the image into the Kind cluster
  3. Submits a SparkApplication CR to the Spark Operator v1beta2 API

Test plan

  • SPARK_TEST_CLUSTER=spark-test SPARK_TEST_NAMESPACE=spark-test uv run pytest test/e2e/spark/test_spark_examples.py::TestSparkJobExamples -v
  • CI: .github/workflows/test-spark-examples.yaml runs the full file including new TestSparkJobExamples

Ref: #107
KEP: docs/proposals/107-spark-client/README.md

Vahagn and others added 2 commits March 11, 2026 12:23
Signed-off-by: Vahagn <vghazaryan@cloudlinux.com>
Signed-off-by: ghazariann <vahagn.ghazayan@gmail.com>
Copilot AI review requested due to automatic review settings March 15, 2026 14:42
@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kramaranya for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


Copilot AI left a comment


Pull request overview

Adds Spark batch job support to the Kubeflow Spark SDK by submitting SparkApplication (v1beta2) CRDs, including image build/load helpers, public SparkClient APIs, examples, and e2e coverage updates.

Changes:

  • Introduces SparkJob / SparkJobStatus types and batch-job CRUD + wait + logs methods on the Kubernetes backend and SparkClient.
  • Adds an ImageLoader abstraction with a KindImageLoader implementation to load locally-built images into Kind.
  • Adds batch-job examples (spark_job_simple.py, wordcount.py), a new e2e test, workflow deps, and RBAC updates.
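
The loader abstraction described above might look roughly like this sketch. Only the names `ImageLoader` and `KindImageLoader` come from the PR; the method names, the `command` helper, and the default cluster name are assumptions (`kind load docker-image` itself is the standard Kind CLI command for this):

```python
import subprocess
from abc import ABC, abstractmethod


class ImageLoader(ABC):
    """Loads a locally built image into a cluster's container runtime."""

    @abstractmethod
    def load(self, image_tag: str) -> None: ...


class KindImageLoader(ImageLoader):
    def __init__(self, cluster_name: str = "kind"):
        self.cluster_name = cluster_name

    def command(self, image_tag: str) -> list[str]:
        # `kind load docker-image` copies the image from the local Docker
        # daemon into every node of the named Kind cluster.
        return ["kind", "load", "docker-image", image_tag, "--name", self.cluster_name]

    def load(self, image_tag: str) -> None:
        subprocess.run(self.command(image_tag), check=True)
```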

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
test/e2e/spark/test_spark_examples.py Adds e2e coverage for the new batch-job example.
kubeflow/trainer/types/types.py Fixes ignore-pattern docstring to match expected glob format.
kubeflow/trainer/constants/constants.py Fixes default ignore patterns to use *.pt/*.pth globs.
kubeflow/spark/types/types.py Adds batch job dataclass + status enum for SparkApplication jobs.
kubeflow/spark/image/loaders.py Implements image loading abstraction + Kind loader (others stubbed).
kubeflow/spark/image/__init__.py Exposes image-loader utilities as a package API.
kubeflow/spark/backends/kubernetes/utils.py Adds job naming, image build helper, SparkApplication CR builder + CR→SDK conversion.
kubeflow/spark/backends/kubernetes/constants.py Adds SparkApplication API constants and job/log-related constants.
kubeflow/spark/backends/kubernetes/backend.py Implements submit/get/list/delete/logs/wait for SparkApplication batch jobs.
kubeflow/spark/backends/base.py Extends backend interface with batch-job abstract methods.
kubeflow/spark/api/spark_client.py Adds public SparkClient batch-job APIs.
kubeflow/spark/__init__.py Re-exports batch job types and image loader types.
hack/e2e-setup-cluster.sh Extends RBAC permissions to cover SparkApplications and renames the role.
examples/spark/wordcount.py Adds a simple PySpark script used by the batch-job example.
examples/spark/spark_job_simple.py Adds a batch job example demonstrating submit/wait/logs/delete.
.github/workflows/test-spark-examples.yaml Installs docker extra for CI runs that build images.


Comment on lines +374 to +379
try:
    docker_sdk.from_env().images.get(image_tag)
    logger.info("Image '%s' already exists for same file content, reusing it", image_tag)
    return image_tag
except Exception:
    pass  # Image not found locally, build it
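
The "same file content" reuse check above implies the image tag is derived deterministically from the script's bytes. A hypothetical sketch of such content-addressed tagging (the helper name and tag format are invented for illustration):

```python
import hashlib
from pathlib import Path


def image_tag_for(main_file: str, prefix: str = "spark-job") -> str:
    # Hypothetical helper: hash the script's bytes so an unchanged file maps
    # to the same tag, letting an already-built image be reused.
    digest = hashlib.sha256(Path(main_file).read_bytes()).hexdigest()[:12]
    return f"{prefix}:{digest}"
```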
Comment on lines +221 to +228
def test_spark_job_simple_example(self):
    """EX04: Validate spark_job_simple.py submits and completes batch jobs."""
    namespace = os.environ.get("SPARK_TEST_NAMESPACE", "spark-test")
    returncode, stdout, stderr, _ = _run_example_with_watcher(
        EXAMPLES_DIR / "spark_job_simple.py",
        namespace,
        timeout_sec=EXAMPLE_TIMEOUT_SEC,
    )
Comment on lines +880 to +883
except multiprocessing.TimeoutError:
    yield f"[error] Timeout reading logs for {self.namespace}/{name}"
except Exception as e:
    yield f"[error] Failed to read pod logs for {self.namespace}/{name}: {e}"
Comment on lines +223 to +229
def get_job_logs(self, name: str, container: str = "spark-kubernetes-driver", follow: bool = False) -> Iterator[str]:
    """Get logs from a batch Spark job pod.

    Args:
        name: Job name.
        container: Container name to read logs from. Default ``"spark-kubernetes-driver"``.
        follow: If True, stream logs continuously.
Comment on lines +634 to +640
def submit_job(
    self,
    main_file: str,
    name: str | None = None,
    arguments: list[str] | None = None,
    loader: ImageLoader | None = None,
) -> SparkJob:
Comment on lines +146 to +166
def submit_job(
    self,
    func: Callable[[SparkSession], Any] | None = None,
    func_args: dict[str, Any] | None = None,
    main_file: str | None = None,
    main_class: str | None = None,
    arguments: list[str] | None = None,
    name: str | None = None,
) -> str:
    """Submit a batch Spark job.

    Supports two modes based on parameters:
    - Function mode: Pass `func` to submit a Python function with Spark transformations.
    - File mode: Pass `main_file` to submit an existing Python/Jar file.

    Args:
        func: Python function that receives SparkSession (function mode).
        func_args: Arguments to pass to the function.
        main_file: Path to Python/Jar file (file mode).
        main_class: Main class for Jar files.
        arguments: Command-line arguments for the job.
Comment on lines +203 to +210
def get_job_logs(
    self,
    name: str,
    container: str = "spark-kubernetes-driver",
    follow: bool = False,
) -> Iterator[str]:
    """Get logs from a Spark job (driver or executor)."""
    return self.backend.get_job_logs(name, container=container, follow=follow)
Comment on lines +848 to +856
job = self.get_job(name)
actual_container = constants.SPARK_CONTAINER_NAME_MAP.get(container, container)

if job.error_message:
    yield f"[operator] {job.error_message}"

if not job.driver_pod_name:
    return

@ghazariann ghazariann force-pushed the feature/spark-batch-jobs branch from 96a6ec4 to ecf520b Compare March 15, 2026 14:49
@ghazariann ghazariann changed the title Feature/spark batch jobs feat(spark): implement batch job support via SparkApplication CRD (KEP-107 Phase 1) Mar 15, 2026
