
feat: add SparkClient API for SparkConnect session management#225

Merged
google-oss-prow[bot] merged 14 commits into kubeflow:main from
Shekharrajak:feat/sparkclient-mvp
Feb 12, 2026

Conversation

@Shekharrajak
Member

@Shekharrajak Shekharrajak commented Jan 13, 2026

Closes #163 as per the KEP.

Features

  • SparkClient API: High-level client for SparkConnect lifecycle management
  • KubernetesSparkBackend: Backend implementation for SparkConnect CRD operations
  • SparkClient.connect(): Get PySpark SparkSession from existing server or auto-create new session

Quick Test

Option 1: Connect to Local Spark Connect Server (No Kubernetes required)

# 1. Start local Spark Connect server
docker run -d --name spark-connect -p 15002:15002 \
  apache/spark:3.5.0 \
  /opt/spark/sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.0
# 2. Wait ~30s for startup, then test
cd sdk
uv run python kubeflow/spark/examples/test_connect_url.py
# 3. Cleanup
docker stop spark-connect && docker rm spark-connect

Option 2: With Kind Cluster + Spark Operator

# 1. Setup Kind cluster with Spark Operator
./scripts/spark/setup-kind.sh

# 2. Run demo
uv run python kubeflow/spark/examples/demo_existing_sparkconnect.py

Examples:

# install inside this branch:  pip install -e ".[spark]" 
# Connect to existing SparkConnect
from kubeflow.spark import SparkClient
spark = SparkClient().connect(url="sc://localhost:15002")
df = spark.range(10)
df.show()

# ---- kubernetes 
from kubeflow.spark import SparkClient
from kubeflow.common.types import KubernetesBackendConfig

client = SparkClient(backend_config=KubernetesBackendConfig(namespace="spark-test"))
# create a new session and connect
spark = client.connect(name="my-session")

Copilot AI review requested due to automatic review settings January 13, 2026 14:22
@github-actions
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

This PR implements a SparkClient API for managing SparkConnect sessions on Kubernetes, enabling high-level lifecycle management and PySpark integration.

Changes:

  • Added SparkClient API with support for connecting to existing servers or auto-creating new sessions
  • Implemented KubernetesSparkBackend for SparkConnect CRD operations
  • Added comprehensive type definitions (Driver, Executor, SparkConnectInfo, SparkConnectState)

Reviewed changes

Copilot reviewed 112 out of 476 changed files in this pull request and generated 1 comment.

File Description
pyproject.toml Added spark extra dependencies (pyspark==3.5.0, grpcio, pandas, pyarrow)
uv.lock Updated lock file with new dependencies and resolution markers
kubeflow/spark/__init__.py Package exports for SparkClient and related types
kubeflow/spark/api/spark_client.py Main SparkClient implementation with connect() method
kubeflow/spark/backends/base.py Abstract base class defining backend interface
kubeflow/spark/backends/kubernetes/backend.py Kubernetes backend implementing SparkConnect CRD operations
kubeflow/spark/backends/kubernetes/utils.py Utility functions for CRD building and validation
kubeflow/spark/backends/kubernetes/constants.py Constants for SparkConnect CRD and defaults
kubeflow/spark/types/types.py Core type definitions (SparkConnectInfo, Driver, Executor, etc.)
kubeflow/spark/examples/*.py Example scripts demonstrating usage
kubeflow/spark/integration/*.py Integration tests with pytest fixtures
scripts/spark/setup-kind.sh Kind cluster setup script for testing
docs/spark-connect-local-testing.md Documentation for local testing

@Shekharrajak
Member Author

Cleaning up commits.

@Shekharrajak
Member Author

from kubeflow.spark import SparkClient

# Connect to localhost after port-forwarding
spark = SparkClient().connect(url="sc://localhost:15002")
df = spark.range(10)
df.show()
(Screenshot: terminal output of the snippet above, 2026-01-16)

@Shekharrajak
Member Author

cd kubeflow/sdk

python3 << 'EOF'
from kubeflow.spark.api.spark_client import SparkClient

print("Creating SparkClient...")
client = SparkClient()

print("Connecting to sc://localhost:15002...")
spark = client.connect(url="sc://localhost:15002")

print("Creating DataFrame...")
df = spark.range(10)

print("Counting rows...")
count = df.count()

print(f"SUCCESS! Count = {count}")
assert count == 10
EOF

Expected output:

Creating SparkClient...
Connecting to sc://localhost:15002...
Creating DataFrame...
Counting rows...
SUCCESS! Count = 10

 Run Full Integration Test Suite

  uv run pytest kubeflow/spark/integration/test_existing_sparkconnect.py -v

  Expected output:
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_connect_to_existing_server PASSED [ 16%]
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_dataframe_operations PASSED [ 33%]
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_sql_operations PASSED [ 50%]
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_multi_column_dataframe PASSED [ 66%]
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_multiple_transformations PASSED [ 83%]
  test_existing_sparkconnect.py::TestExistingSparkConnect::test_groupby_aggregations PASSED [100%]

  ================== 6 passed in 318.59s (0:05:18) ==================

  Step 3: Run Individual Tests (Optional)

  # basic connection
  uv run pytest kubeflow/spark/integration/test_existing_sparkconnect.py::TestExistingSparkConnect::test_connect_to_existing_server -v

  # Test DataFrame operations
  uv run pytest kubeflow/spark/integration/test_existing_sparkconnect.py::TestExistingSparkConnect::test_dataframe_operations -v

  # Test SQL operations
  uv run pytest kubeflow/spark/integration/test_existing_sparkconnect.py::TestExistingSparkConnect::test_sql_operations -v


Member

@andreyvelich andreyvelich left a comment


Thank you for this effort @Shekharrajak! Overall looks good, and I am looking forward to this!
I left my initial comments.


## Option 1: Docker-based Spark Connect Server (Recommended)

### Start Spark Connect Server
Member


Interesting, so our API works even without a Kubernetes cluster?
Shall we re-use this functionality to run SparkClient() with a container backend?
For example, when the user runs connect() with ContainerBackend, we start a Spark container and allow the user to connect to the Spark session.

We can add support for it in the follow up PRs.

Member Author


Right now it is not automated, but as you correctly mentioned, we could add a ContainerBackendConfig:

client = SparkClient(backend_config=ContainerBackendConfig())
spark = client.connect()  # Auto-starts container, connects when ready

Member Author


We can have this enhancement issue open for future implementation. Thanks!

Member

@andreyvelich andreyvelich left a comment


Thanks for the updates @Shekharrajak!
mostly lgtm, I left a few comments.

@@ -0,0 +1,150 @@
# Spark Integration Testing Guide
Member


@kramaranya recently added the docs for the Kubeflow SDK: https://sdk.kubeflow.org/en/latest/
Could you please create a page for Spark there?

Comment on lines +49 to +59
delete_cluster() {
    log_info "Deleting Kind cluster: $CLUSTER_NAME"
    kind delete cluster --name "$CLUSTER_NAME" 2>/dev/null || true
    log_info "Cluster deleted"
}

create_cluster() {
    if kind get clusters 2>/dev/null | grep -q "^${CLUSTER_NAME}$"; then
        log_warn "Cluster '$CLUSTER_NAME' already exists"
        return 0
    fi
Member


Same comment for create and delete.
Do you want to just copy this script and add installation of Spark Operator from the helm charts?

@coveralls

coveralls commented Jan 29, 2026

Pull Request Test Coverage Report for Build 21941935956

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 1115 of 1292 (86.3%) changed or added relevant lines in 18 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+4.3%) to 72.296%

Changes missing coverage (File · Covered Lines · Changed/Added Lines · %):
kubeflow/spark/backends/kubernetes/backend_test.py 214 215 99.53%
kubeflow/spark/types/options.py 88 89 98.88%
kubeflow/spark/api/spark_client.py 27 32 84.38%
kubeflow/spark/backends/base.py 17 23 73.91%
kubeflow/spark/backends/kubernetes/utils.py 95 103 92.23%
kubeflow/spark/types/validation.py 8 58 13.79%
kubeflow/spark/backends/kubernetes/backend.py 159 265 60.0%
Totals:
Change from base Build 21919105330: 4.3%
Covered Lines: 3943
Relevant Lines: 5454

💛 - Coveralls

create_cluster
setup_test_namespace
install_spark_operator
if [[ "${E2E_CRD_ONLY:-0}" == "1" ]]; then
Member Author


Installing the Spark Operator was taking more than 10 minutes and timed out. For now we only have a SparkConnect CRD check and validate CRD + API acceptance (create/get/delete SparkConnect).

We are still missing validation of:

End-to-end Spark session
• client.connect() is not run in CI.
• No check that we get a real SparkSession, that spark.range(), df.count(), and spark.stop() work, or that the SDK’s wait/ready logic works.

Example tests
• test_spark_connect_simple_example and test_spark_advanced_options_example are not run in CI anymore.
• So we don’t validate that examples/spark/spark_connect_simple.py and spark_advanced_options.py succeed against a real cluster.

Still not sure why it takes so long. Also, no Spark Operator Helm chart version with the SparkConnect CRD has been released yet, so I was trying to build from the spark-operator codebase directly.

@Shekharrajak
Member Author

Getting some errors in CI like:

kubectl logs -f  spark-connect-c74264f5-server  -n spark-test 
starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-c74264f5-server.out
Spark Command: /opt/java/openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/* -Xmx1g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --master k8s://https://10.96.0.1:443 --conf spark.driver.port=7078 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/connect-name=spark-connect-c74264f5 --conf spark.driver.bindAddress=0.0.0.0 --conf spark.driver.host=10.244.0.14 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=spark-test --conf spark.kubernetes.container.image=apache/spark:3.5.0 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.podNamePrefix=spark-connect-c74264f5 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/connect-name=spark-connect-c74264f5 --conf spark.kubernetes.driver.pod.name=spark-connect-c74264f5-server --conf spark.executor.instances=2 --conf spark.kubernetes.executor.container.image=apache/spark:3.5.0 --class org.apache.spark.sql.connect.service.SparkConnectServer --name Spark Connect server spark-internal
========================================
26/01/29 19:11:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Error: Failed to load class org.apache.spark.sql.connect.service.SparkConnectServer.
Failed to load main class org.apache.spark.sql.connect.service.SparkConnectServer.
You need to specify Spark Connect jars with --jars or --packages.
26/01/29 19:11:01 INFO ShutdownHookManager: Shutdown hook called
26/01/29 19:11:01 INFO ShutdownHookManager: Deleting directory /tmp/spark-1816c8d5-8944-4e69-abb9-e2bf7ee565a6

and

kubectl logs -f  spark-connect-2e2e1386-server  -n spark-test 
starting org.apache.spark.sql.connect.service.SparkConnectServer, logging to /opt/spark/logs/spark--org.apache.spark.sql.connect.service.SparkConnectServer-1-spark-connect-2e2e1386-server.out
Spark Command: /opt/java/openjdk/bin/java -cp /opt/spark/conf:/opt/spark/jars/* -Xmx1g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --master k8s://https://10.96.0.1:443 --conf spark.driver.port=7078 --conf spark.jars.packages=org.apache.spark:spark-connect_2.12:3.5.0 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/connect-name=spark-connect-2e2e1386 --conf spark.driver.bindAddress=0.0.0.0 --conf spark.driver.host=10.244.0.17 --conf spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.namespace=spark-test --conf spark.kubernetes.container.image=apache/spark:3.5.0 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true --conf spark.kubernetes.executor.podNamePrefix=spark-connect-2e2e1386 --conf spark.kubernetes.executor.label.sparkoperator.k8s.io/connect-name=spark-connect-2e2e1386 --conf spark.kubernetes.driver.pod.name=spark-connect-2e2e1386-server --conf spark.executor.instances=2 --conf spark.kubernetes.executor.container.image=apache/spark:3.5.0 --class org.apache.spark.sql.connect.service.SparkConnectServer --name Spark Connect server spark-internal
========================================
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/spark/.ivy2/cache
The jars for the packages stored in: /home/spark/.ivy2/jars
org.apache.spark#spark-connect_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9f61b3e3-3352-46ff-9132-7e2ea8c08393;1.0
	confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /home/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-9f61b3e3-3352-46ff-9132-7e2ea8c08393-1.0.xml (No such file or directory)
	at java.base/java.io.FileOutputStream.open0(Native Method)
	at java.base/java.io.FileOutputStream.open(Unknown Source)
	at java.base/java.io.FileOutputStream.<init>(Unknown Source)
	at java.base/java.io.FileOutputStream.<init>(Unknown Source)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:71)
	at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:63)
	at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:553)
	at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:184)
	at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:259)
	at org.apache.ivy.Ivy.resolve(Ivy.java:522)
	at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1535)
	at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:334)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:964)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
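The dependency-resolution failures above were eventually worked around (per the SparkConnect spec shown later in this thread) by pointing spark.jars directly at the connect jar instead of relying on --packages/Ivy resolution inside the pod. A hedged sketch of passing that through the SDK, assuming SparkClient.connect() accepts a spark_conf mapping as in the backend signature quoted below:

```python
# Illustration only: ship the Spark Connect plugin jar directly via
# spark.jars so the server pod does not need Ivy/--packages resolution.
spark_conf = {
    "spark.jars": (
        "https://repo1.maven.org/maven2/org/apache/spark/spark-connect_2.12/"
        "3.5.0/spark-connect_2.12-3.5.0.jar"
    ),
}

# Requires a cluster with the Spark Operator installed; sketch only:
# from kubeflow.spark import SparkClient
# spark = SparkClient().connect(name="my-session", spark_conf=spark_conf)

print(spark_conf["spark.jars"])
```

The URL matches the spark.jars value visible in the `kubectl describe sparkconnect` output later in this thread; whether connect() forwards spark_conf verbatim is an assumption based on the backend interface.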

@Shekharrajak
Member Author

Looks like RBAC is needed for various permissions:

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://10.96.0.1:443/api/v1/namespaces/spark-test/pods/spark-connect-9661bdcf-server. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-connect-9661bdcf-server" is forbidden: User "system:serviceaccount:spark-test:default" cannot get resource "pods" in API group "" in the namespace "spark-test".
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:671)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:651)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:597)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:560)
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
	at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
	at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:137)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

@Shekharrajak
Member Author

Shekharrajak commented Jan 29, 2026

In CI the Spark Connect server is stuck forever in Ready state:

kubectl describe  sparkconnect spark-connect-554dd822 -n spark-test 
Name:         spark-connect-554dd822
Namespace:    spark-test
Labels:       <none>
Annotations:  <none>
API Version:  sparkoperator.k8s.io/v1alpha1
Kind:         SparkConnect
Metadata:
  Creation Timestamp:  2026-01-29T19:29:18Z
  Generation:          1
  Resource Version:    12234
  UID:                 33f957ed-f1c9-4534-8043-5d2ab8bc7803
Spec:
  Executor:
    Instances:  2
  Image:        apache/spark:3.5.0
  Server:
  Spark Conf:
    spark.jars:   https://repo1.maven.org/maven2/org/apache/spark/spark-connect_2.12/3.5.0/spark-connect_2.12-3.5.0.jar
  Spark Version:  3.5.0
Status:
  Conditions:
    Last Transition Time:  2026-01-29T19:29:19Z
    Message:               Server pod is ready
    Reason:                ServerPodReady
    Status:                True
    Type:                  ServerPodReady
  Executors:
    Running:         2
  Last Update Time:  2026-01-29T19:29:24Z
  Server:
    Pod Ip:        10.244.0.27
    Pod Name:      spark-connect-554dd822-server
    Service Name:  spark-connect-554dd822-server
  Start Time:      2026-01-29T19:29:18Z
  State:           Ready
Events:            <none>
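The `kubectl describe` output above maps to a fairly small custom resource. As an illustration (field casing reconstructed from the describe output, so treat the exact key names as assumptions), the SDK's Kubernetes backend presumably builds something equivalent to this dict before creating it via the Kubernetes API:

```python
# Hypothetical reconstruction of the SparkConnect CR shown above.
def build_sparkconnect_cr(name: str, namespace: str, num_executors: int = 2) -> dict:
    return {
        "apiVersion": "sparkoperator.k8s.io/v1alpha1",
        "kind": "SparkConnect",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "sparkVersion": "3.5.0",
            "image": "apache/spark:3.5.0",
            "server": {},
            "executor": {"instances": num_executors},
            "sparkConf": {
                "spark.jars": (
                    "https://repo1.maven.org/maven2/org/apache/spark/"
                    "spark-connect_2.12/3.5.0/spark-connect_2.12-3.5.0.jar"
                ),
            },
        },
    }

cr = build_sparkconnect_cr("spark-connect-554dd822", "spark-test")
print(cr["kind"], cr["spec"]["executor"]["instances"])
```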

@Shekharrajak
Member Author

GitHub Spark examples e2e:



    • `spark_connect_simple` (in-cluster sparkconnect created):
      • Level 1: create session (auto name) → client.connect() → spark.range(10).count(), df.show() → spark.stop().
      • Level 2: create session (named my-simple-session-<uuid>) → client.connect() → spark.range(100).count(), df.show(10) → spark.stop().
    • `spark_advanced_options` (in-cluster):
      • One example: create session with labels/annotations → client.connect() → spark.range(100).count() → spark.stop().

SPARK_OPERATOR_IMAGE_TAG: local
timeout-minutes: 15

- name: Build and load Spark E2E runner image (in-cluster)
Member Author


Pytest (outside cluster) → creates Job → Job Pod runs Python example → SparkClient creates SparkConnect CR → Spark Operator creates server +
  executors → backend waits for READY and builds in-cluster URL → Python connects to server, runs DataFrame ops → script exits → Job completes → test
  collects logs and cleans up.
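The "creates Job" step of the flow above can be sketched as a plain manifest. The image tag and runner details here are illustrative stand-ins, not the exact values used in this PR's CI; the script path matches the example named earlier in the thread:

```python
# Hedged sketch: the Job that pytest submits so a pod inside the cluster
# can run a Python example against the operator-managed SparkConnect server.
def build_example_job(name: str, namespace: str, script: str) -> dict:
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "backoffLimit": 0,  # fail fast so the test can collect logs
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "runner",
                            "image": "spark-e2e-runner:local",  # assumed tag
                            "command": ["python", script],
                        }
                    ],
                }
            },
        },
    }

job = build_example_job(
    "spark-simple-e2e", "spark-test", "examples/spark/spark_connect_simple.py"
)
print(job["spec"]["template"]["spec"]["containers"][0]["command"])
```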


@Shekharrajak
Member Author

GitHub checks are looking good now.

…te_and_connect()

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
Member

@andreyvelich andreyvelich left a comment


We should be in good shape to merge this initial version!
Thanks a lot for this work, exciting to see SparkConnect support in Kubeflow SDK @Shekharrajak 🚀
/lgtm

/assign @astefanutti @kramaranya @Fiona-Waters @szaher @tariq-hasan

@google-oss-prow
Contributor

@andreyvelich: GitHub didn't allow me to assign the following users: tariq-hasan.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

We should be in good shape to merge this initial version!
Thanks a lot for this work, exciting to see SparkConnect support in Kubeflow SDK @Shekharrajak 🚀
/lgtm

/assign @astefanutti @kramaranya @Fiona-Waters @szaher @tariq-hasan

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@google-oss-prow google-oss-prow bot added the lgtm label Feb 10, 2026
@astefanutti
Contributor

Thanks @Shekharrajak for this awesome work!

For reference, the follow-up discussion on the ability to pass an arbitrary URL w.r.t. backend addressability: #274

/lgtm

Contributor

@kramaranya kramaranya left a comment


Thanks @Shekharrajak for this incredible work!
/lgtm
I left a few comments/questions but we can open follow up PRs for those :)

Contributor


It would be great to move those examples to the website https://sdk.kubeflow.org/en/latest/examples.html
That could be done in a separate PR though

Comment on lines +30 to +51
@abc.abstractmethod
def connect(
    self,
    num_executors: Optional[int] = None,
    resources_per_executor: Optional[dict[str, str]] = None,
    spark_conf: Optional[dict[str, str]] = None,
    driver: Optional[Driver] = None,
    executor: Optional[Executor] = None,
    options: Optional[list] = None,
) -> SparkConnectInfo:
    """Create a new SparkConnect session (INTERNAL USE ONLY).

    This is an internal method used by SparkClient.connect().
    Use SparkClient.connect() instead of calling this directly.

    Args:
        num_executors: Number of executor instances.
        resources_per_executor: Resource requirements per executor.
        spark_conf: Spark configuration properties.
        driver: Driver configuration.
        executor: Executor configuration.
        options: List of configuration options (use Name option for custom name).
Contributor


Is the intent that connect() means “create session” on the backend interface or “connect to existing session”?

Member


It creates SparkConnect CRD and connects to that session.
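So SparkClient.connect() has two paths: an explicit url attaches to an existing server, while omitting it asks the backend to create a SparkConnect resource and connect to it. A minimal sketch of that dispatch, with a stub standing in for the Kubernetes backend (StubBackend and MiniSparkClient are illustrative names, not the SDK's actual classes):

```python
from typing import Optional

class StubBackend:
    """Stand-in for the Kubernetes backend's create-and-connect path."""

    def connect(self, name: Optional[str] = None) -> str:
        # Real backend: create the SparkConnect CR, wait for READY,
        # then return the server endpoint.
        return f"sc://{name or 'auto'}-server:15002"

class MiniSparkClient:
    def __init__(self, backend=None):
        self.backend = backend or StubBackend()

    def connect(self, url: Optional[str] = None, name: Optional[str] = None) -> str:
        # Explicit URL: connect to an existing server, skip the backend.
        # Otherwise: have the backend create a session and return its endpoint.
        endpoint = url if url else self.backend.connect(name=name)
        # Real client: SparkSession.builder.remote(endpoint).getOrCreate()
        return endpoint

client = MiniSparkClient()
print(client.connect(url="sc://localhost:15002"))  # existing-server path
print(client.connect(name="my-session"))           # create-and-connect path
```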

@@ -0,0 +1,123 @@
#!/usr/bin/env python3
# Copyright 2025 The Kubeflow Authors.
Contributor


Could you please update this in copyright headers in all new files?

Suggested change
# Copyright 2025 The Kubeflow Authors.
# Copyright 2026 The Kubeflow Authors.


Contributor


We should also update docs in Kubeflow SDK website to include SparkClient https://sdk.kubeflow.org/en/latest/

@kramaranya
Contributor

@Shekharrajak could you please rebase the PR?

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
@google-oss-prow google-oss-prow bot removed the lgtm label Feb 12, 2026
@andreyvelich
Member

/lgtm
/assign @kramaranya

@google-oss-prow google-oss-prow bot added the lgtm label Feb 12, 2026
@kramaranya
Contributor

Thanks @Shekharrajak! Incredible work!!

/approve
/unhold

@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kramaranya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit c6a6c52 into kubeflow:main Feb 12, 2026
18 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.4 milestone Feb 12, 2026
@kramaranya
Contributor

@Shekharrajak could you please follow up on the items I mentioned #225 (review)? Or at least create github issues to track those?

@Shekharrajak
Member Author

@Shekharrajak could you please follow up on the items I mentioned #225 (review)? Or at least create github issues to track those?

Let me start working on them one by one.

openshift-merge-bot bot pushed a commit to opendatahub-io/kubeflow-sdk that referenced this pull request Mar 11, 2026
* chore!: upgrade to Python 3.10 (kubeflow#282)

This upgrades the minimum Python version for the project from 3.9 to
3.10. Python 3.9 is past end-of-life and dependencies will likely
require a supported version soon.

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* chore: Confirm that a public ConfigMap exists to check version (kubeflow#250)

* Confirm that a public ConfigMap exists to check version

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* python 3.9 fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Exceptiom handling better

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Addressing comments

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Update kubeflow/trainer/backends/kubernetes/backend.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>

* Refactored tests into a single function and followed agents.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* CI friendly edit

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* pre-commit format checked

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Modified according to new updates

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Ran pre-commit locally to fix formatting

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* unix2dos CLAUDE.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

* Revert CLAUDE.md

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>

---------

Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* chore: added sdk docs website to readme (kubeflow#284)

* docs: added sdk docs website to readme

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* format: order of sdk docs

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

---------

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

* feat(trainer): add dataset and model initializer support to container backend (kubeflow#188)

* feat(trainer): add dataset and model initializer support to container backend

Add support for dataset and model initializers in the container backend
to bring it to feature parity with the Kubernetes backend.

Changes:
- Add utility functions for building initializer commands and environment variables
- Implement _run_initializers() and _run_single_initializer() methods in ContainerBackend
- Run initializers sequentially before training containers start
- Download datasets to /workspace/dataset and models to /workspace/model
- Track initializer containers as separate steps in TrainJob
- Support all initializer types: HuggingFace, S3, and DataCache
- Add comprehensive unit tests for all initializer configurations
- Handle initializer failures with proper cleanup and error messages

Fixes kubeflow#171

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* feat(trainer): address reviewer feedback for initializer support

- Make initializer image configurable via ContainerBackendConfig
- Make initializer timeout configurable (default 600 seconds)
- Implement wait API in adapters instead of polling
- Clean up successful initializer containers after completion
- Clean up network on initializer failure
- Raise ValueError for unsupported initializer types (no datacache fallback)

All tests passing (173/173). Addresses all feedback from PR kubeflow#188.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* chore(trainer): add cleanup helper to reduce duplication

Add _cleanup_container_resources() helper method to consolidate
duplicated cleanup logic for stopping/removing containers and
deleting networks. Refactor 5 locations across train(), initializer
handlers, and delete_job() to use this helper.

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* fix(trainer): use correct initializer images and working directory

Address feedback for initializer support in container backend:

- Use separate images for dataset/model initializers:
  - kubeflow/dataset-initializer:latest for datasets
  - kubeflow/model-initializer:latest for models
  (instead of kubeflow/training-operator:latest)

- Update python commands to use pkg.initializers module:
  - python -m pkg.initializers.dataset (for dataset)
  - python -m pkg.initializers.model (for model)

- Change initializer working_dir from /workspace to /app
  per Dockerfile convention

Refs: https://github.com/kubeflow/trainer/tree/master/cmd/initializers
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* fix(container): address PR review comments for initializer support

- Use GHCR images as default for dataset/model initializers
- Replace suppress with try-except blocks
- Refactor initializer utils with ContainerInitializer dataclass
- Add get_dataset_initializer and get_model_initializer functions
- Remove DataCache support (unsupported in container backend)
- Merge initializer tests into test_train() and test_get_job_logs()
- Remove duplicate test functions

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* fix(container): add name field to ContainerInitializer and remove init_type

- Add name field to ContainerInitializer dataclass
- Set name='dataset-initializer' and name='model-initializer' in utils
- Remove init_type parameter from _run_single_initializer()
- Use container_init.name for labels and log messages

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

---------

Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>

* feat: add SparkClient API for SparkConnect session management (kubeflow#225)

* feat(spark): add core types, dataclasses, and constants

- Add SparkConnectInfo, SparkConnectState, Driver, Executor types
- Add type tests for validation
- Add Kubernetes backend constants (CRD group, version, defaults)

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add backend base class and options pattern

- Add RuntimeBackend abstract base class with session lifecycle methods
- Add options pattern (Name, Image, Timeout, etc.) aligned with trainer SDK
- Add validation utilities for connect parameters
- Add comprehensive option tests

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add KubernetesBackend for SparkConnect CRD operations

- Implement KubernetesBackend with create/get/list/delete session methods
- Add port-forward support for out-of-cluster connections
- Add CRD builder utilities and URL validation
- Add comprehensive backend and utils tests with parametrized patterns

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add SparkClient API with KEP-107 compliant connect method

- Implement SparkClient as main user interface for SparkConnect sessions
- Support connect to existing server (base_url) or auto-create new session
- Add public exports for SparkClient, Driver, Executor, options
- Add SparkClient unit tests

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* chore(spark): add test infrastructure and package init files

- Add test common utilities and fixtures
- Add package __init__ files for test directories
- Setup test/e2e/spark structure

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* feat(spark): add example scripts demonstrating SparkClient usage

- Add spark_connect_simple.py with 3 usage levels (minimal, simple, advanced)
- Add spark_advanced_options.py with full configuration examples
- Add connect_existing_session.py for connecting to existing servers
- Add demo and test scripts for local development

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* docs(spark): add documentation for SparkClient and E2E testing

- Add examples/spark/README.md with usage guide
- Add local Spark Connect testing documentation
- Add E2E test README with CI/CD integration guide
- Update KEP-107 proposal documentation

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* test(spark): add E2E test framework with cluster watcher

- Add test_spark_examples.py with example validation tests
- Add cluster_watcher.py for monitoring SparkConnect and pods during tests
- Add run_in_cluster.py for executing examples as K8s Jobs

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* ci(spark): add GitHub Actions workflow and E2E cluster setup

- Add test-spark-examples.yaml workflow for E2E validation
- Add e2e-setup-cluster.sh for Kind cluster with Spark Operator
- Add SparkConnect CRD, Kind config, and E2E runner Dockerfile
- Update Makefile with E2E setup target
- Update PR title check for spark prefix

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* chore(spark): add pyspark[connect] dependency and update lock file

- Add spark extra with pyspark[connect]==3.4.1 for grpcio, pandas, pyarrow
- Update uv.lock with resolved dependencies
- Update .gitignore for spark-related files

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* Update kubeflow/spark/backends/base.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>

* refactor(spark): rename backend.connect_session() to connect()

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

* refactor: move session creation flow from SparkClient to backend.create_and_connect()

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>

---------

Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
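
The KubernetesBackend commits above mention CRD builder utilities and URL validation for `sc://host:port` connection strings. As a rough illustration only (the function name and default port are assumptions, not the SDK's actual implementation), such validation might look like:

```python
from urllib.parse import urlparse


def validate_connect_url(url: str) -> tuple[str, int]:
    """Parse a Spark Connect URL of the form sc://host[:port].

    Hypothetical sketch; the real SDK utilities may differ.
    """
    parsed = urlparse(url)
    if parsed.scheme != "sc":
        raise ValueError(f"expected sc:// scheme, got {url!r}")
    if not parsed.hostname:
        raise ValueError(f"missing host in {url!r}")
    # 15002 is the conventional Spark Connect port, used here as a default
    return parsed.hostname, parsed.port or 15002
```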

* chore: bump minimum model-registry version to 0.3.6 (kubeflow#289)

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* fix: Improve CVE workflow (kubeflow#267)

* fix: Improve CVE workflow

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: fix issue with bash compare

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* feat: Add workflow to cleanup overrides in pyproject.toml

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: address review comments

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore: refactor to reduce size of cve related workflows

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

---------

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore: upgrade code style for python3.10 (kubeflow#288)

* chore: update code style for Python 3.10

This disables a couple ruff rules in pyproject.toml:
```
"UP007", # Use X | Y instead of Union[X, Y] (requires Python 3.10+)
"UP045", # Use X | None instead of Optional[X] (requires Python 3.10+)
```

Then the code changes are made with:
```
uv run ruff check --fix
uv run ruff format
```

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* fix: handle unions, bools in convert_value

The convert_value function didn't seem to be handling union types
properly, and it also needs to handle `T | None` the same way as
`Optional[T]` after the upgrade to Python 3.10. This fixes union
types, fixes an issue with bool conversion, and adds tests for this function.

Signed-off-by: Jon Burdo <jon@jonburdo.com>

---------

Signed-off-by: Jon Burdo <jon@jonburdo.com>

* chore(ci): bump astral-sh/setup-uv from 5 to 7 (kubeflow#276)

Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 5 to 7.
- [Release notes](https://github.com/astral-sh/setup-uv/releases)
- [Commits](astral-sh/setup-uv@v5...v7)

---
updated-dependencies:
- dependency-name: astral-sh/setup-uv
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the python-minor group across 1 directory with 4 updates (kubeflow#291)

Bumps the python-minor group with 4 updates in the / directory: [coverage](https://github.com/coveragepy/coveragepy), [ruff](https://github.com/astral-sh/ruff), [pre-commit](https://github.com/pre-commit/pre-commit) and [ty](https://github.com/astral-sh/ty).


Updates `coverage` from 7.10.7 to 7.13.4
- [Release notes](https://github.com/coveragepy/coveragepy/releases)
- [Changelog](https://github.com/coveragepy/coveragepy/blob/main/CHANGES.rst)
- [Commits](coveragepy/coveragepy@7.10.7...7.13.4)

Updates `ruff` from 0.14.14 to 0.15.0
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.14.14...0.15.0)

Updates `pre-commit` from 4.3.0 to 4.5.1
- [Release notes](https://github.com/pre-commit/pre-commit/releases)
- [Changelog](https://github.com/pre-commit/pre-commit/blob/main/CHANGELOG.md)
- [Commits](pre-commit/pre-commit@v4.3.0...v4.5.1)

Updates `ty` from 0.0.14 to 0.0.16
- [Release notes](https://github.com/astral-sh/ty/releases)
- [Changelog](https://github.com/astral-sh/ty/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ty@0.0.14...0.0.16)

---
updated-dependencies:
- dependency-name: coverage
  dependency-version: 7.13.4
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: ruff
  dependency-version: 0.15.0
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: pre-commit
  dependency-version: 4.5.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
  dependency-group: python-minor
- dependency-name: ty
  dependency-version: 0.0.16
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Added examples to the documentation demonstrating different ways to handle ports (kubeflow#243)

* update docs and add test cases.

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>

* pre-commit error solved

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>

* Update kubeflow/hub/api/model_registry_client.py

Co-authored-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

* readme updated

Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

* Refactor model registry client test cases for clarity

Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>

---------

Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>

* chore(ci): bump peter-evans/create-pull-request from 6 to 8 (kubeflow#277)

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 6 to 8.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](peter-evans/create-pull-request@v6...v8)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(ci): bump actions/checkout from 4 to 6 (kubeflow#278)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: Adds a GitHub Actions workflow to check kubeflow/hub/OWNERS. (kubeflow#280)

* Add OWNERS validation

Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>

* Update .github/workflows/check-owners.yaml

Co-authored-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

* Update OWNERS file check in workflow

Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

* Update paths in check-owners workflow

Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>

---------

Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>

* fix: nightly security dependency updates (kubeflow#296)

Co-authored-by: google-oss-prow <92114575+google-oss-prow@users.noreply.github.com>

* chore(ci): bump aquasecurity/trivy-action from 0.33.1 to 0.34.0 in the actions group (kubeflow#297)

Bumps the actions group with 1 update: [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action).


Updates `aquasecurity/trivy-action` from 0.33.1 to 0.34.0
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.33.1...0.34.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.34.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump pytest from 8.4.2 to 9.0.2 (kubeflow#301)

Bumps [pytest](https://github.com/pytest-dev/pytest) from 8.4.2 to 9.0.2.
- [Release notes](https://github.com/pytest-dev/pytest/releases)
- [Changelog](https://github.com/pytest-dev/pytest/blob/main/CHANGELOG.rst)
- [Commits](pytest-dev/pytest@8.4.2...9.0.2)

---
updated-dependencies:
- dependency-name: pytest
  dependency-version: 9.0.2
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(trainer): Support namespaced TrainingRuntime in the SDK (kubeflow#130)

* feat(backend): Support namespaced TrainingRuntime in the SDK

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed bugs and validated current test cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed pre-commit test failure

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Addressed comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed no attribute 'DEFAULT_TIMEOUT' error

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Added namespace-scoped runtime to test cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Addressed fallback logic bugs

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Added scope field to Runtime

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Improved code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed copilot's comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Shadow duplicate runtimes, priority to ns

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed bug

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Fixed copilot comments

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Improved test cases to validate all possible cases

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* small fix

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* lint fix

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* improved error message

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>

* refactored code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* improve code

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* Removed RuntimeScope

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

* removed scope references and improved error handling as per kubeflow standards

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>

---------

Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: Fix runtime lookup fallback and test local SDK in E2E (kubeflow#307)

* fix: Install SDK locally in E2E workflow and improve error handling for runtime fetching in Kubernetes backend.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* refactor: Explicitly return errors from  and refine exception handling in .

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* docs: update comment to clarify Kubeflow SDK installation from source in e2e workflow.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* feat: Enhance runtime retrieval tests to cover Kubernetes API 404/403 errors and partial success for list operations on timeout.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* refactor: Update runtime listing to immediately raise exceptions on failure instead of collecting partial results.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

---------

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>

* chore(ci): bump actions/setup-python from 5 to 6 (kubeflow#298)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 5 to 6.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v5...v6)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chore(deps): bump the python-minor group with 2 updates (kubeflow#299)

Bumps the python-minor group with 2 updates: [ruff](https://github.com/astral-sh/ruff) and [ty](https://github.com/astral-sh/ty).


Updates `ruff` from 0.15.0 to 0.15.1
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ruff@0.15.0...0.15.1)

Updates `ty` from 0.0.16 to 0.0.17
- [Release notes](https://github.com/astral-sh/ty/releases)
- [Changelog](https://github.com/astral-sh/ty/blob/main/CHANGELOG.md)
- [Commits](astral-sh/ty@0.0.16...0.0.17)

---
updated-dependencies:
- dependency-name: ruff
  dependency-version: 0.15.1
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
- dependency-name: ty
  dependency-version: 0.0.17
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: python-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix: improve logging around packages_to_install (kubeflow#269)

* improve logging around packages_to_install

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* exit when pip install fails, append errors from both attempts

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* Add shlex to address command injection vulnerabilities. Write pip install logfile to cwd

Signed-off-by: Brian Gallagher <briangal@gmail.com>

---------

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* feat: Add validate lockfile workflow to complement CVE scanning (kubeflow#306)

* feat: Add validate lockfile workflow to complement CVE scanning

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix: make cve fix pr branch static

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

---------

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* fix(trainer): handle falsy values in get_args_from_peft_config (kubeflow#328)

* fix(trainer): handle falsy values in get_args_from_peft_config

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: apply pre-commit formatting

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: also handle falsy train_on_input in dataset_preprocess_config

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: add missing newline at end of utils_test.py

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

* fix: pre-commit formatting

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

---------

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
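
The falsy-value fix above is a common bug class: filtering config values with `if value:` silently drops legitimate settings like `0`, `0.0`, `False`, or `""`. A generic sketch of the correct pattern (names are illustrative, not the actual `get_args_from_peft_config` code):

```python
def get_args_from_config(config: dict) -> list[str]:
    """Build CLI args, keeping falsy-but-set values like 0, False, or ''."""
    args = []
    for key, value in config.items():
        # `if value:` would drop 0, 0.0, False, "" — only skip unset values
        if value is not None:
            args.append(f"--{key}={value}")
    return args
```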

* fix(optimizer): prevent input mutation in optimize() (kubeflow#322)

* fix(optimizer): prevent input mutation in optimize()

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* remove unnecessary things

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* rename test

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

---------

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
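
Preventing input mutation, as in the fix above, usually means copying the caller's argument before normalizing it. A minimal sketch of the pattern (the function body and `max_trials` default are hypothetical, not the optimizer's real logic):

```python
import copy


def optimize(search_space: dict) -> dict:
    """Return a normalized copy without mutating the caller's input."""
    # deep copy so nested dicts/lists in the caller's object stay untouched
    space = copy.deepcopy(search_space)
    space.setdefault("max_trials", 10)
    return space
```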

* feat: add TrainerClient examples for local PyTorch distributed training (kubeflow#312)

* docs: add TrainerClient examples for local PyTorch distributed training

- Add examples/trainer/pytorch_distributed_simple.py
- Add examples/trainer/README.md
- Demonstrates LocalProcessBackend usage without Kubernetes
- Fixes kubeflow#218

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* docs: add training examples table to SDK website

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* docs: expand examples table with PyTorch, MLX, DeepSpeed, and TorchTune examples grouped by framework

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

---------

Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>

* chore: fix docstrings in TrainerClient (kubeflow#333)

Signed-off-by: Transcendental-Programmer <priyena.programming@gmail.com>

* feat(spark): Refactor unit tests to sdk coding standards  (kubeflow#293)

* Refactored unit test

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Changes made

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Version

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Restructured client_test

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* reformatted backend_test.py

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* revert pyproject.toml and uv.lock changes

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* Standardized spark backend tests

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* backend_tests

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

---------

Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>

* fix(optimizer): add missing get_job_events() to RuntimeBackend base c… (kubeflow#325)

* fix(optimizer): add missing get_job_events() to RuntimeBackend base class

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* Update kubeflow/optimizer/backends/base.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* Update kubeflow/optimizer/backends/base.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* fix: add abstractmethod, remove docstrings

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* make get_job_events abstract in RuntimeBackend

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

* Update kubeflow/trainer/backends/localprocess/backend.py

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>

* fix

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>

---------

Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* chore(spark): migrate SDK to kubeflow_spark_api Pydantic models (kubeflow#295)

* chore(spark): add kubeflow-spark-api dependency

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate options to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate utils to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): migrate backend to typed Pydantic models

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): refactor tests to use typed models and cleanup

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): rename build_spark_connect_crd to build_spark_connect_cr

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* fix(spark): use typed model helpers in mock handlers

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* chore(spark): bump kubeflow-spark-api to 2.4.0

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

---------

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* feat(docs): Update README with Spark Support  (kubeflow#349)

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* fix(trainer): return TRAINJOB_COMPLETE when all steps are done (kubeflow#340)

* fix(local): return TRAINJOB_COMPLETE when all steps are done (kubeflow#338)

Signed-off-by: priyank <priyank8445@gmail.com>

* test(trainer): add test case for __get_job_status

Signed-off-by: priyank <priyank8445@gmail.com>

* fix(trainer): early return TRAINJOB_CREATED when job has no steps

Signed-off-by: priyank <priyank8445@gmail.com>

* test(trainer): refactor test_get_job_status with TestCase fixture

Signed-off-by: priyank <priyank8445@gmail.com>

---------

Signed-off-by: priyank <priyank8445@gmail.com>
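
The status fix above distinguishes three cases: no steps yet, all steps finished, and anything in between. A simplified sketch of that logic (the step shape and status strings here are assumptions based on the commit messages, not the backend's actual code):

```python
def get_job_status(steps: list[dict]) -> str:
    """Derive an overall TrainJob status from its step statuses."""
    if not steps:
        # no steps recorded yet: job was created but nothing has run
        return "TRAINJOB_CREATED"
    if all(s["status"] == "Succeeded" for s in steps):
        return "TRAINJOB_COMPLETE"
    if any(s["status"] == "Failed" for s in steps):
        return "TRAINJOB_FAILED"
    return "TRAINJOB_RUNNING"
```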

* fix(trainer): adapt SDK to removal of numProcPerNode from TorchMLPolicySource (kubeflow#360)

Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>

* fix: Make validate-lockfile action non-blocking (kubeflow#361)

Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>

* chore(spark): change pyspark[connect] dependency (kubeflow#357)

Change pyspark[connect] 3.4.1 dependency to pyspark-connect 4.0.1.

This matches the version of Spark in the spark-operator container image
(https://github.com/kubeflow/spark-operator/blob/master/Dockerfile#L17).

Signed-off-by: Ali Maredia <amaredia@redhat.com>

* chore(spark): remove SDK-side validation from SparkClient (kubeflow#345)

Remove all SDK-side input validation from the spark module.
Validation will be handled server-side by the Spark Operator
admission webhooks (spark-operator#2862).

- Remove validation.py and validation_test.py
- Remove isinstance checks from _create_session()
- Remove ValidationError from public API

Closes: kubeflow#272

Signed-off-by: Yassin Nouh <yassinnouh21@gmail.com>
Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>

* chore: Merge upstream/main (preserving downstream config)

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* update workflow to skip requirements generation on merge conflict

Signed-off-by: Brian Gallagher <briangal@gmail.com>

remove compatibility with python 3.9 and updated tests

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* fix tests

Signed-off-by: Brian Gallagher <briangal@gmail.com>

---------

Signed-off-by: Jon Burdo <jon@jonburdo.com>
Signed-off-by: Surya Sameer Datta Vaddadi <f20220373@goa.bits-pilani.ac.in>
Signed-off-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: HKanoje <hrithik.kanoje@gmail.com>
Signed-off-by: Shekhar Rajak <shekharrajak@live.com>
Signed-off-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Signed-off-by: Fiona-Waters <fiwaters6@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: osamaahmed17 <osamaahmedtahir17@gmail.com>
Signed-off-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Signed-off-by: muhammadjunaid8047 <muhammadjunaid8047@gmail.com>
Signed-off-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Signed-off-by: Moeed Shaik <shaikmoeed@gmail.com>
Signed-off-by: Moeed <shaikmoeed@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: Brian Gallagher <briangal@gmail.com>
Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Signed-off-by: ruskaruma <ishaan.sinha10@gmail.com>
Signed-off-by: Mansi Singh <singh.m1@northeastern.edu>
Signed-off-by: Transcendental-Programmer <priyena.programming@gmail.com>
Signed-off-by: digvijay-y <yewaredigvijay@gmail.com>
Signed-off-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Signed-off-by: tariq-hasan <mmtariquehsn@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: priyank <priyank8445@gmail.com>
Signed-off-by: Ali Maredia <amaredia@redhat.com>
Signed-off-by: Yassin Nouh <yassinnouh21@gmail.com>
Signed-off-by: yassinnouh21 <yassinnouh21@gmail.com>
Co-authored-by: Jon Burdo <jon@jonburdo.com>
Co-authored-by: Surya Sameer Datta Vaddadi <137607947+sameerdattav@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Co-authored-by: Hrithik Kanoje <128607033+HKanoje@users.noreply.github.com>
Co-authored-by: Shekhar Prasad Rajak <5774448+Shekharrajak@users.noreply.github.com>
Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Osama Tahir <31954609+osamaahmed17@users.noreply.github.com>
Co-authored-by: Muhammad Junaid <muhammadjunaid8047@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: google-oss-prow <92114575+google-oss-prow@users.noreply.github.com>
Co-authored-by: Moeed <shaikmoeed@gmail.com>
Co-authored-by: Yash Agarwal <2004agarwalyash@gmail.com>
Co-authored-by: krishdef7 <157892833+krishdef7@users.noreply.github.com>
Co-authored-by: Ruskaruma <154019945+ruskaruma@users.noreply.github.com>
Co-authored-by: Mansi Singh <mansimaanu8627@gmail.com>
Co-authored-by: Priyansh Saxena <130545865+priyansh-saxena1@users.noreply.github.com>
Co-authored-by: DIGVIJAY <144053736+digvijay-y@users.noreply.github.com>
Co-authored-by: Tariq Hasan <mmtariquehsn@gmail.com>
Co-authored-by: Priyank Patel <147739348+priyank766@users.noreply.github.com>
Co-authored-by: Ali Maredia <amaredia@redhat.com>
Co-authored-by: Yassin Nouh <70436855+YassinNouh21@users.noreply.github.com>