Description
We are working on adding support for Spark Operator in the Kubeflow SDK.
Our initial goal is to provide a SparkClient().connect() API. This API will create a SparkConnect CR, enabling users to connect to Spark clusters and execute PySpark queries directly from the SDK.
Next, we plan to introduce a submit_job() API that will allow users to submit batch Spark jobs by creating a SparkApplication CR (see the sketch below).
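For illustration, here is a rough sketch of what the client surface could look like. Only SparkClient(), connect(), and submit_job() come from this issue; the import path, parameters, and return value are placeholders, not a released API:

```python
# Hypothetical sketch of the proposed SDK surface. Only SparkClient(),
# connect(), and submit_job() are named in this issue; the import path
# and parameters are placeholders.
from kubeflow.spark import SparkClient

client = SparkClient()

# connect() creates a SparkConnect CR and returns a session for running
# PySpark queries interactively against the cluster.
spark = client.connect(namespace="default")
spark.sql("SELECT 1").show()

# submit_job() creates a SparkApplication CR for batch execution.
client.submit_job(
    name="pi",
    namespace="default",
    main_application_file="local:///opt/spark/examples/src/main/python/pi.py",
)
```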
Proposal: https://github.com/kubeflow/sdk/tree/main/docs/proposals/107-spark-client
Since the Spark Operator does not publish its APIs to PyPI, we currently have to construct CRs using raw dictionaries and pass them into create_namespaced_custom_object(): kubeflow/sdk#225 (comment)
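For context, the current workaround looks roughly like this (the SparkConnect group/version, plural, and spec fields below are illustrative assumptions, not verified against the CRD):

```python
# Current workaround: build the SparkConnect CR as a raw dict and create
# it via the generic custom-objects API of the Kubernetes Python client.
# Group/version/plural and spec fields are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

spark_connect = {
    "apiVersion": "sparkoperator.k8s.io/v1alpha1",
    "kind": "SparkConnect",
    "metadata": {"name": "spark-connect", "namespace": "default"},
    "spec": {
        # Untyped spec: no IDE completion and no client-side validation.
        "sparkVersion": "4.0.0",
    },
}

api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1alpha1",
    namespace="default",
    plural="sparkconnects",
    body=spark_connect,
)
```

This works, but the raw dicts are easy to get wrong and errors only surface when the API server rejects the object.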
We should move to a structured, model-based approach using generated API models from the Spark Operator – similar to how this is done for Trainer and Katib.
The goal is to create a kubeflow_spark_api package under the api/python_api directory and publish it to PyPI, so the SDK can build CRs from typed models instead of raw dicts. Example from Trainer:
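For the Spark side, constructing the same CR from generated models could then look like the hypothetical sketch below (the kubeflow_spark_api package exists only as a goal here, and all class and field names are assumptions about what the generator would produce):

```python
# Hypothetical usage of generated Spark Operator models; class and field
# names are assumptions about the generator's output, mirroring the
# pattern used for the Trainer models.
from kubeflow_spark_api import models

spark_connect = models.V1alpha1SparkConnect(
    api_version="sparkoperator.k8s.io/v1alpha1",
    kind="SparkConnect",
    metadata=models.V1ObjectMeta(name="spark-connect", namespace="default"),
    spec=models.V1alpha1SparkConnectSpec(spark_version="4.0.0"),
)
```

Typed models would give attribute access, client-side validation, and consistent serialization, matching how Trainer and Katib handle their CRs.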
Ref discussion: https://youtu.be/WsOPaeXxtkA?t=1089
cc @kubeflow/kubeflow-sdk-team @Shekharrajak @vara-bonthu @nabuskey @ChenYi015