
Publish Spark Operator APIs to PyPI #2818


Description

@andreyvelich

We are working on adding support for the Spark Operator in the Kubeflow SDK.

Our initial goal is to provide a SparkClient().connect() API. This API will create a SparkConnect CR, enabling users to connect to Spark clusters and execute PySpark queries directly from the SDK.
Next, we plan to introduce a submit_job() API that will allow users to submit batch Spark jobs by creating a SparkApplication CR.
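
A rough usage sketch of the planned SDK surface is shown below. The import path, constructor arguments, and return types are assumptions based on the proposal; the final signatures may differ.

```python
# Hypothetical usage sketch of the planned Kubeflow SDK Spark APIs.
# Module path and argument names are assumptions, not published APIs.
from kubeflow.spark import SparkClient  # assumed import path

client = SparkClient()

# connect() is expected to create a SparkConnect CR in the cluster and return
# a Spark Connect session for running PySpark queries directly from the SDK.
spark = client.connect()
spark.sql("SELECT 1 AS ok").show()

# submit_job() is planned as a follow-up API that creates a SparkApplication
# CR for batch workloads (argument names here are placeholders).
client.submit_job(
    name="pi",
    image="spark:4.0.0",
    main_application_file="local:///opt/spark/examples/src/main/python/pi.py",
)
```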

Proposal: https://github.com/kubeflow/sdk/tree/main/docs/proposals/107-spark-client

Since the Spark Operator does not publish its APIs to PyPI, we currently have to construct CRs using raw dictionaries and pass them into create_namespaced_custom_object(): kubeflow/sdk#225 (comment)
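
For reference, here is a minimal sketch of the current dictionary-based flow using the plain Kubernetes Python client. The SparkConnect group/version/plural and spec fields shown are illustrative and must match the CRDs installed by the operator.

```python
# Current approach: build the CR as a raw dict and create it through the
# generic custom-objects API. No typing, validation, or IDE completion.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

spark_connect = {
    "apiVersion": "sparkoperator.k8s.io/v1alpha1",
    "kind": "SparkConnect",
    "metadata": {"name": "example", "namespace": "default"},
    "spec": {
        # Untyped fields: typos are only caught by the API server, if at all.
        "sparkVersion": "4.0.0",
    },
}

api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1alpha1",
    namespace="default",
    plural="sparkconnects",
    body=spark_connect,
)
```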

We should move to a structured, model-based approach using generated API models from the Spark Operator – similar to how this is done for Trainer and Katib.

The goal is to create a kubeflow_spark_api package under the api/python_api directory and publish it to PyPI, following the Trainer example.
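
For illustration only, a rough sketch of what the model-based flow could look like once generated models are published. The kubeflow_spark_api module, class names, and field names below are assumptions modeled on the Trainer and Katib packages.

```python
# Hypothetical model-based approach with generated OpenAPI models.
# All kubeflow_spark_api names are assumptions; the package does not exist yet.
from kubernetes import client, config
from kubeflow_spark_api import models  # assumed package and module

spark_connect = models.V1alpha1SparkConnect(
    api_version="sparkoperator.k8s.io/v1alpha1",
    kind="SparkConnect",
    metadata=models.V1ObjectMeta(name="example", namespace="default"),
    spec=models.V1alpha1SparkConnectSpec(spark_version="4.0.0"),
)

config.load_kube_config()
api_client = client.ApiClient()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1alpha1",
    namespace="default",
    plural="sparkconnects",
    # Generated models serialize to the correctly camelCased dict.
    body=api_client.sanitize_for_serialization(spark_connect),
)
```

Typed models would give users field validation and autocompletion at construction time instead of failing at apply time, which is the main motivation for publishing the package.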

Ref discussion: https://youtu.be/WsOPaeXxtkA?t=1089

cc @kubeflow/kubeflow-sdk-team @Shekharrajak @vara-bonthu @nabuskey @ChenYi015
