Description
We are working on adding support for Spark Operator in the Kubeflow SDK.
Our initial goal is to provide a SparkClient().connect() API. This API will create a SparkConnect CR, enabling users to connect to Spark clusters and execute PySpark queries directly from the SDK.
Next, we plan to introduce a submit_job() API that will allow users to submit batch Spark jobs by creating a SparkApplication CR (see the sketch below).
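For illustration, here is a rough sketch of what the client surface could look like. Only SparkClient(), connect(), and submit_job() come from this issue; the import path, parameters, and return value are placeholders, not a released API:

```python
# Hypothetical sketch of the proposed SDK surface. Only SparkClient(),
# connect(), and submit_job() are named in this issue; the import path
# and parameters are placeholders.
from kubeflow.spark import SparkClient

client = SparkClient()

# connect() creates a SparkConnect CR and returns a session for running
# PySpark queries interactively against the cluster.
spark = client.connect(namespace="default")
spark.sql("SELECT 1").show()

# submit_job() creates a SparkApplication CR for batch execution.
client.submit_job(
    name="pi",
    namespace="default",
    main_application_file="local:///opt/spark/examples/src/main/python/pi.py",
)
```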
Proposal: https://github.com/kubeflow/sdk/tree/main/docs/proposals/107-spark-client
Since the Spark Operator does not publish its APIs to PyPI, we currently have to construct CRs using raw dictionaries and pass them into create_namespaced_custom_object(): kubeflow/sdk#225 (comment)
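For context, the current workaround looks roughly like this (the SparkConnect group/version, plural, and spec fields below are illustrative assumptions, not verified against the CRD):

```python
# Current workaround: build the SparkConnect CR as a raw dict and create
# it via the generic custom-objects API of the Kubernetes Python client.
# Group/version/plural and spec fields are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

spark_connect = {
    "apiVersion": "sparkoperator.k8s.io/v1alpha1",
    "kind": "SparkConnect",
    "metadata": {"name": "spark-connect", "namespace": "default"},
    "spec": {
        # Untyped spec: no IDE completion and no client-side validation.
        "sparkVersion": "4.0.0",
    },
}

api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1alpha1",
    namespace="default",
    plural="sparkconnects",
    body=spark_connect,
)
```

This works, but the raw dicts are easy to get wrong and errors only surface when the API server rejects the object.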
We should move to a structured, model-based approach using generated API models from the Spark Operator – similar to how this is done for Trainer and Katib.
The goal is to create a kubeflow_spark_api package under the api/python_api directory and publish it to PyPI, so the SDK can build CRs from typed models instead of raw dicts. Example from Trainer:
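For the Spark side, constructing the same CR from generated models could then look like the hypothetical sketch below (the kubeflow_spark_api package exists only as a goal here, and all class and field names are assumptions about what the generator would produce):

```python
# Hypothetical usage of generated Spark Operator models; class and field
# names are assumptions about the generator's output, mirroring the
# pattern used for the Trainer models.
from kubeflow_spark_api import models

spark_connect = models.V1alpha1SparkConnect(
    api_version="sparkoperator.k8s.io/v1alpha1",
    kind="SparkConnect",
    metadata=models.V1ObjectMeta(name="spark-connect", namespace="default"),
    spec=models.V1alpha1SparkConnectSpec(spark_version="4.0.0"),
)
```

Typed models would give attribute access, client-side validation, and consistent serialization, matching how Trainer and Katib handle their CRs.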
Ref discussion: https://youtu.be/WsOPaeXxtkA?t=1089
cc @kubeflow/kubeflow-sdk-team @Shekharrajak @vara-bonthu @nabuskey @ChenYi015