KEP-2: Local Execution Mode Proposal #8


Open · wants to merge 1 commit into main

Conversation

@szaher commented Apr 29, 2025

What this PR does / why we need it:

Enable Kubeflow Trainer users to run training jobs locally before submitting them to Kubeflow Trainer.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2

Checklist:

- Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@kramaranya left a comment


Thanks, great work! I left a few comments.


## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a Kubernetes-based infrastructure. The feature will enable ML engineers to use Subprocess, Docker, or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources. This local execution mode will allow for rapid prototyping, debugging, and validation of training jobs.


Worth mentioning security too? Running untested or experimental code directly on the cluster can introduce security risks.


### Notes/Constraints/Caveats
- The local execution mode will work only with Podman, Docker, and Subprocess.
- The subprocess implementation will be restricted to a single node.


Could we clarify what subprocess means here?
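
For concreteness, one possible reading of "subprocess" here, shown only as an illustrative sketch (the behavior and names below are assumptions, not the KEP's settled design): the training code runs as a plain local OS process on the user's machine, with no container runtime involved, which is also why it is limited to a single node.

```python
# Rough illustration of a "subprocess" backend: execute the training code as a
# local Python process instead of inside a container. Nothing here is final API.
import subprocess
import sys
import tempfile

TRAIN_SCRIPT = 'print("training in a local subprocess, single node only")\n'

# Materialize the user's training code as a script...
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(TRAIN_SCRIPT)
    script_path = f.name

# ...then run it as a child process of the SDK, roughly what a
# subprocess-based job client might do under the hood.
subprocess.run([sys.executable, script_path], check=True)
```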


The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.

![Architecture Diagram](high-level-arch.png)


The PNG is a bit heavy; could we convert it to SVG?


## Proposal

The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.


Cool! Shall we throw in a tiny code sample that calls LocalTrainerClient().train()?
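
Something like the sketch below could work. To be clear, LocalTrainerClient, its backend argument, and the method names are assumptions about an API this KEP has not finalized, not the implemented SDK:

```python
# Hypothetical usage sketch only; none of these names are finalized API.
from kubeflow.trainer import CustomTrainer, LocalTrainerClient  # assumed import path


def train_fn():
    # Placeholder training function; a real one would run a PyTorch/HF loop.
    print("running one local training step...")


client = LocalTrainerClient(backend="docker")  # or "podman" / "subprocess"
job_name = client.train(trainer=CustomTrainer(func=train_fn, num_nodes=1))

for line in client.get_job_logs(job_name):  # assumed helper for local logs
    print(line)
```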

### User Stories (Optional)

#### Story 1
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.


How does the user choose to use Podman or Docker?
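
One possible answer, sketched below purely as an illustration (the resolution order and helper are assumptions, not decided behavior): the user passes an explicit backend to the local client, and if none is given the SDK falls back to whichever container runtime is found on the PATH, then to subprocess execution.

```python
import shutil
from typing import Optional


# Hypothetical helper: resolve which local backend to use for a training job.
def resolve_backend(preferred: Optional[str] = None) -> str:
    if preferred:                    # explicit user choice, e.g. "podman"
        return preferred
    for runtime in ("podman", "docker"):
        if shutil.which(runtime):    # first container CLI found on PATH wins
            return runtime
    return "subprocess"              # no container runtime installed


print(resolve_backend("docker"))  # user explicitly picks Docker
print(resolve_backend())          # auto-detected fallback
```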

- The **PodmanJobClient** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
- Containers will be labeled with job IDs, making it possible to track job status and logs.
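
For illustration, a rough sketch of that label-based tracking using the podman-py client; the socket path, image, and label key below are placeholders rather than anything defined by this KEP:

```python
# Sketch only: how a PodmanJobClient might tag containers with a job ID and
# later find them again by label. All concrete values are placeholders.
from podman import PodmanClient

JOB_ID = "trainjob-1234"
LABEL_KEY = "trainer.kubeflow.org/job-id"

with PodmanClient(base_url="unix:///run/podman/podman.sock") as client:
    # Start one training node, tagged with the job ID.
    client.containers.run(
        image="docker.io/library/python:3.11",
        command=["python", "-c", "print('training...')"],
        labels={LABEL_KEY: JOB_ID},
        detach=True,
    )

    # Job status (and logs) can later be recovered purely from the label.
    for container in client.containers.list(all=True, filters={"label": f"{LABEL_KEY}={JOB_ID}"}):
        print(container.name, container.status)
```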

![Detailed Workflow](detailed-workflow.png)


The arrow between Trainer SDK and Initialize Container Based Trainer Client points upward. I think the flow should start at Trainer SDK and go down into the init step, shouldn't it?

As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
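
One possible shape for this, sketched with the docker SDK for Python under the assumption of a shared named volume between an initializer container and the trainer container; the image, volume name, and mount path are placeholders:

```python
# Sketch only: initialize a dataset into a named volume that the training
# container can later mount. Image, volume name, and paths are made up.
import docker

client = docker.from_env()

# Shared workspace volume for this job.
client.volumes.create(name="trainjob-1234-workspace")

# Hypothetical initializer container downloads the dataset into /workspace;
# the trainer container would then be started with the same volume mounted.
client.containers.run(
    image="ghcr.io/example/dataset-initializer:latest",
    volumes={"trainjob-1234-workspace": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)
```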

### Notes/Constraints/Caveats
- The local execution mode will work only with Podman, Docker and Subporcess.


Suggested change:
Before: - The local execution mode will work only with Podman, Docker and Subporcess.
After: - The local execution mode will work only with Podman, Docker and Subprocess.
