KEP-2: Local Execution Mode Proposal #8


Open · wants to merge 1 commit into main

Conversation

@szaher commented Apr 29, 2025

What this PR does / why we need it:

Enable Kubeflow Trainer users to run training jobs locally before submitting them to Kubeflow Trainer.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2

Checklist:

- Docs included if any changes are user facing


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@kramaranya left a comment


Thanks, great work! I left a few comments.


## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to a Kubernetes-based infrastructure. The feature will enable ML engineers to use Subprocess, Docker, or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources. This local execution mode will allow for rapid prototyping, debugging, and validation of training jobs.


Worth mentioning security too? Running untested or experimental code directly on the cluster can introduce security risks.


### Notes/Constraints/Caveats
- The local execution mode will work only with Podman, Docker, and Subprocess.
- The subprocess implementation will be restricted to a single node.


Could we clarify what subprocess means here?
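
For concreteness, one possible reading of "subprocess" here, shown only as an illustrative sketch (the behavior and names below are assumptions, not the KEP's settled design): the training code runs as a plain local OS process on the user's machine, with no container runtime involved, which is also why it is limited to a single node.

```python
# Rough illustration of a "subprocess" backend: execute the training code as a
# local Python process instead of inside a container. Nothing here is final API.
import subprocess
import sys
import tempfile

TRAIN_SCRIPT = 'print("training in a local subprocess, single node only")\n'

# Materialize the user's training code as a script...
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(TRAIN_SCRIPT)
    script_path = f.name

# ...then run it as a child process of the SDK, roughly what a
# subprocess-based job client might do under the hood.
subprocess.run([sys.executable, script_path], check=True)
```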


The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.

![Architecture Diagram](high-level-arch.png)


The PNG is a bit heavy; could we convert it to SVG?


## Proposal

The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.


Cool! Shall we throw in a tiny code sample that calls LocalTrainerClient().train()?
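
Something like the sketch below could work. To be clear, LocalTrainerClient, its backend argument, and the method names are assumptions about an API this KEP has not finalized, not the implemented SDK:

```python
# Hypothetical usage sketch only; none of these names are finalized API.
from kubeflow.trainer import CustomTrainer, LocalTrainerClient  # assumed import path


def train_fn():
    # Placeholder training function; a real one would run a PyTorch/HF loop.
    print("running one local training step...")


client = LocalTrainerClient(backend="docker")  # or "podman" / "subprocess"
job_name = client.train(trainer=CustomTrainer(func=train_fn, num_nodes=1))

for line in client.get_job_logs(job_name):  # assumed helper for local logs
    print(line)
```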

### User Stories (Optional)

#### Story 1
As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.


How does the user choose to use Podman or Docker?
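
One possible answer, sketched below purely as an illustration (the resolution order and helper are assumptions, not decided behavior): the user passes an explicit backend to the local client, and if none is given the SDK falls back to whichever container runtime is found on the PATH, then to subprocess execution.

```python
import shutil
from typing import Optional


# Hypothetical helper: resolve which local backend to use for a training job.
def resolve_backend(preferred: Optional[str] = None) -> str:
    if preferred:                    # explicit user choice, e.g. "podman"
        return preferred
    for runtime in ("podman", "docker"):
        if shutil.which(runtime):    # first container CLI found on PATH wins
            return runtime
    return "subprocess"              # no container runtime installed


print(resolve_backend("docker"))  # user explicitly picks Docker
print(resolve_backend())          # auto-detected fallback
```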

- The **PodmanJobClient** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
- Containers will be labeled with job IDs, making it possible to track job status and logs.
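
For illustration, a rough sketch of that label-based tracking using the podman-py client; the socket path, image, and label key below are placeholders rather than anything defined by this KEP:

```python
# Sketch only: how a PodmanJobClient might tag containers with a job ID and
# later find them again by label. All concrete values are placeholders.
from podman import PodmanClient

JOB_ID = "trainjob-1234"
LABEL_KEY = "trainer.kubeflow.org/job-id"

with PodmanClient(base_url="unix:///run/podman/podman.sock") as client:
    # Start one training node, tagged with the job ID.
    client.containers.run(
        image="docker.io/library/python:3.11",
        command=["python", "-c", "print('training...')"],
        labels={LABEL_KEY: JOB_ID},
        detach=True,
    )

    # Job status (and logs) can later be recovered purely from the label.
    for container in client.containers.list(all=True, filters={"label": f"{LABEL_KEY}={JOB_ID}"}):
        print(container.name, container.status)
```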

![Detailed Workflow](detailed-workflow.png)


The arrow between Trainer SDK and Initialize Container Based Trainer Client points upward. I think the flow should start at Trainer SDK and go down into the init step, shouldn't it?

As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.
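
One possible shape for this, sketched with the docker SDK for Python under the assumption of a shared named volume between an initializer container and the trainer container; the image, volume name, and mount path are placeholders:

```python
# Sketch only: initialize a dataset into a named volume that the training
# container can later mount. Image, volume name, and paths are made up.
import docker

client = docker.from_env()

# Shared workspace volume for this job.
client.volumes.create(name="trainjob-1234-workspace")

# Hypothetical initializer container downloads the dataset into /workspace;
# the trainer container would then be started with the same volume mounted.
client.containers.run(
    image="ghcr.io/example/dataset-initializer:latest",
    volumes={"trainjob-1234-workspace": {"bind": "/workspace", "mode": "rw"}},
    detach=True,
)
```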

### Notes/Constraints/Caveats
- The local execution mode will work only with Podman, Docker and Subporcess.


Suggested change:
Before: - The local execution mode will work only with Podman, Docker and Subporcess.
After: - The local execution mode will work only with Podman, Docker and Subprocess.
