KEP-2: Local Execution Mode Proposal #8
Conversation
Enable Kubeflow Trainer users to run train jobs locally before submitting them to Kubeflow Trainer. Fixes kubeflow#2. Signed-off-by: Saad Zaher <[email protected]>
Thanks, great work! I left a few comments.
## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to Kubernetes-based infrastructure. The feature will enable ML engineers to use Subprocess, Docker, or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources. This local execution mode will allow for rapid prototyping, debugging, and validation of training jobs.
Worth mentioning security too? Running untested or experimental code directly on the cluster can introduce security risks.
### Notes/Constraints/Caveats

- The local execution mode will work only with Podman, Docker, and Subprocess.
- The subprocess implementation will be restricted to a single node.
Could we clarify what `subprocess` means here?
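For what it's worth, one possible reading (an assumption on my part, not something the KEP states) is that "Subprocess" means executing the training function as a plain local Python process with no container runtime involved, roughly along these lines:

```python
# Illustrative only: this assumes "Subprocess" means running the user's training
# function as a plain local Python process, with no container runtime involved.
import subprocess
import sys
import tempfile
import textwrap

# The user's training function, serialized into a standalone script.
train_script = textwrap.dedent("""
    print("running one local training step...")
""")

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(train_script)
    script_path = f.name

# Run the script with the current interpreter; stdout/stderr become the job logs.
result = subprocess.run([sys.executable, script_path], capture_output=True, text=True)
print(result.stdout)
```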
The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.


The png is a bit heavy, could we convert to svg?
## Proposal

The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.
Cool! Shall we throw in a tiny code sample that calls `LocalTrainerClient().train()`?
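A minimal sketch of what such a sample might look like. `LocalTrainerClient`, its import path, and the `train()` arguments shown here are assumptions rather than a finalized API, since the interface is still being proposed:

```python
# Hypothetical sketch only: the class name, import path, and arguments below
# are not a settled API; they illustrate the intended local-first workflow.
from kubeflow.trainer import LocalTrainerClient  # assumed import path

def train_func():
    # Plain Python training function; it would run inside the local
    # subprocess or container started by the client.
    print("training locally...")

client = LocalTrainerClient()              # defaults to a local execution backend
job = client.train(
    train_func=train_func,                 # assumed parameter name
    base_image="pytorch/pytorch:latest",   # assumed parameter name
)

# Once the local run looks good, the same training function could be submitted
# to a cluster by swapping in the regular Kubernetes-backed client.
print(client.get_job_logs(job))            # assumed helper for fetching logs
```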
### User Stories (Optional)

#### Story 1

As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.
How does the user choose to use Podman or Docker?
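One possible shape, sketched under the assumption that the runtime is picked via a constructor argument (none of these names are decided in the KEP):

```python
# Hypothetical sketch: how a user might pick the container runtime. The
# "runtime" constructor argument is an assumption, not a decided interface.
from kubeflow.trainer import LocalTrainerClient  # assumed import path

docker_client = LocalTrainerClient(runtime="docker")          # Docker backend
podman_client = LocalTrainerClient(runtime="podman")          # Podman backend
subprocess_client = LocalTrainerClient(runtime="subprocess")  # no containers

# All three would expose the same train() interface, so switching runtimes
# should not require changes to the training code itself.
```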
- The **PodmanJobClient** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
- Containers will be labeled with job IDs, making it possible to track job status and logs (see the sketch after the diagram below).


The arrow between `Trainer SDK` and `Initialize Container Based Trainer Client` points upward. I think the flow should start at Trainer SDK and go down into the init step, shouldn't it?
As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

### Notes/Constraints/Caveats

- The local execution mode will work only with Podman, Docker and Subporcess.
Suggested change:
- The local execution mode will work only with Podman, Docker and Subporcess.
+ The local execution mode will work only with Podman, Docker and Subprocess.