Description
Discussed in #5352
Originally posted by SchKng April 24, 2025
Hi all,
I'm currently setting up SkyPilot for our organization, and I'm running into some limitations with the way Kubernetes is handled at the config.yaml level.
Our setup
Here's a summary of our setup:
- 3 Kubernetes clusters:
  - GKE (on GCP)
  - EKS (on AWS)
  - K3S (on-prem, deployed with sky local up)
- Sky API server deployed remotely (multi-user teams)
Kubernetes contexts management
As far as I understand, Kubernetes is handled as a single cloud through the config.yaml file.
Each Kubernetes cluster has to be added to the allowed_contexts list in the kubernetes section of the config.
I'm going to take the example of the autoscaling configuration to illustrate my point.
If I want to enable the autoscaling feature, I can only do it at the "kubernetes" level in the config.yaml deployed on the API server:
    kubernetes:
      allowed_contexts:
        - gke_context
        - eks_context
        - on_prem_context
      provision_timeout: 900
      autoscaler: gke
Let's say I want to force a job or cluster to run on our EKS cluster (--cloud k8s --region eks_context).
As it stands, given the previous config on the remote API server, it will try to provision using the GKE autoscaling feature, which does not apply to the EKS cluster, and fail.
Of course, I could override the configuration through the CLI or through my local config.yaml (as specified here).
However, as the admin of the whole setup, I'd like to be able to manage all of this within the config.yaml of the remote API server, rather than hope that users remember to change or override the config!
Benefits & design proposal
This could also have other benefits, such as specifying a different service account, custom_metadata, and so on for each k8s context: basically any of the kubernetes options offered in the config.yaml.
A proposal could be to have something like:
    kubernetes:
      gke_context:
        autoscaler: gke
        provision_timeout: 600
        remote_identity: my-gke-service-account
      eks_context:
        autoscaler: xxx
        remote_identity: my-eks-service-account
      on_prem_context:
        autoscaler: none
        provision_timeout: 300
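
To make the intended semantics concrete, here is a minimal Python sketch of how such per-context settings might be resolved. It is only an illustration of the proposal, not SkyPilot's actual implementation: the function resolve_context_config, the DEFAULTS table, and the idea of falling back to shared defaults for keys a context does not set are all assumptions on my side.

```python
from typing import Any, Dict

# Hypothetical shared defaults, used when a context does not set a key.
DEFAULTS: Dict[str, Any] = {"provision_timeout": 900}

# Mirrors the proposed per-context layout above ("xxx" is the placeholder
# from the proposal, not a real autoscaler name).
PER_CONTEXT: Dict[str, Dict[str, Any]] = {
    "gke_context": {
        "autoscaler": "gke",
        "provision_timeout": 600,
        "remote_identity": "my-gke-service-account",
    },
    "eks_context": {
        "autoscaler": "xxx",
        "remote_identity": "my-eks-service-account",
    },
    "on_prem_context": {
        "autoscaler": "none",
        "provision_timeout": 300,
    },
}


def resolve_context_config(context: str) -> Dict[str, Any]:
    """Return the effective settings for one Kubernetes context.

    Per-context values win over the shared defaults; unknown contexts
    are rejected instead of silently inheriting another cluster's setup.
    """
    if context not in PER_CONTEXT:
        raise KeyError(f"Unknown Kubernetes context: {context}")
    return {**DEFAULTS, **PER_CONTEXT[context]}
```

With this kind of lookup, a request pinned to eks_context would never pick up the GKE autoscaler, and the admin would only need to maintain the server-side config.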
Please let me know if I missed something or if there's another way to do that.
Thanks a lot!