
[Multi-k8s] Use different configuration for different k8s contexts #5353

Open
@Michaelvll

Description


Discussed in #5352

Originally posted by SchKng April 24, 2025
Hi all,

I'm currently setting up SkyPilot for our org, and I'm running into some limitations with the way Kubernetes is handled at the config.yaml level.

Our setup

Here's a summary of our setup:

  • 3 kubernetes clusters
    • GKE (on GCP)
    • EKS (on AWS)
    • K3S (on-prem, deployed with sky local up)
  • Sky API server deployed remotely (multi-user teams)

Kubernetes context management

As far as I understand, Kubernetes is handled as a single cloud through the config.yaml file.
Each Kubernetes cluster has to be added to allowed_contexts in the kubernetes section of the config.

I'll take the autoscaling configuration as an example to illustrate my point.
If I want to enable autoscaling, I can only do it at the top-level kubernetes section of the config.yaml deployed on the API server:

kubernetes:
  allowed_contexts:
    - gke_context
    - eks_context
    - on_prem_context
  provision_timeout: 900
  autoscaler: gke

Let's say I want to force a job or cluster to run on our EKS cluster (--cloud k8s --region eks_context).
As it stands, given the config above on the remote API server, SkyPilot will apply the GKE autoscaler logic while provisioning on EKS, and fail.
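
For illustration, the failing scenario would be triggered by something like the following (task.yaml standing in for any hypothetical task definition):

sky launch --cloud k8s --region eks_context task.yaml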

Of course, I could override the configuration through the CLI or through my local config.yaml (as specified here).
However, as an admin of the whole setup, I'd like to be able to manage all of this within the config.yaml of the remote API server, rather than hope that users remember to change or override their config!
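
For completeness, such a client-side override might look like the following local config.yaml. This is only a sketch: it assumes the EKS cluster runs Karpenter, and that karpenter is an accepted autoscaler value.

kubernetes:
  # Overrides the server-side `autoscaler: gke` for launches from this client.
  # `karpenter` is an assumption about what the EKS cluster actually runs.
  autoscaler: karpenter
  provision_timeout: 900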

Benefits & design proposal

This could also have other benefits, such as specifying a different service account, custom_metadata, etc. for each k8s context: basically all of the kubernetes options offered in config.yaml.

A proposal could be to have something like:

kubernetes:
  gke_context:
    autoscaler: gke
    provision_timeout: 600
    remote_identity: my-gke-service-account
  eks_context:
    autoscaler: xxx
    remote_identity: my-eks-service-account
  on_prem_context:
    autoscaler: none
    provision_timeout: 300
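
To make the proposal concrete, here is a minimal sketch of how the API server could resolve the effective settings for a given context: per-context sections override the shared kubernetes-level defaults. All names here (resolve_context_config, the config layout) are hypothetical, not existing SkyPilot API.

import copy
from typing import Any, Dict

# Keys that describe the `kubernetes` cloud as a whole; any key matching an
# entry of `allowed_contexts` is treated as a per-context section.
GLOBAL_KEYS = {'allowed_contexts'}

def resolve_context_config(kubernetes_config: Dict[str, Any],
                           context: str) -> Dict[str, Any]:
    """Hypothetical resolver: shared defaults first, then per-context overrides."""
    contexts = set(kubernetes_config.get('allowed_contexts', []))
    # Shared defaults: plain keys such as `provision_timeout` or `autoscaler`.
    effective = {
        key: copy.deepcopy(value)
        for key, value in kubernetes_config.items()
        if key not in GLOBAL_KEYS and key not in contexts
    }
    # A per-context section (e.g. `eks_context:`) wins over the defaults.
    effective.update(copy.deepcopy(kubernetes_config.get(context, {})))
    return effective

config = {
    'allowed_contexts': ['gke_context', 'eks_context', 'on_prem_context'],
    'provision_timeout': 900,
    'autoscaler': 'gke',
    'eks_context': {'autoscaler': 'karpenter'},
}
print(resolve_context_config(config, 'eks_context'))
# -> {'provision_timeout': 900, 'autoscaler': 'karpenter'}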

Please let me know if I missed something or if there's another way to do that.
Thanks a lot!
