Description
Using replicas for repetitive pod configuration in kubernetes_scheduler has been removed in f6907e8. The rationale is here.
Unfortunately, for a large setup we can easily breach the default etcd request size limit of 1.5 MiB, which fails with: etcdserver: request is too large
It's not always possible to bump max-request-bytes, e.g. on AWS EKS.
Currently both job-specific environment variables and TorchX's own environment variables contribute to breaching this limit.
We would like to find a way to make replicas work to minimize job manifest size.
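To illustrate the size difference, here is a rough sketch that compares a manifest with one task entry per pod against a manifest with a single replicated task. The task layout, image name, variable names, and counts are all made up for illustration and are not the actual schema that kubernetes_scheduler emits; only the 1.5 MiB figure corresponds to etcd's default max-request-bytes.

```python
import json

# etcd's default --max-request-bytes is 1.5 MiB.
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1536 * 1024


def make_task(name: str, env: dict) -> dict:
    """Build a hypothetical, simplified task entry (not the real TorchX schema)."""
    return {
        "name": name,
        "replicas": 1,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": name,
                        "image": "example.com/trainer:latest",  # placeholder image
                        "env": [{"name": k, "value": v} for k, v in env.items()],
                    }
                ]
            }
        },
    }


def manifest_bytes(tasks: list) -> int:
    """Size in bytes of the manifest serialized as compact JSON."""
    body = json.dumps({"spec": {"tasks": tasks}}, separators=(",", ":"))
    return len(body.encode("utf-8"))


# 40 environment variables shared by every replica of the role (made-up values).
shared_env = {f"JOB_SETTING_{i}": "x" * 64 for i in range(40)}
num_replicas = 512

# One task per pod: the shared env block is repeated for every replica.
per_pod = [make_task(f"trainer-{i}", shared_env) for i in range(num_replicas)]

# One task with a replica count: the shared env block appears only once.
replicated = [dict(make_task("trainer", shared_env), replicas=num_replicas)]

per_pod_size = manifest_bytes(per_pod)
replicated_size = manifest_bytes(replicated)

print(f"per-pod tasks: {per_pod_size} bytes, "
      f"over limit: {per_pod_size > ETCD_DEFAULT_MAX_REQUEST_BYTES}")
print(f"replicated:    {replicated_size} bytes, "
      f"over limit: {replicated_size > ETCD_DEFAULT_MAX_REQUEST_BYTES}")
```

With these made-up numbers the per-pod manifest grows linearly with the replica count and crosses the limit, while the replicated manifest stays roughly the size of a single task.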
Motivation/Background
Increase the maximum cluster size we can support with k8s
Detailed Proposal
For example, use a ConfigMap with per-node or per-role config, or the Downward API. This takes advantage of the fact that we have roles with many replicas that share a large chunk of their configuration.
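One possible shape for this, as a minimal sketch built from plain dictionaries: the shared environment lives in a single ConfigMap that the container pulls in with envFrom, and values that differ per pod are read through the Downward API. The helper names (shared_config_map, role_container), the ConfigMap name, and the image are hypothetical; envFrom/configMapRef and fieldRef are standard Kubernetes pod spec fields, but how this would plug into kubernetes_scheduler is an open question.

```python
def shared_config_map(name: str, env: dict) -> dict:
    """ConfigMap holding the environment shared by all replicas of a role."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(env),
    }


def role_container(role: str, image: str, config_map_name: str) -> dict:
    """Container spec that is identical for every replica of the role."""
    return {
        "name": role,
        "image": image,
        # Shared variables are referenced from the ConfigMap, not inlined per pod.
        "envFrom": [{"configMapRef": {"name": config_map_name}}],
        # Per-pod values are resolved by the kubelet through the Downward API,
        # so the template itself does not change between replicas.
        "env": [
            {
                "name": "POD_NAME",
                "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
            },
            {
                "name": "POD_IP",
                "valueFrom": {"fieldRef": {"fieldPath": "status.podIP"}},
            },
        ],
    }


# Hypothetical usage: one ConfigMap and one container template per role.
config_map = shared_config_map("trainer-env", {"JOB_SETTING_0": "value"})
container = role_container("trainer", "example.com/trainer:latest", "trainer-env")
```

Because the container spec no longer carries any per-replica literal values, a single template plus a replica count can describe the whole role.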
Alternatives
Avoid environment variables and long names anywhere in the configuration. Even so, the maximum job size we can support will on average be significantly smaller than when using replicas.