Use k8s volcano replicas to shrink job manifest size #1054

Open
@clumsy

Description

Support for using replicas for repetitive pod configuration in kubernetes_scheduler was removed in f6907e8.

The rationale is here

Unfortunately, for a large setup we can easily breach the default etcd request size limit of 1.5 MiB, which fails with: etcdserver: request is too large
It's not always possible to raise max-request-bytes, e.g. on AWS EKS.

Currently both job-specific environment variables and even TorchX's own environment variables contribute to breaching this limit.
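
To illustrate the scale of the problem, here is a rough sketch that compares the serialized size of a Volcano Job manifest with one task per pod against one task with replicas. The image name, the EXAMPLE_VAR_* variables, and the helper functions are hypothetical, and the JSON length is only an approximation of the request size that etcd sees.

```python
import json

# etcd's default --max-request-bytes value (1.5 MiB).
ETCD_DEFAULT_MAX_REQUEST_BYTES = 1_572_864


def make_pod_template(num_env_vars, value_len=64):
    """Build a pod template with a hypothetical set of environment variables."""
    env = [
        {"name": f"EXAMPLE_VAR_{i}", "value": "x" * value_len}
        for i in range(num_env_vars)
    ]
    container = {"name": "worker", "image": "example/image:latest", "env": env}
    return {"spec": {"containers": [container]}}


def expanded_tasks(num_pods, template):
    """One task per pod: the template is repeated num_pods times."""
    return [
        {"name": f"worker-{i}", "replicas": 1, "template": template}
        for i in range(num_pods)
    ]


def replicated_tasks(num_pods, template):
    """One task with replicas: the template appears once."""
    return [{"name": "worker", "replicas": num_pods, "template": template}]


def manifest_bytes(tasks):
    """Approximate the manifest size as the length of its JSON encoding."""
    job = {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": "example"},
        "spec": {"tasks": tasks},
    }
    return len(json.dumps(job).encode("utf-8"))


template = make_pod_template(num_env_vars=50)
expanded = manifest_bytes(expanded_tasks(1000, template))
replicated = manifest_bytes(replicated_tasks(1000, template))
print(f"expanded:   {expanded} bytes")
print(f"replicated: {replicated} bytes")
print(f"limit:      {ETCD_DEFAULT_MAX_REQUEST_BYTES} bytes")
```

Under these assumptions the expanded manifest grows linearly with the number of pods, while the replicated one stays roughly constant.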

We would like to find a way to make replicas work again so that we can minimize job manifest size.

Motivation/Background

Increase the maximum cluster size we can support with k8s.

Detailed Proposal

For example, use a ConfigMap with per-node/per-role config, or the Downward API. This makes use of the fact that we have roles with many replicas that share a huge chunk of their configuration.
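
A minimal sketch of what this could look like, written as plain Python dicts rather than the actual kubernetes_scheduler code: the shared environment for a role goes into one ConfigMap, the role becomes a single Volcano task with replicas, and each pod picks up its own identity through the Downward API. The function names, the POD_NAME variable, and the example values are assumptions for illustration only.

```python
def build_shared_config_map(name, shared_env):
    """Put the environment shared by all replicas of a role into one ConfigMap."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": dict(shared_env),
    }


def build_role_task(role_name, replicas, image, config_map_name):
    """Build a single Volcano task that covers every replica of a role."""
    container = {
        "name": role_name,
        "image": image,
        # Shared variables are loaded from the ConfigMap instead of being
        # inlined in the pod template.
        "envFrom": [{"configMapRef": {"name": config_map_name}}],
        # Per-pod values come from the Downward API, so the template is
        # identical for all replicas.
        "env": [
            {
                "name": "POD_NAME",  # hypothetical variable name
                "valueFrom": {"fieldRef": {"fieldPath": "metadata.name"}},
            }
        ],
    }
    return {
        "name": role_name,
        "replicas": replicas,
        "template": {
            "spec": {"containers": [container], "restartPolicy": "Never"}
        },
    }


config_map = build_shared_config_map(
    "trainer-env", {"EXAMPLE_SHARED_VAR": "value"}
)
task = build_role_task(
    "trainer", 512, "example/image:latest", config_map["metadata"]["name"]
)
```

One caveat with this approach is that a ConfigMap has its own size limit (1 MiB of data), so very large shared configuration would still need to be split or trimmed.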

Alternatives

Avoid environment variables and long names anywhere in the configuration. Even then, the maximum supported job size will on average be significantly smaller than when using replicas.

Additional context/links
