Want mechanism to ensure that a sled can be vacated of workloads #2944

@gjcolombo

Upgrading a sled requires it to be vacated of instances and services. This requires that, at any time,

  • A rack must always have enough aggregate free capacity to hold the contents of any individual sled, and
  • For any sled S, it must always be possible to arrange the workloads on other sleds in the rack so that there is an acceptable destination for each workload running on S. (In other words, the available space can't become fragmented such that there's a sled's worth of space in aggregate, but there's no way to gather the free space onto a single sled to allow a large workload to land there.)

(All this applies equally to multi-rack deployments; the important thing is that if a workload can land in some domain, there must be enough possibly-contiguous capacity to empty a sled in that domain.)
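The two properties can be sketched as checks over a toy model of a rack. This is only an illustration: the `Sled` class, the single scalar resource dimension, and both function names are assumptions, not anything in Nexus. The second check uses first-fit decreasing and is deliberately conservative, since it doesn't try rearranging workloads on the other sleds.

```python
from dataclasses import dataclass, field


@dataclass
class Sled:
    # Hypothetical model: one scalar resource dimension per sled.
    name: str
    capacity: int
    workloads: list[int] = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.capacity - sum(self.workloads)


def has_aggregate_capacity(sleds: list[Sled], target: Sled) -> bool:
    """Property 1: the other sleds have enough total free space."""
    others_free = sum(s.free for s in sleds if s is not target)
    return others_free >= sum(target.workloads)


def can_vacate_first_fit(sleds: list[Sled], target: Sled) -> bool:
    """Conservative check for property 2 using first-fit decreasing.

    True means a placement exists without moving anything else. False is
    inconclusive: rearranging other sleds' workloads might still succeed.
    """
    free = [s.free for s in sleds if s is not target]
    for size in sorted(target.workloads, reverse=True):
        for i, space in enumerate(free):
            if space >= size:
                free[i] -= size
                break
        else:
            # No remaining sled can take this workload as things stand.
            return False
    return True


# Fragmentation example: 8 units are free in aggregate on "b" and "c",
# but neither sled alone can accept the 8-unit workload running on "a".
rack = [
    Sled("a", 16, [8]),
    Sled("b", 16, [12]),
    Sled("c", 16, [12]),
]
```

In this example rack, `has_aggregate_capacity(rack, rack[0])` holds while `can_vacate_first_fit(rack, rack[0])` does not, which is exactly the fragmented case described in the second bullet.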

If these properties don't hold, users or operators will have to stop workloads to take a sled out of service. We would prefer to avoid this, at least in cases where all sleds are operating normally and nothing has failed.

Instance provisioning doesn't currently guarantee either property. It only ensures that an instance lands on a sled that has space available for it, without taking global usage into account. Even if Nexus did track domain-wide resource usage, we would have to discover and implement a bin-packing scheme that preserves both properties, in particular the one about avoiding fragmentation. That seems difficult, though maybe this is a solved problem whose solution I don't know.
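To make the gap concrete, here is a toy sketch of local-only placement breaking the aggregate-capacity property. It is not Nexus's provisioning code; `CAPACITY`, `place_local`, and `vacatable_in_aggregate` are made-up names, and resources are collapsed to a single number.

```python
from typing import Optional

CAPACITY = 16  # hypothetical per-sled capacity, in arbitrary units


def place_local(usage: dict[str, int], size: int) -> Optional[str]:
    """Pick the first sled with room, ignoring rack-wide usage."""
    for name, used in usage.items():
        if CAPACITY - used >= size:
            usage[name] = used + size
            return name
    return None


def vacatable_in_aggregate(usage: dict[str, int], name: str) -> bool:
    """Is there enough total free space elsewhere to absorb this sled?"""
    free_elsewhere = sum(CAPACITY - u for n, u in usage.items() if n != name)
    return free_elsewhere >= usage[name]


usage = {"a": 10, "b": 10, "c": 10}
before = vacatable_in_aggregate(usage, "a")  # 12 free elsewhere vs. 10 used
chosen = place_local(usage, 6)               # fits on "a" locally
after = vacatable_in_aggregate(usage, "a")   # 12 free elsewhere vs. 16 used
```

The placement succeeds because sled "a" has room, yet afterwards the rest of the rack can no longer absorb the contents of "a".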

In the short term, it may be simplest to address this by allowing a sled to be put into a state where it's only eligible to receive migratory workloads and suggesting that operators use this mechanism to keep a sled in reserve for updates.
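A minimal sketch of what that short-term mechanism might look like in a placement filter. The `migration_only` flag and the `pick_sled` function are hypothetical names for illustration; no such field or function is claimed to exist today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sled:
    name: str
    free: int
    # Hypothetical flag: when set, only migrating workloads may land here.
    migration_only: bool = False


def pick_sled(sleds: list[Sled], size: int, is_migration: bool) -> Optional[Sled]:
    """Return the first eligible sled with enough free space, if any."""
    for sled in sleds:
        if sled.migration_only and not is_migration:
            # Reserved sleds are skipped for fresh provisioning requests.
            continue
        if sled.free >= size:
            return sled
    return None


sleds = [Sled("reserve", 16, migration_only=True), Sled("s1", 4)]
```

With this setup, a new 8-unit instance finds no home (`pick_sled(sleds, 8, False)` returns `None`), while the same workload arriving by migration can land on the reserved sled.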

Metadata

Assignees: No one assigned

Labels: customer (for any bug reports or feature requests tied to customer requests)
