Description
There are several issues related to "who is using CPU / RAM / Disk space" on a sled, and to ensuring this usage is accounted for properly. This issue attempts to summarize them at a high level and track them in one place.
If we do not account for these resources, anything consuming them in an unbounded fashion can starve other consumers of that resource on the sled, which could result in unexpected failures or performance degradation.
Definitions
Resources
Resources are finite quantities that exist on Sleds and are necessary for zones to operate. They are used by zones, but may be used by other non-zone entities as well.
These include:
- CPUs
- RAM (both reservoir and non-reservoir usage)
- Disk Space on datasets (used by both durable datasets and transient zone filesystems). The focus of this issue is on "disk space usage within U.2s specifically".
Resource Consumers
Consumers are entities on Sleds that utilize resources, and draw from the "shared pool" of them.
These include:
- The Host OS + Global Zone
- All control-plane zones allocated from the blueprint (Nexus, Crucible, CockroachDB, DNS, NTP, etc.)
- All control-plane zones that are allocated from the sled (Switch Zone)
- All control-plane zones that are allocated on-demand (Propolis, Probe Zones)
Why Use This Categorization
If you pick a consumer (e.g. "Switch Zone") and a resource (e.g., "RAM"), and there exists no upper-bound on usage, then it is possible for that consumer to negatively impact other occupants on that sled by preventing them from using resources that they expect to exist.
To definitively resolve this issue, we must define "buckets" from which consumers can access resources. One such example: For disks, the Debug dataset has a reservation and a quota. Although we must account for space usage within this dataset, it is not possible for usage within this dataset to cause problems in other datasets. Similarly, usage of space by other datasets cannot starve the debug dataset of space.
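The "bucket" idea can be sketched as a small check over a pool's datasets. This is only an illustrative model: the `Dataset` type, the `isolation_problems` helper, the dataset names, and all sizes are hypothetical and do not come from omicron. A dataset is treated as isolated when it has a quota (it cannot crowd out neighbors) and a reservation covering that quota (neighbors cannot starve it).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

GIB = 1 << 30


@dataclass(frozen=True)
class Dataset:
    """Hypothetical model of one dataset on a zpool."""
    name: str
    quota: Optional[int]   # upper bound in bytes; None means unbounded
    reservation: int = 0   # guaranteed bytes


def isolation_problems(pool_size: int, datasets: list[Dataset]) -> list[str]:
    """Return human-readable reasons why the datasets are not isolated buckets."""
    problems = []
    reserved = sum(d.reservation for d in datasets)
    if reserved > pool_size:
        problems.append(f"pool: reservations ({reserved}) exceed pool size ({pool_size})")
    for d in datasets:
        if d.quota is None:
            problems.append(f"{d.name}: no quota, can consume space others expect")
        elif d.reservation < d.quota:
            problems.append(f"{d.name}: reservation below quota, can be starved")
    return problems


# Hypothetical pool: one fully bounded dataset, one unbounded zone filesystem.
pool_size = 1000 * GIB
bounded = Dataset("debug", quota=100 * GIB, reservation=100 * GIB)
unbounded = Dataset("zone-root", quota=None)

problems = isolation_problems(pool_size, [bounded, unbounded])
for p in problems:
    print(p)
```

Only the unbounded dataset is reported here; the dataset with a matching quota and reservation neither threatens nor depends on its neighbors.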
Tools to limit resource usage
illumos gives us tools for providing upper bounds on the usage of resources by consumers:
- CPU / RAM usage can be controlled by the `capped-cpu` and `capped-memory` properties of zonecfg
- Disk space can be controlled on a per-dataset basis by using quotas and reservations
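As a rough illustration of these tools, the sketch below only builds the underlying `zonecfg` and `zfs` command lines; it does not run them, and it is not how sled-agent is known to apply limits. The zone name, dataset name, and all limit values are hypothetical placeholders, as are the helper names `zonecfg_caps` and `zfs_bounds`.

```python
import shlex


def zonecfg_caps(zone: str, ncpus: float, physical: str) -> list[str]:
    """Build a zonecfg invocation that adds CPU and physical-memory caps to a zone."""
    script = (
        f"add capped-cpu; set ncpus={ncpus}; end; "
        f"add capped-memory; set physical={physical}; end"
    )
    return ["zonecfg", "-z", zone, script]


def zfs_bounds(dataset: str, quota: str, reservation: str) -> list[list[str]]:
    """Build zfs invocations that set a quota and a reservation on one dataset."""
    return [
        ["zfs", "set", f"quota={quota}", dataset],
        ["zfs", "set", f"reservation={reservation}", dataset],
    ]


# Hypothetical zone and dataset names, with made-up limits.
commands = [zonecfg_caps("oxz_example", 2, "4g")]
commands += zfs_bounds("oxp_example/crypt/debug", "100G", "100G")

for argv in commands:
    print(shlex.join(argv))
```

Building argument vectors rather than shell strings keeps the example free of quoting concerns; `shlex.join` is used only for display.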
Resource Limits by Consumer
- Host OS / Global Zone
- Disk Usage: The host OS/GZ generally use the rpool ramdisk, as well as M.2s. They do not really have any unbounded space usage on U.2 disks.
- CPU + RAM usage: Unbounded. It may be difficult to make an explicit bound here, and easier to "put a capacity on all other zones" instead. Whatever pool we allocate from, the remainder would then be dedicated to the host OS.
- Blueprint-Controlled Zones (see: ZoneKind and DatasetKind)
- Boundary + Internal NTP
- Disk Usage from transient filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Clickhouse
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Clickhouse Keeper
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Clickhouse Server
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Unbounded
- CPU/RAM usage: Unbounded
- CockroachDB
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Crucible
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Nexus considers and updates a "size used" column within each Crucible dataset, and tries to make this usage less than the size of the entire zpool. However, this size is considered relative to the entire zpool (where other datasets may exist!), and nothing accounts for the metadata used by Crucible itself (in addition to the user-requested storage). Furthermore, there is no bound set on the durable dataset itself. However, Crucible does have an internal dataset called `regions`, where it does apply quotas and reservations internally.
- CPU/RAM usage: Unbounded
- Crucible Pantry
- Disk Usage from transient filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Internal + External DNS
- Disk Usage from transient filesystem: Unbounded
- Disk Usage from durable filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Nexus
- Disk Usage from transient filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Oximeter
- Disk Usage from transient filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Sled-Agent Controlled Zones
- Switch Zone
- RAM Usage from transient filesystem: Unbounded. Take note! This is one of the few zones to occupy the ramdisk, so the transient filesystem is a consumer of RAM, not U.2 space.
- CPU/RAM usage: Unbounded
- Dynamically-Provisioned Zones
- Propolis (Note that the Propolis zone must consume resources in addition to the bounds on the underlying instance)
- Disk Usage from transient filesystem: Unbounded
- CPU usage: We supply a value of `vcpus` for the instance, but impose no bound on the Propolis Zone.
- Reservoir RAM usage: We give values for `memory` which is allocated within Nexus, and used by Propolis to cooperatively use a portion of an instance-provisioned "memory reservoir". Reservoir capacity is cooperatively shared by the propolis zones (the sled agent is trusting zones to only consume as much reservoir as they were provided - a "greedy" propolis could starve other instances, although this seems unlikely).
- Non-Reservoir RAM usage: We do not consider this amount (we pretend this usage is zero in the allocation, but that is not true) and then don't bound it.
- Probe Zones
- Disk Usage from transient filesystem: Unbounded
- CPU/RAM usage: Unbounded
- Other Datasets (See: DatasetKind)
- Debug Dataset
- Disk Usage: It's bounded to a maximum of 100 GiB, but has no reservation. Should it have a reservation?
- Update
- Disk Usage: Unbounded? But I cannot tell where this is being allocated. It may be possible to delete this.
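The Crucible durable-dataset entry above describes an accounting gap: usage is compared against the whole zpool, while other datasets and Crucible's own metadata also occupy that pool. The arithmetic can be sketched as follows. This is a simplified model of that description, not Nexus's actual allocation code; the function names, the 10% metadata overhead, and every size are hypothetical.

```python
GIB = 1 << 30


def fits_per_size_used(zpool_size: int, size_used: int, request: int) -> bool:
    """Check as described above: requested bytes only, against the whole zpool."""
    return size_used + request <= zpool_size


def fits_with_full_accounting(
    zpool_size: int,
    size_used: int,
    request: int,
    other_datasets_used: int,
    metadata_overhead_percent: int,
) -> bool:
    """Same check, but also counting other datasets and a metadata overhead."""
    crucible_bytes = (size_used + request) * (100 + metadata_overhead_percent) // 100
    return crucible_bytes + other_datasets_used <= zpool_size


# Hypothetical numbers for one zpool.
zpool_size = 1000 * GIB
size_used = 800 * GIB            # sum of user-requested region sizes so far
request = 150 * GIB              # new region being allocated
other_datasets_used = 100 * GIB  # zone filesystems, debug dataset, etc.

optimistic = fits_per_size_used(zpool_size, size_used, request)
realistic = fits_with_full_accounting(
    zpool_size, size_used, request, other_datasets_used, metadata_overhead_percent=10
)
print(f"size-used check says it fits: {optimistic}")
print(f"full accounting says it fits: {realistic}")
```

With these made-up numbers the first check accepts an allocation that the fuller accounting rejects, which is the overprovisioning scenario the issues below track.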
Issues
(Note that the "consumer" here does not necessarily imply blame - if a neighboring service has consumed excessive resources, a consumer may fail prematurely)
- Resource: Disk, Consumer = Crucible (via Nexus): Nexus overprovisions disks with control plane zones on them #7225
- Resource: Disk: want better runtime isolation of sled components from ENOSPC #7227
- Resource: Disk, Consumer = All Zones: "datasets ensure" API in sled agent should remove unmatched datasets (eventually) #6177
- Resource: Disk, Consumer = Crucible: crucible zone running out of space earlier than the control plane "thinks" #4234
- Resource: Disk, Consumer = CockroachDB: `saga` + `saga_node_event` tables are never garbage collected, but they should be #6635
- Resource: Disk, Consumer = Crucible, via Propolis: 500 error during instance creation due to space exhaustion #7294
- Resource: CPU & RAM, Consumer = "everything except propolis": Track provisioned CPU and RAM usage for services #2806
- Resource: CPU & RAM: Tracking Issue for Managing CPU & RAM Capacity #2648 (this is an older issue pre-dating blueprints, probably needs updating)
- Resource: Disk: Tracking Issue for Managing Low Disk Space Conditions #1700 (this is an older issue pre-dating blueprints, probably needs updating)
- Visibility: Visibility into Capacity, utilization, and resource limits tracking issue #4751
- Visibility: Add a better error for when a user tries to provision past their quota #4680
- Resource: Disk, Operator View: Current Capacity of Storage #2036