Description
Proposal
Add a configuration option to limit the number of allocations that can be started concurrently on a Nomad client. "Starting" here covers pulling container images and launching the container; it does not include waiting for health checks to pass. Once a container is running, even if its health checks are still pending, the next allocation should be allowed to start.
Use-cases
When a node is undrained or restarted, Nomad may attempt to start a large number of allocations at once. This can overwhelm the node's disk and network subsystems, especially if many container images need to be pulled simultaneously. Limiting the number of concurrently starting allocations would help to:
- Prevent resource exhaustion (disk, network, CPU) during mass allocation startups.
- Smooth out the load on the node and the container registry.
- Reduce the risk of failed allocations due to resource contention.
- Provide more predictable and stable node recovery after undrain or restart events.
Attempted Solutions
- Resource limits (CPU, memory, disk) per allocation do not prevent Nomad from starting many allocations at once, as long as the total requested resources fit within the node's capacity.
- There is no existing configuration in Nomad or the Docker driver to limit the number of allocations being started concurrently.
- Docker itself limits concurrent layer downloads per pull (the daemon's `max-concurrent-downloads` setting), but it does not limit the number of images being pulled at once.
- Workarounds such as scripting undrain events or using external controllers are fragile and not integrated with Nomad's scheduling logic.
A built-in, configurable limit would provide a robust and user-friendly solution to this problem.