[JENKINS-76171] fix overprovisioning of ec2 agents when rapidly scheduling nodes #1149

mikecirioli · 2025-10-03T15:07:25Z

Fix Severe Over-Provisioning in NoDelayProvisionerStrategy

Problem

The EC2 plugin was massively over-provisioning nodes when using NoDelayProvisionerStrategy, particularly with maxTotalUses=1. Testing showed that 100 builds resulted in 500-600 nodes being provisioned instead of the expected 100 (5-6x over-provisioning).

Root Causes

1. Missing capacity accounting - Two critical node states were not counted:
   - Offline nodes (gap between PlannedNode completion and connection start)
   - Busy executors (online nodes executing jobs but not idle)
2. Race condition - Multiple threads called MinimumInstanceChecker.checkForMinimumInstances() simultaneously without synchronization
3. Logic bug - MinimumInstanceChecker provisioned for queued builds even when minimumNumberOfSpareInstances=0
4. Connection timing - Only stopOnTerminate instances initiated a connection when reaching the RUNNING state

Solution

Enhanced Capacity Accounting (NoDelayProvisionerStrategy)

Added countProvisionedButNotExecutingNodes() to count offline EC2 nodes
Added countBusyExecutors() to count online busy executors
Both counts now included in availableCapacity calculation
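
The accounting change can be illustrated with a minimal sketch. This is not the plugin's code: the class `CapacitySketch`, the method `excessWorkload`, and all parameter names are hypothetical, and the existing terms (idle, connecting, planned) are an assumption about what the strategy already counted. Only the idea of adding offline nodes and busy executors to the available capacity comes from this PR.

```java
// Hypothetical sketch; names are illustrative and not part of the EC2 plugin API.
final class CapacitySketch {

    /**
     * Returns how many more executors would still need to be provisioned,
     * never a negative number.
     */
    static int excessWorkload(int queuedBuilds,
                              int idleExecutors,
                              int connectingExecutors,
                              int plannedExecutors,
                              int offlineProvisionedNodes,
                              int busyExecutors) {
        // Assumed pre-existing terms: idle + connecting + planned.
        // Terms added by this PR: offline (provisioned but not yet connected) + busy.
        int availableCapacity = idleExecutors
                + connectingExecutors
                + plannedExecutors
                + offlineProvisionedNodes
                + busyExecutors;
        return Math.max(0, queuedBuilds - availableCapacity);
    }
}
```

For example, `CapacitySketch.excessWorkload(100, 0, 0, 20, 50, 30)` returns 0, whereas leaving out the offline and busy terms would have reported a shortfall of 80 for the same state.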

Race Condition Prevention (MinimumInstanceChecker)

Added synchronized keyword to checkForMinimumInstances()
Added early-exit optimization for minimumNumberOfInstances=0 configurations
Prevents concurrent threads from duplicating provisioning decisions
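
A minimal sketch of why the lock matters, under stated assumptions: `MinimumCheckSketch`, its `runningInstances` field, and the `provisionConcurrently` helper are hypothetical, and the sketch uses an instance method purely so it can be exercised in isolation. It is not the body of the real MinimumInstanceChecker.checkForMinimumInstances().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the plugin's MinimumInstanceChecker.
final class MinimumCheckSketch {
    private int runningInstances = 0; // guarded by the monitor of 'this'

    /** Returns how many instances this call decided to provision. */
    synchronized int checkForMinimumInstances(int minimumNumberOfInstances) {
        if (minimumNumberOfInstances == 0) {
            return 0; // early exit: no minimum to maintain
        }
        int missing = Math.max(0, minimumNumberOfInstances - runningInstances);
        runningInstances += missing; // record the decision before releasing the lock
        return missing;
    }

    /** Calls the check from several threads and returns the total number provisioned. */
    static int provisionConcurrently(int threads, int minimum) {
        MinimumCheckSketch checker = new MinimumCheckSketch();
        AtomicInteger total = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(
                    () -> total.addAndGet(checker.checkForMinimumInstances(minimum)));
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }
}
```

Because the read of `runningInstances` and the update happen under one lock, only the first caller sees a shortfall; without `synchronized`, several threads could each observe zero running instances and each provision the full minimum.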

Logic Fix (MinimumInstanceChecker)

Only provision spare capacity when minimumNumberOfSpareInstances > 0
Respects user configuration to disable spare capacity provisioning
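
The guard can be sketched as follows. The class `SpareCapacitySketch`, the method name, and the `idleInstances` parameter are hypothetical, and the formula used when spare capacity is enabled is an assumption made for illustration; only the `minimumNumberOfSpareInstances > 0` condition is taken from this PR.

```java
// Hypothetical sketch; the non-zero branch is an assumed formula, not the plugin's.
final class SpareCapacitySketch {

    /** Returns how many spare instances to provision for the given configuration. */
    static int spareInstancesToProvision(int minimumNumberOfSpareInstances, int idleInstances) {
        if (minimumNumberOfSpareInstances <= 0) {
            return 0; // spare capacity disabled: leave queued builds to the provisioner strategy
        }
        // Assumption: top idle capacity up to the configured spare level.
        return Math.max(0, minimumNumberOfSpareInstances - idleInstances);
    }
}
```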

Connection Timing (EC2Cloud)

Always call connect() when instance reaches RUNNING state
Reduces offline gap for all instances, not just stopOnTerminate
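
A before/after sketch of the condition change, using stand-in types. `ConnectTimingSketch`, `State`, `FakeComputer`, and both handler methods are hypothetical and do not mirror the real EC2Cloud code; they only illustrate dropping the stopOnTerminate restriction.

```java
// Hypothetical sketch; stand-in types only, not the EC2Cloud implementation.
final class ConnectTimingSketch {
    enum State { PENDING, RUNNING }

    /** Stand-in for an agent's computer; it only counts connect() calls. */
    static final class FakeComputer {
        int connectCalls = 0;

        void connect() {
            connectCalls++;
        }
    }

    /** Behavior described as the old one: only stopOnTerminate instances connect eagerly. */
    static void onStateBefore(State state, boolean stopOnTerminate, FakeComputer computer) {
        if (state == State.RUNNING && stopOnTerminate) {
            computer.connect();
        }
    }

    /** Behavior described by this PR: every instance connects once it is RUNNING. */
    static void onStateAfter(State state, boolean stopOnTerminate, FakeComputer computer) {
        // stopOnTerminate is kept in the signature for comparison but no longer consulted.
        if (state == State.RUNNING) {
            computer.connect();
        }
    }
}
```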

Testing done

I've tested this extensively (pre- and post-fix) using an actual EC2 cloud, as described in the linked Jira ticket. With the fix, I am able to consistently get a 1:1 build:node ratio.

Submitter checklist

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests that demonstrate the feature works or the issue is fixed

src/main/java/hudson/plugins/ec2/EC2Cloud.java

src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java

jglick · 2025-10-03T15:20:08Z

src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java

+                offlineNodes++;
+            }
+
+            // Only count nodes that are OFFLINE (not connecting, not online)


What about spare instances? Are these going to be counted? Do we want them to be?

to be tested

Good question. The current implementation counts all offline EC2 nodes matching the label, regardless of why they were provisioned. I think that if you have spare capacity configured, you would eventually reach a steady state (if the builds take long enough) in which your "spare" capacity reappears.

I am not sure what the conclusion was here. Was this scenario tested?

I tested this scenario: the "spare" instances are used, and the total provisioning ends up being the number of jobs plus the spare capacity. AFAIU, this matches the original expected behavior?

src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java

src/main/java/hudson/plugins/ec2/util/MinimumInstanceChecker.java

jglick

Do not see anything wrong in main source changes. Did not really look at test.

mikecirioli · 2025-10-05T19:05:37Z

> Do not see anything wrong in main source changes. Did not really look at test.

I am working on some additional tests and have a few manual scenarios I plan to run on Monday. I'll add the results of the manual testing to the PR description.

apuig

The proposed change makes sense: in-flight instance deployments are now handled. I also acknowledge the need for synchronization.

The tests are appropriate and complete.
I manually verified that countProvisionedButNotExecutingNodes includes instances in the provisioning phase by deploying an instance and observing the method's output over time.

Great job !!! 🚀

src/test/java/hudson/plugins/ec2/NoDelayProvisionerStrategyTest.java

src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java

Co-authored-by: Albert Puig <[email protected]>

mikecirioli · 2025-10-10T11:39:23Z

@res0nance is anything else needed before this can be merged?

initial impl fixing overprovisioning

c457ff9