Skip to content

Conversation

@mikecirioli
Copy link
Contributor

Fix Severe Over-Provisioning in NoDelayProvisionerStrategy

Problem

The EC2 plugin was massively over-provisioning nodes when using NoDelayProvisionerStrategy, particularly with maxTotalUses=1. Testing showed 100 builds resulted in 500-600 nodes being
provisioned instead of the expected 100 (5-6x over-provisioning).

Root Causes

  1. Missing capacity accounting - Two critical node states were not counted:
    - Offline nodes (gap between PlannedNode completion and connection start)
    - Busy executors (online nodes executing jobs but not idle)
  2. Race condition - Multiple threads calling MinimumInstanceChecker.checkForMinimumInstances() simultaneously without synchronization
  3. Logic bug - MinimumInstanceChecker provisioned for queued builds even when minimumNumberOfSpareInstances=0
  4. Connection timing - Only stopOnTerminate instances initiated connection when reaching RUNNING state

Solution

  1. Enhanced Capacity Accounting (NoDelayProvisionerStrategy)
  • Added countProvisionedButNotExecutingNodes() to count offline EC2 nodes
  • Added countBusyExecutors() to count online busy executors
  • Both counts now included in availableCapacity calculation
  1. Race Condition Prevention (MinimumInstanceChecker)
  • Added synchronized keyword to checkForMinimumInstances()
  • Added early-exit optimization for minimumNumberOfInstances=0 configurations
  • Prevents concurrent threads from duplicating provisioning decisions
  1. Logic Fix (MinimumInstanceChecker)
  • Only provision spare capacity when minimumNumberOfSpareInstances > 0
  • Respects user configuration to disable spare capacity provisioning
  1. Connection Timing (EC2Cloud)
  • Always call connect() when instance reaches RUNNING state
  • Reduces offline gap for all instances, not just stopOnTerminat

Testing done

I've tested this (pre and post fix) extensively using an actual ec2 cloud as described in the linked jira ticket. Using the fix, i am able to consistently get a 1:1 build:node ratio.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

offlineNodes++;
}

// Only count nodes that are OFFLINE (not connecting, not online)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about spare instances? Are these going to be counted? Do we want them to be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be tested

Copy link
Contributor Author

@mikecirioli mikecirioli Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, the current implementation counts all offline EC2 nodes matching the label, regardless of why they were provisioned. I think that if you have spare capacity configured you would eventually reach a steady state (if the builds take long enough) that you would see your "spare" capacity re-appear again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the conclusion was here. Was this scenario tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this scenario, the "spare" instances are used, and the total provisioning ends up being # of jobs + spare capacity. AFAIU, this matches the original expected behavior?

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not see anything wrong in main source changes. Did not really look at test.

@mikecirioli
Copy link
Contributor Author

Do not see anything wrong in main source changes. Did not really look at test.

I am working on some additional tests and have a few manual scenarios i plan to run on monday. I'll add the results of the manual testing to the pr desc.

Copy link
Contributor

@apuig apuig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed change makes sense: handling in-flight instance deployments is now addressed. Acknowledge the need for synchronization.

The tests are appropriate and complete.
I manually verified that countProvisionedButNotExecutingNodes includes instances in the provisioning phase by deploying an instance and observing the method's output over time.

Great job !!! 🚀

@mikecirioli
Copy link
Contributor Author

@res0nance is anything else needed before this can be merged?

@res0nance res0nance merged commit 95ca008 into jenkinsci:master Oct 10, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants