[JENKINS-76171] fix overprovisioning of ec2 agents when rapidly scheduling nodes #1149
src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java
        offlineNodes++;
    }

    // Only count nodes that are OFFLINE (not connecting, not online)
What about spare instances? Are these going to be counted? Do we want them to be?
to be tested
Good question. The current implementation counts all offline EC2 nodes matching the label, regardless of why they were provisioned. I think that if you have spare capacity configured, you would eventually reach a steady state (if the builds take long enough) in which your "spare" capacity re-appears.
I am not sure what the conclusion was here. Was this scenario tested?
I tested this scenario: the "spare" instances are used, and the total provisioning ends up being # of jobs + spare capacity. AFAIU, this matches the original expected behavior?
src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java
src/main/java/hudson/plugins/ec2/util/MinimumInstanceChecker.java
src/main/java/hudson/plugins/ec2/util/MinimumInstanceChecker.java
jglick
left a comment
Do not see anything wrong in main source changes. Did not really look at test.
I am working on some additional tests and have a few manual scenarios I plan to run on Monday. I'll add the results of the manual testing to the PR description.
apuig
left a comment
The proposed change makes sense: in-flight instance deployments are now handled. I acknowledge the need for synchronization.
The tests are appropriate and complete.
I manually verified that countProvisionedButNotExecutingNodes includes instances in the provisioning phase by deploying an instance and observing the method's output over time.
Great job !!! 🚀
src/test/java/hudson/plugins/ec2/NoDelayProvisionerStrategyTest.java
src/main/java/hudson/plugins/ec2/NoDelayProvisionerStrategy.java
Co-authored-by: Albert Puig <[email protected]>
@res0nance is anything else needed before this can be merged?
Fix Severe Over-Provisioning in NoDelayProvisionerStrategy
Problem
The EC2 plugin was massively over-provisioning nodes when using NoDelayProvisionerStrategy, particularly with maxTotalUses=1. Testing showed 100 builds resulted in 500-600 nodes being provisioned instead of the expected 100 (5-6x over-provisioning).
Root Causes
- Offline nodes (gap between PlannedNode completion and connection start)
- Busy executors (online nodes executing jobs but not idle)
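The first gap can be illustrated with a minimal, hypothetical sketch. It is not the plugin's code: `AgentState`, `countOffline`, and `excessWorkload` are names invented here, and the assumption is that capacity from connecting and idle online agents is already accounted for elsewhere, so only OFFLINE agents need to be added on top. The busy-executor case is not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model; none of these types exist in the EC2 plugin.
class OfflineGapSketch {
    /** Assumed lifecycle states of an already provisioned agent. */
    enum AgentState { OFFLINE, CONNECTING, ONLINE }

    /** Counts only OFFLINE agents (not connecting, not online). */
    static int countOffline(List<AgentState> agents) {
        int offlineNodes = 0;
        for (AgentState state : agents) {
            if (state == AgentState.OFFLINE) {
                offlineNodes++;
            }
        }
        return offlineNodes;
    }

    /**
     * Number of additional agents to request.
     * alreadyCountedCapacity is assumed to cover planned, connecting,
     * and idle online agents; offlineNodes closes the remaining gap.
     */
    static int excessWorkload(int queuedBuilds, int alreadyCountedCapacity, int offlineNodes) {
        return Math.max(0, queuedBuilds - alreadyCountedCapacity - offlineNodes);
    }

    public static void main(String[] args) {
        // 60 agents launched but not yet connecting, plus 10 connecting ones.
        List<AgentState> agents = new ArrayList<>(Collections.nCopies(60, AgentState.OFFLINE));
        agents.addAll(Collections.nCopies(10, AgentState.CONNECTING));

        int offline = countOffline(agents);
        // Ignoring the offline agents asks for 90 more; counting them asks for 30.
        System.out.println("ignoring offline: " + excessWorkload(100, 10, 0));
        System.out.println("counting offline: " + excessWorkload(100, 10, offline));
    }
}
```

If each provisioning round ignores agents that are still in the offline gap, the same queued builds are requested again on every round, which is one plausible way to arrive at a multiple of the expected node count.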
Solution
Testing done
I've tested this (pre and post fix) extensively using an actual EC2 cloud, as described in the linked Jira ticket. Using the fix, I am able to consistently get a 1:1 build:node ratio.
Submitter checklist