reconnect when agent is offline #1142
Conversation
| + ", will retry next check. Exception: " + e); | ||
| return CHECK_INTERVAL_MINUTES; | ||
| } | ||
| final long uptime; |
(ignore WS)
long currentTime = this.clock.millis();

if (currentTime > nextCheckAfter) {
    attemptReconnectIfOffline(c);
So after rebasing this PR to master, I discovered that the offline check does not happen if the EC2 agent is set to never be terminated, i.e. idleTerminationMinutes=0. I moved it here to guarantee that the offline check always runs.
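To make the ordering concrete, here is a minimal sketch of the idea. Only attemptReconnectIfOffline, nextCheckAfter, CHECK_INTERVAL_MINUTES, and the clock call are taken from the diff; internalCheck, EC2Computer, the idleTerminationMinutes field, and the way nextCheckAfter is advanced are assumptions for illustration, not the plugin's actual code:

```java
// Minimal sketch (not the actual implementation): run the offline check before
// the "never terminate" early return, so it also happens when
// idleTerminationMinutes == 0.
protected long internalCheck(EC2Computer c) {
    long currentTime = this.clock.millis();

    if (currentTime > nextCheckAfter) {
        attemptReconnectIfOffline(c);
        nextCheckAfter = currentTime
                + java.util.concurrent.TimeUnit.MINUTES.toMillis(CHECK_INTERVAL_MINUTES);
    }

    if (idleTerminationMinutes == 0) {
        // Agent is configured to never terminate; skip the idle handling below.
        return CHECK_INTERVAL_MINUTES;
    }

    // ... existing idle-termination handling continues here ...
    return CHECK_INTERVAL_MINUTES;
}
```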
> after rebasing this PR to master
Please just use git pull inside a PR branch and avoid force-pushing, which breaks incremental review in most cases. (Unlikely to matter for a PR with such a short diff, but a good habit.)
Bad choice of words. I meant after I clicked the update PR button. But yes, noted.
long currentTime = this.clock.millis();

if (currentTime > nextCheckAfter) {
    attemptReconnectIfOffline(c);
Do we want to do this even if DISABLED, in which case AFAICT the other behaviors of the strategy are a no-op?
I mean I can see it both ways. My opinion is that this is more along the lines of handling network instability versus what to do with a node if it is idle.
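For context, the question is essentially whether a guard like the following hypothetical one should sit in front of the reconnect attempt, so that DISABLED keeps the whole strategy a no-op (illustrative snippet only, not proposed code):

```java
// Hypothetical guard under discussion: skip reconnect attempts entirely when
// the retention strategy is globally disabled.
if (DISABLED) {
    return CHECK_INTERVAL_MINUTES;
}
attemptReconnectIfOffline(c);
```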
jglick left a comment:
OK AFAICT (untested)
mikecirioli left a comment:
LGTM
mikecirioli left a comment:
I have validated this fix as working by reproducing the original issue, updating the plugin, and testing again. After the update the agent is successfully reconnected and the build finishes correctly (from the perspective of the controller).
I guess up to @res0nance @fcojfernandez et al. to decide whether/when to merge?
I believe this broke the flow where the instance is kept in the Stopped state after the job finishes and the idle timeout has passed, which is controlled by the "Stop/Disconnect on Idle Timeout" checkbox setting. Created an issue: https://issues.jenkins.io/browse/JENKINS-76151
When an EC2 agent loses connection, the pipeline will detect this and terminate the job after 5 minutes. However, there is no retry logic, so in the case of an unstable connection a single lost connection means the pipeline will fail. Previously, the internal check of EC2RetentionStrategy only checked whether a node had gone idle; it never checked whether the node was offline first.
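As a rough illustration of the behavior described (not the actual code from this PR), the reconnect check can be pictured like this; isOffline(), isLaunchSupported(), and connect(boolean) are standard Jenkins Computer APIs, while the method body and LOGGER are assumptions:

```java
// Rough sketch only; the PR's real implementation may differ.
private void attemptReconnectIfOffline(EC2Computer c) {
    if (c.isOffline() && c.isLaunchSupported()) {
        LOGGER.info("Agent " + c.getName() + " is offline; attempting to reconnect");
        c.connect(false); // asynchronously (re)launches the agent connection
    }
}
```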
Testing done

Manual testing:
- Created a pipeline that called a long sleep on an EC2 agent.
- Once the EC2 agent was spun up, went to the script console and terminated the sshd connection (a sketch of this step follows the list).
- Waited for the pipeline to reconnect to the agent.
- Cancelled multiple times; reconnection always happened.
- Also killed the sshd connection just as the job was about to complete; the agent reconnected and the job was marked as completed.
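For reference, a hedged sketch of what that script-console step might look like; the agent name is a placeholder and the exact commands used during testing are not shown here (the script console runs Groovy, but Java-style syntax works there):

```java
// Hypothetical reproduction step, pasted into the Jenkins script console:
// forcibly close the agent's remoting channel to simulate the sshd connection
// dying mid-build. "my-ec2-agent" is a placeholder node name.
import hudson.model.Computer;
import hudson.remoting.Channel;
import jenkins.model.Jenkins;

Computer c = Jenkins.get().getComputer("my-ec2-agent");
if (c != null && c.getChannel() instanceof Channel) {
    try {
        ((Channel) c.getChannel()).close();
    } catch (java.io.IOException e) {
        e.printStackTrace();
    }
}
```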
Submitter checklist