reconnect when agent is offline #1142

car-roll · 2025-09-05T17:28:16Z

When an EC2 agent loses its connection, the pipeline detects this and terminates the job after 5 minutes. However, there is no retry logic, so with an unstable connection a single lost connection causes the pipeline to fail. Previously, the internal check in EC2RetentionStrategy only checked whether a node had gone idle. It never checked whether the node was offline first.
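
A minimal sketch of the kind of offline check described above, assuming a hypothetical `Agent` interface in place of the plugin's real computer type. The method name `attemptReconnectIfOffline` matches the one visible in the diff further down, but its body here is an illustration, not the plugin's implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReconnectSketch {

    /** Hypothetical stand-in for an agent's computer; not the plugin's real type. */
    public interface Agent {
        boolean isOffline();

        void connect();
    }

    /**
     * Attempts one reconnect if the agent is offline.
     * Returns true if a reconnect was attempted, false if the agent was already online.
     */
    public static boolean attemptReconnectIfOffline(Agent agent) {
        if (!agent.isOffline()) {
            return false; // online: nothing to do
        }
        agent.connect();
        return true;
    }

    /** In-memory agent used only to exercise the sketch. */
    public static final class FakeAgent implements Agent {
        private boolean offline;
        public final AtomicInteger connectCalls = new AtomicInteger();

        public FakeAgent(boolean offline) {
            this.offline = offline;
        }

        @Override
        public boolean isOffline() {
            return offline;
        }

        @Override
        public void connect() {
            connectCalls.incrementAndGet();
            offline = false; // assumption: the reconnect succeeds immediately
        }
    }
}
```

Calling `attemptReconnectIfOffline` on an agent that is already online is a no-op, so the check can run on every periodic pass without side effects.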

Testing done

Manual testing:
Created a pipeline that ran a long sleep on an EC2 agent.
Once the EC2 agent was spun up, went to the script console and terminated the sshd connection.
Waited for the pipeline to reconnect to the agent.
Cancelled multiple times; reconnection always happened.

Also killed the sshd connection just as the job was about to complete. The agent reconnects and marks the job as completed.

Submitter checklist

Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
Ensure that the pull request title represents the desired changelog entry
Please describe what you did
Link to relevant issues in GitHub or Jira
Link to relevant pull requests, esp. upstream and downstream changes
Ensure you have provided tests that demonstrate the feature works or the issue is fixed

jglick · 2025-09-05T17:49:28Z

src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java

-                        + ", will retry next check. Exception: " + e);
-                return CHECK_INTERVAL_MINUTES;
-            }
+        final long uptime;


(ignore WS)

car-roll · 2025-09-08T23:36:16Z

src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java

                long currentTime = this.clock.millis();

                if (currentTime > nextCheckAfter) {
+                    attemptReconnectIfOffline(c);


So after rebasing this PR to master, I discovered that the offline check does not happen if the EC2 agent is set to never be terminated, i.e. idleTerminationMinutes=0.
I moved it here to guarantee that the offline check always runs.
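
To illustrate why the placement matters, here is a hedged sketch of a periodic check in which the reconnect attempt runs before any idle-termination logic. Only `clock`, `nextCheckAfter`, `currentTime`, and `idleTerminationMinutes` come from this PR; the `Agent` interface, `CHECK_INTERVAL_MILLIS`, `idleSinceMillis`, and the rest of the control flow are assumptions made for the example.

```java
import java.time.Clock;
import java.util.concurrent.TimeUnit;

public class CheckOrderSketch {

    /** Hypothetical agent abstraction; not the plugin's real type. */
    public interface Agent {
        boolean isOffline();

        void connect();

        long idleSinceMillis();

        void terminate();
    }

    // Assumed interval between checks, chosen only for this example.
    static final long CHECK_INTERVAL_MILLIS = TimeUnit.MINUTES.toMillis(1);

    private final Clock clock;
    private final int idleTerminationMinutes;
    private long nextCheckAfter = 0;

    public CheckOrderSketch(Clock clock, int idleTerminationMinutes) {
        this.clock = clock;
        this.idleTerminationMinutes = idleTerminationMinutes;
    }

    public void check(Agent agent) {
        long currentTime = this.clock.millis();
        if (currentTime <= nextCheckAfter) {
            return; // too soon since the last check
        }
        nextCheckAfter = currentTime + CHECK_INTERVAL_MILLIS;

        // Reconnect first, so the early return below cannot skip it.
        if (agent.isOffline()) {
            agent.connect();
        }

        // idleTerminationMinutes == 0 means "never terminate": skip idle handling.
        if (idleTerminationMinutes == 0) {
            return;
        }

        long idleMillis = currentTime - agent.idleSinceMillis();
        if (idleMillis >= TimeUnit.MINUTES.toMillis(idleTerminationMinutes)) {
            agent.terminate();
        }
    }

    /** In-memory agent used only to exercise the sketch. */
    public static final class FakeAgent implements Agent {
        private boolean offline;
        private final long idleSince;
        public int connectCalls;
        public boolean terminated;

        public FakeAgent(boolean offline, long idleSince) {
            this.offline = offline;
            this.idleSince = idleSince;
        }

        @Override
        public boolean isOffline() {
            return offline;
        }

        @Override
        public void connect() {
            connectCalls++;
            offline = false; // assumption: the reconnect succeeds immediately
        }

        @Override
        public long idleSinceMillis() {
            return idleSince;
        }

        @Override
        public void terminate() {
            terminated = true;
        }
    }
}
```

If the reconnect call sat after the `idleTerminationMinutes == 0` return instead, an agent configured never to terminate would never be reconnected, which is the situation this comment describes.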

after rebasing this PR to master

Please just use git pull inside a PR branch and avoid force-pushing, which breaks incremental review in most cases. (Unlikely to matter for a PR with such a short diff, but a good habit.)

Bad choice of words. I meant after I clicked the Update PR button. But yes, noted.

jglick · 2025-09-09T17:51:42Z

src/main/java/hudson/plugins/ec2/EC2RetentionStrategy.java

                long currentTime = this.clock.millis();

                if (currentTime > nextCheckAfter) {
+                    attemptReconnectIfOffline(c);


Do we want to do this even if DISABLED, in which case AFAICT the other behaviors of the strategy are a no-op?

I mean I can see it both ways. My opinion is that this is more along the lines of handling network instability versus what to do with a node if it is idle.

jglick

OK AFAICT (untested)

mikecirioli

LGTM

mikecirioli

I have validated this fix as working by reproducing the original issue, updating the plugin, and testing again. After the update the agent is successfully reconnected and the build finishes correctly (from the perspective of the controller).

jglick · 2025-09-16T17:14:52Z

I guess up to @res0nance @fcojfernandez et al. to decide whether/when to merge? bug label recommended.

Shohou · 2025-09-26T13:03:47Z

I believe this broke the flow where an instance is kept in the Stopped state after the job has finished and the idle timeout has passed, which is controlled by the "Stop/Disconnect on Idle Timeout" checkbox setting. I created an issue: https://issues.jenkins.io/browse/JENKINS-76151

reconnect when agent is offline

e669f21

jglick reviewed Sep 5, 2025

car-roll added 2 commits September 5, 2025 12:47

address comments

ea20052

Merge branch 'master' into reconnect

d6a0e5b

car-roll marked this pull request as ready for review September 8, 2025 19:11

jglick approved these changes Sep 8, 2025

move offline check

4dc8ed2

car-roll force-pushed the reconnect branch from 707d4cb to 4dc8ed2 Compare September 8, 2025 23:34

car-roll commented Sep 8, 2025

jglick reviewed Sep 9, 2025

jglick approved these changes Sep 9, 2025

mikecirioli approved these changes Sep 10, 2025

amuniz approved these changes Sep 11, 2025

mikecirioli approved these changes Sep 15, 2025

res0nance added the bugfix label Sep 17, 2025

Merge branch 'master' into reconnect

de5c10e

res0nance approved these changes Sep 17, 2025

res0nance merged commit 0a11fb7 into jenkinsci:master Sep 17, 2025
17 checks passed

car-roll deleted the reconnect branch September 23, 2025 18:05

This was referenced Oct 10, 2025

[JENKINS-76151] - Fix EC2 Continuous attempts to connect to stopped instances #1151

Merged

[JENKINS-76200] - Fix Stopped agents not stating on v2039 #1155

Merged

jenkins-infra-bot mentioned this pull request Oct 14, 2025

[JENKINS-76151] Continuous attempts to connect to the stopped instance #1979

Open
