Conversation

Contributor

@car-roll car-roll commented Sep 5, 2025

When an EC2 agent loses its connection, the pipeline detects this and terminates the job after 5 minutes. However, there is no retry logic, so with an unstable connection a single lost connection fails the pipeline. Previously, the internal check of EC2RetentionStrategy only checked whether a node had gone idle; it never checked whether the node was offline first.
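
To illustrate the retry behaviour described above, here is a minimal sketch in plain Java. It is not the plugin's code: the `Agent` interface and the `ReconnectSketch` class are hypothetical stand-ins, and only the method name `attemptReconnectIfOffline` and the "will retry next check" log wording are borrowed from the diff. The assumption is that a failed reconnect is logged and simply tried again on the next periodic check.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class ReconnectSketch {
    /** Hypothetical stand-in for an agent; not a Jenkins type. */
    interface Agent {
        boolean isOffline();
        void connect() throws Exception;
    }

    /** Tries to reconnect an offline agent; returns true if it is online afterwards. */
    static boolean attemptReconnectIfOffline(Agent agent) {
        if (!agent.isOffline()) {
            return true; // nothing to do
        }
        try {
            agent.connect();
            return !agent.isOffline();
        } catch (Exception e) {
            // Do not fail hard: the next periodic check gets another chance.
            System.err.println("Reconnect failed, will retry next check. Exception: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Flaky agent: the first connect attempt fails, the second succeeds.
        AtomicInteger attempts = new AtomicInteger();
        boolean[] offline = {true};
        Agent flaky = new Agent() {
            public boolean isOffline() { return offline[0]; }
            public void connect() throws Exception {
                if (attempts.incrementAndGet() == 1) {
                    throw new IOException("connection reset");
                }
                offline[0] = false;
            }
        };
        System.out.println(attemptReconnectIfOffline(flaky)); // prints "false"
        System.out.println(attemptReconnectIfOffline(flaky)); // prints "true"
    }
}
```

The two calls in `main` stand for two consecutive periodic checks: the first reconnect fails and is only logged, the second brings the agent back online.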

Testing done

Manual testing:
  • Created a pipeline that called a long sleep on an EC2 agent.
  • Once the EC2 agent was spun up, went to the script console and terminated the sshd connection.
  • Waited for the pipeline to reconnect to the agent.
  • Cancelled the connection multiple times; reconnection always happened.

Also killed the sshd connection just as the job was about to complete. The agent reconnected and the job was marked as completed.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

+ ", will retry next check. Exception: " + e);
return CHECK_INTERVAL_MINUTES;
}
final long uptime;
Member

(ignore whitespace)

@car-roll car-roll marked this pull request as ready for review September 8, 2025 19:11
long currentTime = this.clock.millis();

if (currentTime > nextCheckAfter) {
attemptReconnectIfOffline(c);
Contributor Author

After rebasing this PR to master, I discovered that the offline check does not happen if the EC2 agent is set to never be terminated, i.e. idleTerminationMinutes=0.
I moved it here to guarantee that the offline check always runs.
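
A minimal sketch of why the placement matters, assuming the idle-handling path returns early when idleTerminationMinutes is 0. The class `RetentionCheckSketch`, the `FakeAgent` type, and the body of `internalCheck` are hypothetical simplifications, not the plugin's real implementation; only the names `clock`, `nextCheckAfter`, `attemptReconnectIfOffline`, and `CHECK_INTERVAL_MINUTES` come from the diff.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for the retention strategy; not the plugin's real class. */
class RetentionCheckSketch {
    static final long CHECK_INTERVAL_MINUTES = 1;

    /** Hypothetical minimal agent model. */
    static class FakeAgent {
        boolean offline;
        int reconnectAttempts;

        void connect() {
            reconnectAttempts++;
            offline = false;
        }
    }

    private final Clock clock;
    private final int idleTerminationMinutes;
    private long nextCheckAfter = -1;

    RetentionCheckSketch(Clock clock, int idleTerminationMinutes) {
        this.clock = clock;
        this.idleTerminationMinutes = idleTerminationMinutes;
    }

    long check(FakeAgent c) {
        long currentTime = this.clock.millis();
        if (currentTime > nextCheckAfter) {
            // Runs before any idle handling, so it is reached even when
            // idleTerminationMinutes == 0 makes the idle logic return early.
            attemptReconnectIfOffline(c);
            long intervalMinutes = internalCheck(c);
            nextCheckAfter = currentTime + TimeUnit.MINUTES.toMillis(intervalMinutes);
        }
        return CHECK_INTERVAL_MINUTES;
    }

    private void attemptReconnectIfOffline(FakeAgent c) {
        if (c.offline) {
            c.connect();
        }
    }

    private long internalCheck(FakeAgent c) {
        if (idleTerminationMinutes == 0) {
            // Assumed early return: the agent is never terminated for idleness.
            return CHECK_INTERVAL_MINUTES;
        }
        // Idle-timeout handling is omitted from this sketch.
        return CHECK_INTERVAL_MINUTES;
    }

    public static void main(String[] args) {
        Clock fixed = Clock.fixed(Instant.ofEpochMilli(1_000), ZoneOffset.UTC);
        RetentionCheckSketch strategy = new RetentionCheckSketch(fixed, 0);
        FakeAgent agent = new FakeAgent();
        agent.offline = true;
        strategy.check(agent);
        // prints "offline=false attempts=1"
        System.out.println("offline=" + agent.offline + " attempts=" + agent.reconnectAttempts);
    }
}
```

Had the reconnect call lived inside `internalCheck` after the early return, the agent in this example would have stayed offline.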

Member

> after rebasing this PR to master

Please just use git pull inside a PR branch and avoid force-pushing, which breaks incremental review in most cases. (Unlikely to matter for a PR with such a short diff, but a good habit.)

Contributor Author

Bad choice of words. I meant after I clicked the update PR button. But yes, noted.

long currentTime = this.clock.millis();

if (currentTime > nextCheckAfter) {
attemptReconnectIfOffline(c);
Member

Do we want to do this even if DISABLED, in which case AFAICT the other behaviors of the strategy are a no-op?

Contributor Author

I can see it both ways. My opinion is that this is about handling network instability rather than deciding what to do with a node that is idle.

Member

@jglick jglick left a comment

OK AFAICT (untested)

Contributor

@mikecirioli mikecirioli left a comment

LGTM

Contributor

@mikecirioli mikecirioli left a comment

I have validated this fix as working by reproducing the original issue, updating the plugin, and testing again. After the update the agent is successfully reconnected and the build finishes correctly (from the perspective of the controller).

Member

jglick commented Sep 16, 2025

I guess up to @res0nance @fcojfernandez et al. to decide whether/when to merge? bug label recommended.

@res0nance res0nance merged commit 0a11fb7 into jenkinsci:master Sep 17, 2025
17 checks passed
@car-roll car-roll deleted the reconnect branch September 23, 2025 18:05
Shohou commented Sep 26, 2025

I believe this broke the flow in which an instance is kept in the Stopped state after the job has finished and the idle timeout has passed, which is controlled by the "Stop/Disconnect on Idle Timeout" checkbox setting. I created an issue: https://issues.jenkins.io/browse/JENKINS-76151
