
@jgarciacloudbees jgarciacloudbees commented Oct 10, 2025

JENKINS-76151 - Continuous attempts to connect to the stopped instance

Since #1142 was released in Amazon EC2 Plugin 2034.v0a_11fb_792b_ee, instances that are stopped (not terminated) are no longer handled correctly.

With #1142, the plugin continuously tries to reconnect to disconnected nodes, regardless of the actual state of the instance.

When the EC2 Cloud is configured in pool mode, nodes are not deleted after use. They are stopped after an idle timeout and started again when a new job is created, instead of new ones being provisioned. In this mode, the code from #1142 continuously tries to reconnect to these stopped instances, which prevents the node from being restarted and fills the logs with messages similar to:

Sep 25, 2025 12:22:25 PM INFO hudson.plugins.ec2.EC2Cloud log
Failed to connect via ssh: DefaultConnectFuture[ubuntu@/10.131.100.147:22]: Failed (NoRouteToHostException) to execute: No route to host
Sep 25, 2025 12:22:25 PM INFO hudson.plugins.ec2.EC2Cloud log
Waiting for SSH to come up. Sleeping 5.
Sep 25, 2025 12:22:30 PM INFO hudson.plugins.ec2.EC2Cloud log
Connecting to 10.131.100.147 on port 22, with timeout 10000.
Sep 25, 2025 12:22:33 PM INFO hudson.plugins.ec2.EC2Cloud log
Failed to connect via ssh: DefaultConnectFuture[ubuntu@/10.131.100.147:22]: Failed (NoRouteToHostException) to execute: No route to host
Sep 25, 2025 12:22:33 PM INFO hudson.plugins.ec2.EC2Cloud log
Waiting for SSH to come up. Sleeping 5.

This continues indefinitely.

This PR adds a new condition to the hudson.plugins.ec2.EC2RetentionStrategy#attemptReconnectIfOffline method so that a reconnect is attempted only when the instance is already started.
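The idea behind the new condition can be sketched roughly as follows. This is a minimal, self-contained illustration and not the plugin's actual code: `ReconnectGuardSketch`, `InstanceState`, `AgentView`, and `shouldAttemptReconnect` are hypothetical names invented for the example, and treating "already started" as the `RUNNING` state is an assumption.

```java
// Illustrative sketch only; every type and method here is hypothetical
// and does not mirror the real EC2RetentionStrategy implementation.
final class ReconnectGuardSketch {

    // Hypothetical stand-in for the EC2 instance lifecycle states.
    enum InstanceState { PENDING, RUNNING, STOPPING, STOPPED, TERMINATED }

    // Hypothetical minimal view of an agent, just enough for the guard.
    interface AgentView {
        boolean isOffline();
        InstanceState state();
        void connect();
    }

    private ReconnectGuardSketch() {}

    // Reconnect only when the agent is offline AND its instance is running
    // (assumption: "already started" means RUNNING).
    static boolean shouldAttemptReconnect(boolean offline, InstanceState state) {
        return offline && state == InstanceState.RUNNING;
    }

    // Returns true if a reconnect was attempted, false if it was skipped.
    static boolean attemptReconnectIfOffline(AgentView agent) {
        if (!shouldAttemptReconnect(agent.isOffline(), agent.state())) {
            // Stopped pool instances are left alone so that the normal
            // provisioning path can start them again later.
            return false;
        }
        agent.connect();
        return true;
    }

    public static void main(String[] args) {
        for (InstanceState state : InstanceState.values()) {
            System.out.println(state + " offline -> reconnect? "
                    + shouldAttemptReconnect(true, state));
        }
    }
}
```

Under these assumptions, an offline agent whose instance is stopped is skipped, which is what lets the pool start it again through provisioning, while an offline agent on a running instance (the broken SSH case from #1142) still gets a reconnect attempt.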

With this change, the fix from #1142 still works properly, and it no longer breaks with stopped instances:

+ sleep 150
>>>>>>>>>>>>>> SSH connection broken with kill -9 on the corresponding EC2 instance <<<<<<<<<<<<<<<<<<<<<<<<<
EC2 (ec2) - ec2 (i-0989fd066ced0586f) seems to be removed or offline (hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@6c0784de:EC2 (ec2) - ec2 (i-0989fd066ced0586f)": Remote call on EC2 (ec2) - ec2 (i-0989fd066ced0586f) failed. The channel is closing down or has closed down); will wait for 5 min 0 sec for it to come back online
>>>>>>>>>>>>>> The system is able to reconnect, and the job ends with success <<<<<<<<<<<<<<<<<<<<<<<<<
EC2 (ec2) - ec2 (i-0989fd066ced0586f) is back online
+ date
Fri Oct 10 15:22:23 UTC 2025
[Pipeline] echo
Ending
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

The instance pool behavior works as before:

>>>>>>>>>>>>>>>>>>>>Instance is Stopped
2025-10-10 15:17:45.072+0000 [id=132]   INFO    h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (ec2) - ec2 (i-0989fd066ced0586f) after 1 idle minutes, instance statusRUNNING
2025-10-10 15:17:45.072+0000 [id=132]   INFO    h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: i-0989fd066ced0586f
2025-10-10 15:17:45.650+0000 [id=132]   INFO    h.plugins.ec2.EC2AbstractSlave#stop: EC2 instance stop request sent for i-0989fd066ced0586f

>>>>>>>>>>>>>>>>>>>>Instance is Started again to process new jobs
2025-10-10 15:19:01.110+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#provision: SlaveTemplate{description='ec2', labels='ec2'}. Attempting to provision agent needed by excess workload of 1 units
...
2025-10-10 15:19:03.860+0000 [id=1121]  INFO    hudson.plugins.ec2.EC2Cloud$2#call: Attempt 0: SlaveTemplate{description='ec2', labels='ec2'}. Node EC2 (ec2) - ec2 (i-0989fd066ced0586f) is neither pending, neither running, it's stopped. Will try again after 5s
2025-10-10 15:19:09.218+0000 [id=1121]  INFO    hudson.plugins.ec2.EC2Cloud$2#call: SlaveTemplate{description='ec2', labels='ec2'} Node EC2 (ec2) - ec2 (i-0989fd066ced0586f) moved to RUNNING state in -6 seconds and is ready to be connected by Jenkins
2025-10-10 15:19:09.219+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: Launching instance: i-0989fd066ced0586f
2025-10-10 15:19:09.219+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: bootstrap()
...
2025-10-10 15:19:31.442+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: Connected via SSH.
2025-10-10 15:19:31.821+0000 [id=3938]  INFO    hudson.plugins.ec2.EC2Cloud#log: Connection allowed after the host key has been verified
2025-10-10 15:19:31.821+0000 [id=3938]  INFO    o.a.s.c.k.e.p.HostBoundPubkeyAuthentication#parseExtension: Server announced support for publickey-hostbound@openssh.com version 0

>>>>>>>>>>>>>>>>>>>>>>Test case to verify that the previous issue is still addressed: a broken SSH connection is restored.
2025-10-10 15:20:31.749+0000 [id=3940]  INFO    h.r.SynchronousCommandTransport$ReaderThread#run: I/O error in channel EC2 (ec2) - ec2 (i-0989fd066ced0586f)
java.io.EOFException
        at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2933)
        at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3428)
        at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:985)
        at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:416)
        at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:50)
        at hudson.remoting.Command.readFrom(Command.java:141)
        at hudson.remoting.Command.readFrom(Command.java:127)
        at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:62)
Caused: java.io.IOException: Unexpected termination of the channel
        at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:80)
2025-10-10 15:20:46.868+0000 [id=142]   WARNING h.p.ec2.EC2RetentionStrategy#attemptReconnectIfOffline: EC2Computer EC2 (ec2) - ec2 (i-0989fd066ced0586f) is offline
2025-10-10 15:20:46.868+0000 [id=142]   WARNING h.p.ec2.EC2RetentionStrategy#attemptReconnectIfOffline: Attempting to reconnect EC2Computer EC2 (ec2) - ec2 (i-0989fd066ced0586f)
2025-10-10 15:20:46.869+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: Launching instance: i-0989fd066ced0586f
2025-10-10 15:20:46.869+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: bootstrap()
...

>>>>>>>>>>>>>>>>>>>>>>The agent is able to reconnect after the SSH disconnection
2025-10-10 15:20:52.623+0000 [id=3200]  INFO    hudson.plugins.ec2.EC2Cloud#log: Connected via SSH.
2025-10-10 15:20:53.064+0000 [id=4064]  INFO    hudson.plugins.ec2.EC2Cloud#log: Connection allowed after the host key has been verified
2025-10-10 15:20:53.065+0000 [id=4064]  INFO    o.a.s.c.k.e.p.HostBoundPubkeyAuthentication#parseExtension: Server announced support for publickey-hostbound@openssh.com version 0
2025-10-10 15:23:48.576+0000 [id=132]   INFO    h.p.ec2.EC2RetentionStrategy#internalCheck: Idle timeout of EC2 (ec2) - ec2 (i-0989fd066ced0586f) after 1 idle minutes, instance statusRUNNING
2025-10-10 15:23:48.576+0000 [id=132]   INFO    h.plugins.ec2.EC2AbstractSlave#idleTimeout: EC2 instance idle time expired: i-0989fd066ced0586f
>>>>>>>>>>>>>>>>>>>>>>After the idle timeout, the instance is stopped again, and ready to process new jobs.

Testing done

Manual tests based on the original issue in #1142 and on the current fix for stopped EC2 agents.

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed


@mikecirioli mikecirioli left a comment

looks good

@res0nance res0nance merged commit 829e627 into jenkinsci:master Oct 11, 2025
17 checks passed