possible fix for JENKINS-76200 #1152

mikecirioli · 2025-10-14T15:58:50Z

[JENKINS-76200] Start stopped instances in attemptReconnectIfOffline + extensive debug logging

This commit adds both a fix and extensive debug logging to diagnose and resolve
JENKINS-76200, where existing offline Jenkins nodes with stopped EC2 instances
are never restarted, causing jobs to wait indefinitely.

Root Cause Analysis:
When an existing Jenkins node is offline with a stopped EC2 instance, Jenkins
does NOT call start() on the retention strategy. Instead, it periodically calls
check(), which calls attemptReconnectIfOffline(). The original implementation
only attempted reconnection for RUNNING instances, leaving stopped instances
in limbo forever.

Changes Made:

1. Added extensive debug logging throughout EC2RetentionStrategy:
   - start(): Logs when called, instance state during startup
   - check(): Logs when executing and when calling attemptReconnectIfOffline()
   - attemptReconnectIfOffline(): Logs state, offline status, connecting status,
     all decision branches, AWS API calls, and any errors

   All debug logs are tagged with [JENKINS-76200] for easy filtering.
2. Modified attemptReconnectIfOffline() to start stopped instances:
   - Check if instance is STOPPED or STOPPING
   - If offline, call AWS startInstances() API to start the instance
   - Log the start operation and result
   - Return immediately after starting (don't attempt connection yet)
   - On next check() cycle, instance should be PENDING or RUNNING
   - Existing reconnection logic handles RUNNING instances
3. Added necessary imports:
   - software.amazon.awssdk.services.ec2.Ec2Client
   - software.amazon.awssdk.services.ec2.model.StartInstancesRequest
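
The decision flow listed above can be sketched without any Jenkins or AWS dependencies. This is only an illustrative model, not the plugin's code: `ReconnectSketch`, `FakeEc2`, `State`, and `Action` are hypothetical names, and the rule that a node which is already connecting is left alone is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, dependency-free model of the reconnect decision described above. */
final class ReconnectSketch {

    /** Simplified stand-in for the EC2 instance states named in the description. */
    enum State { PENDING, RUNNING, STOPPING, STOPPED }

    /** What a single check() cycle decides to do for one node. */
    enum Action { NONE, START_INSTANCE, RECONNECT }

    /** Hypothetical stand-in for the AWS client: records start requests instead of calling AWS. */
    static final class FakeEc2 {
        final List<String> started = new ArrayList<>();

        void startInstance(String instanceId) {
            started.add(instanceId);
        }
    }

    static Action attemptReconnectIfOffline(String instanceId, State state,
                                            boolean offline, boolean connecting, FakeEc2 ec2) {
        if (!offline || connecting) {
            // Assumption: an online or already-connecting node needs no action.
            return Action.NONE;
        }
        if (state == State.STOPPED || state == State.STOPPING) {
            ec2.startInstance(instanceId);
            // Return immediately; the connection is attempted on a later check() cycle.
            return Action.START_INSTANCE;
        }
        if (state == State.RUNNING) {
            // Existing behaviour: only RUNNING instances are reconnected.
            return Action.RECONNECT;
        }
        // PENDING: wait for the next cycle.
        return Action.NONE;
    }
}
```

In this model a STOPPED, offline node yields `START_INSTANCE` on one cycle and, once the state is reported as RUNNING, `RECONNECT` on a later one.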

Testing Instructions:

1. Deploy this build to Jenkins with an offline node that has a stopped instance
2. Trigger a job that requires the offline node
3. Check Jenkins logs for [JENKINS-76200] messages showing:
   - check() being called
   - attemptReconnectIfOffline() detecting STOPPED state
   - AWS startInstances() being called
   - Instance transitioning to PENDING/RUNNING
   - Successful connection
4. Verify in AWS console that the instance actually starts

Expected Behavior:

- Logs should clearly show the flow from check() -> attemptReconnectIfOffline()
- Instance should start in AWS when detected as STOPPED
- Node should eventually come online and job should execute

Previous commit started stopped instances whenever check() ran, even when no jobs were waiting. This caused unnecessary instance starts.

Added check to only start stopped instance if itemsInQueueForThisSlave() returns true, meaning there are actually jobs waiting for this specific node.

Changes:
- Check for queued jobs before calling startInstances()
- Log whether jobs are queued: "jobs in queue: true/false"
- Skip starting if no jobs: "No jobs waiting - leaving it stopped"
- Only start if jobs waiting: "Jobs are waiting - attempting to start"

This ensures stopped instances remain stopped until actually needed for work.
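
A minimal sketch of this gate, assuming the queue check has already been reduced to a boolean. `QueueGateSketch`, `shouldStart`, and `describe` are hypothetical names for illustration; only the two log messages are taken from the commit text.

```java
/** Hypothetical model of the "only start when jobs are waiting" gate. */
final class QueueGateSketch {

    /** Simplified stand-in for the EC2 instance states named in the description. */
    enum State { PENDING, RUNNING, STOPPING, STOPPED }

    /** Returns true only when the node is offline, its instance is stopped, and work is queued for it. */
    static boolean shouldStart(State state, boolean offline, boolean jobsQueued) {
        boolean stopped = state == State.STOPPED || state == State.STOPPING;
        return offline && stopped && jobsQueued;
    }

    /** Log line for the decision, using the message wording quoted in the commit text. */
    static String describe(boolean jobsQueued) {
        return jobsQueued
                ? "Jobs are waiting - attempting to start"
                : "No jobs waiting - leaving it stopped";
    }
}
```

With this gate, a periodic check() on an idle stopped node is a no-op, which is the behaviour the commit is aiming for.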

Changed from checking only explicit node assignment (selfLabel) to using Label.contains() which properly checks if a node can execute jobs based on label matching. This fixes the issue where stopped instances would only start for jobs explicitly tied to the node name, not for jobs that match the node's labels.

Changes:
- Use assignedLabel.contains(selfNode) instead of assignedLabel == selfLabel
- Handle null assignedLabel (jobs that can run on any node)
- Added comment explaining the label matching logic

Now stopped instances will start for:
- Jobs with no label requirement (assignedLabel == null)
- Jobs whose labels match this node's capabilities (assignedLabel.contains(selfNode))

Before this fix, stopped instances only started for jobs explicitly tied to the specific node name.
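
The matching rule can be illustrated with a small stand-alone model. The `Node`, `Label`, and `QueuedItem` classes below are simplified, hypothetical stand-ins rather than the Jenkins types, and a label here is a single atom, whereas real Jenkins labels can be full expressions.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical model of the label-based queue check described above. */
final class LabelMatchSketch {

    /** Simplified node: a name (its self label) plus the label atoms it carries. */
    static final class Node {
        final String name;
        final Set<String> labels;

        Node(String name, Set<String> labels) {
            this.name = name;
            this.labels = labels;
        }
    }

    /** Simplified single-atom label. */
    static final class Label {
        final String atom;

        Label(String atom) {
            this.atom = atom;
        }

        /** True when the node is named by this label or carries it. */
        boolean contains(Node node) {
            return atom.equals(node.name) || node.labels.contains(atom);
        }
    }

    /** Simplified queue item; a null assignedLabel means the job can run on any node. */
    static final class QueuedItem {
        final Label assignedLabel;

        QueuedItem(Label assignedLabel) {
            this.assignedLabel = assignedLabel;
        }
    }

    /** True when at least one queued item could run on the given node. */
    static boolean itemsInQueueFor(Node self, List<QueuedItem> queue) {
        for (QueuedItem item : queue) {
            if (item.assignedLabel == null || item.assignedLabel.contains(self)) {
                return true;
            }
        }
        return false;
    }
}
```

Under the earlier equality check, only a job tied to the node's own name would have matched; here a job labelled `linux` also matches a node that carries the `linux` label.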

The NoDelayProvisionerStrategy was counting offline STOPPED EC2 instances as "available capacity", preventing provisioning from being triggered when jobs were queued. This caused STOPPED instances to remain stopped forever, with jobs waiting indefinitely.

Root cause:
- countProvisionedButNotExecutingNodes() counted ALL offline nodes
- STOPPED instances were included in available capacity
- When capacity >= demand, provisioning was skipped
- provisionOndemand() was never called to start the stopped instances

Fix:
- Check AWS instance state for offline nodes
- Exclude STOPPED/STOPPING instances from capacity count
- Only count instances that will come online (PENDING/RUNNING)
- Fail-safe: if state check fails, count the instance to avoid over-provisioning

This preserves the fixes from:
- JENKINS-76151: EC2RetentionStrategy still only reconnects RUNNING instances
- JENKINS-76171: Offline PENDING/RUNNING instances still counted to prevent over-provisioning

Testing:
1. Stop an EC2 instance (via AWS or Jenkins stopOnTerminate)
2. Queue a job requiring that label
3. Verify provisioning is triggered and instance starts in AWS
4. Check logs for "Excluding STOPPED instance {id} from available capacity"
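
The counting rule, including the fail-safe, might look roughly like the sketch below. It is a stand-alone model under stated assumptions: `CapacitySketch`, `OfflineNode`, and the `stateLookup` function are hypothetical, capacity is assumed to be measured in executors, and a failed state check is modelled as a thrown `RuntimeException`.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical model of the capacity count described above. */
final class CapacitySketch {

    /** Simplified stand-in for the EC2 instance states named in the description. */
    enum State { PENDING, RUNNING, STOPPING, STOPPED }

    /** Simplified offline node: its instance id and how many executors it would provide. */
    static final class OfflineNode {
        final String instanceId;
        final int executors;

        OfflineNode(String instanceId, int executors) {
            this.instanceId = instanceId;
            this.executors = executors;
        }
    }

    /**
     * Counts executors on offline nodes that are expected to come online.
     * stateLookup is a stand-in for the AWS state query and may throw on failure.
     */
    static int countProvisionedButNotExecuting(List<OfflineNode> offlineNodes,
                                               Function<String, State> stateLookup) {
        int capacity = 0;
        for (OfflineNode node : offlineNodes) {
            State state;
            try {
                state = stateLookup.apply(node.instanceId);
            } catch (RuntimeException e) {
                // Fail-safe: when the state is unknown, count the node to avoid over-provisioning.
                capacity += node.executors;
                continue;
            }
            if (state == State.PENDING || state == State.RUNNING) {
                capacity += node.executors;
            }
            // STOPPED/STOPPING instances are excluded from available capacity.
        }
        return capacity;
    }
}
```

Because stopped instances no longer count as capacity, demand can exceed capacity again and the provisioning path gets a chance to start them.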

mikecirioli added 4 commits October 14, 2025 11:58

possible fix

ad5e732

debug logging

aeb4953

mikecirioli changed the title ~~possible fix~~ possible fix for JENKINS-76200 Oct 14, 2025

mikecirioli force-pushed the fix/JENKINS-76200-start-in-reconnect-with-logging branch from 3350f5a to f810159 Compare October 15, 2025 12:03

mikecirioli closed this Oct 15, 2025

mikecirioli deleted the fix/JENKINS-76200-start-in-reconnect-with-logging branch October 15, 2025 12:11
