@mikecirioli
Contributor

[JENKINS-76200] Start stopped instances in attemptReconnectIfOffline + extensive debug logging

This commit adds a fix plus extensive debug logging to diagnose and resolve
JENKINS-76200, in which existing offline Jenkins nodes with stopped EC2
instances are never restarted, causing jobs to wait indefinitely.

Root Cause Analysis:
When an existing Jenkins node is offline with a stopped EC2 instance, Jenkins
does NOT call start() on the retention strategy. Instead, it periodically calls
check() which calls attemptReconnectIfOffline(). The original implementation
only attempted reconnection for RUNNING instances, leaving stopped instances
in limbo forever.

Changes Made:

  1. Added extensive debug logging throughout EC2RetentionStrategy:

    • start(): Logs when called, instance state during startup
    • check(): Logs when executing and when calling attemptReconnectIfOffline()
    • attemptReconnectIfOffline(): Logs state, offline status, connecting status,
      all decision branches, AWS API calls, and any errors

    All debug logs are tagged with [JENKINS-76200] for easy filtering.

  2. Modified attemptReconnectIfOffline() to start stopped instances:

    • Check if instance is STOPPED or STOPPING
    • If offline, call AWS startInstances() API to start the instance
    • Log the start operation and result
    • Return immediately after starting (don't attempt connection yet)
    • On next check() cycle, instance should be PENDING or RUNNING
    • Existing reconnection logic handles RUNNING instances

  3. Added necessary imports:

    • software.amazon.awssdk.services.ec2.Ec2Client
    • software.amazon.awssdk.services.ec2.model.StartInstancesRequest
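
The decision flow described in item 2 can be sketched roughly as follows. This is a simplified model, not the plugin's actual code: the State and Action enums, the InstanceStarter interface, and the early return when the node is online or already connecting are all assumptions made for illustration. In the real change the start call would go through the Ec2Client and StartInstancesRequest imports listed above; here it is replaced by a hypothetical seam so the sketch runs without the AWS SDK.

```java
import java.util.logging.Logger;

/** Hypothetical model of the reconnect decision; not the plugin's actual code. */
final class ReconnectSketch {
    private static final Logger LOGGER = Logger.getLogger(ReconnectSketch.class.getName());
    private static final String TAG = "[JENKINS-76200] ";

    /** Simplified stand-in for the instance states named in the commit message. */
    enum State { PENDING, RUNNING, STOPPING, STOPPED }

    /** What a single check() cycle decides to do for one node. */
    enum Action { NONE, START_INSTANCE, RECONNECT }

    /** Assumed seam for the AWS start call, so the sketch needs no SDK. */
    interface InstanceStarter {
        void start(String instanceId);
    }

    static Action attemptReconnectIfOffline(String instanceId, State state, boolean offline,
                                            boolean connecting, InstanceStarter starter) {
        LOGGER.fine(TAG + "state=" + state + " offline=" + offline + " connecting=" + connecting);
        // Assumption: nothing to do for a node that is online or already connecting.
        if (!offline || connecting) {
            return Action.NONE;
        }
        if (state == State.STOPPED || state == State.STOPPING) {
            try {
                starter.start(instanceId);
                LOGGER.fine(TAG + "start requested for " + instanceId);
            } catch (RuntimeException e) {
                // Assumption: a failed start is logged and retried on a later cycle.
                LOGGER.warning(TAG + "start failed for " + instanceId + ": " + e.getMessage());
                return Action.NONE;
            }
            // Return immediately; a later check() cycle should see PENDING or RUNNING.
            return Action.START_INSTANCE;
        }
        if (state == State.RUNNING) {
            return Action.RECONNECT; // the pre-existing reconnection path
        }
        return Action.NONE; // e.g. PENDING: wait for the next cycle
    }
}
```

Keeping the start and the reconnect in separate check() cycles means the reconnect path only ever sees an instance that has already reached RUNNING.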

Testing Instructions:

  1. Deploy this build to Jenkins with an offline node that has a stopped instance

  2. Trigger a job that requires the offline node

  3. Check Jenkins logs for [JENKINS-76200] messages showing:

    • check() being called
    • attemptReconnectIfOffline() detecting STOPPED state
    • AWS startInstances() being called
    • Instance transitioning to PENDING/RUNNING
    • Successful connection

  4. Verify in AWS console that the instance actually starts

Expected Behavior:

  • Logs should clearly show the flow from check() -> attemptReconnectIfOffline()
  • Instance should start in AWS when detected as STOPPED
  • Node should eventually come online and job should execute

The previous commit started stopped instances whenever check() ran, even when
no jobs were waiting, which caused unnecessary instance starts.

Added a check to start a stopped instance only if itemsInQueueForThisSlave()
returns true, meaning there are actually jobs waiting for this specific node.

Changes:
- Check for queued jobs before calling startInstances()
- Log whether jobs are queued: "jobs in queue: true/false"
- Skip starting if no jobs: "No jobs waiting - leaving it stopped"
- Only start if jobs waiting: "Jobs are waiting - attempting to start"
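
A minimal sketch of the gating described above, assuming the result of itemsInQueueForThisSlave() is passed in as a plain boolean. The class name, the method name, and the in-memory log list are hypothetical; the message texts are taken from the list above, and the exact formatting of the real log lines may differ.

```java
import java.util.List;

/** Hypothetical model of the queue gate; not the plugin's actual code. */
final class QueueGateSketch {
    private static final String TAG = "[JENKINS-76200] ";

    /**
     * Decides whether a stopped instance should be started.
     *
     * @param jobsInQueue stand-in for the result of itemsInQueueForThisSlave()
     * @param log         collects the messages a real logger would emit
     */
    static boolean shouldStartStoppedInstance(boolean jobsInQueue, List<String> log) {
        log.add(TAG + "jobs in queue: " + jobsInQueue);
        if (!jobsInQueue) {
            log.add(TAG + "No jobs waiting - leaving it stopped");
            return false;
        }
        log.add(TAG + "Jobs are waiting - attempting to start");
        return true;
    }
}
```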

This ensures stopped instances remain stopped until actually needed for work.

Changed the queue check from comparing only the explicit node assignment
(selfLabel) to using Label.contains(), which checks whether a node can execute
a job based on label matching. This fixes the issue where stopped instances
would start only for jobs explicitly tied to the node name, not for jobs that
match the node's labels.

Changes:
- Use assignedLabel.contains(selfNode) instead of assignedLabel == selfLabel
- Handle null assignedLabel (jobs that can run on any node)
- Added comment explaining the label matching logic

Now stopped instances will start for:
- Jobs with no label requirement (assignedLabel == null)
- Jobs whose labels match this node's capabilities (assignedLabel.contains(selfNode))
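
The difference between the old and new checks can be illustrated with a small model. The Node and Label classes below are simplified stand-ins for the Jenkins types of the same name, under the assumption that a label matches a node when it equals either the node's own name (its self label) or one of its assigned labels. The node and label names are made up for the example.

```java
import java.util.Set;

/** Hypothetical model of the label check; not the plugin's actual code. */
final class LabelMatchSketch {
    /** Simplified node: a name (its self label) plus assigned labels. */
    static final class Node {
        final String name;
        final Set<String> labels;

        Node(String name, Set<String> labels) {
            this.name = name;
            this.labels = labels;
        }
    }

    /** Simplified single-atom label; real label expressions can be richer. */
    static final class Label {
        final String name;

        Label(String name) {
            this.name = name;
        }

        boolean contains(Node node) {
            return node.name.equals(name) || node.labels.contains(name);
        }
    }

    /** Old behavior: only a job tied to the node's own name matches. */
    static boolean oldMatch(Label assignedLabel, Node self) {
        return assignedLabel != null && assignedLabel.name.equals(self.name);
    }

    /** New behavior: unlabeled jobs and label matches both count. */
    static boolean newMatch(Label assignedLabel, Node self) {
        return assignedLabel == null || assignedLabel.contains(self);
    }
}
```

With a node named ec2-agent-1 carrying the labels linux and docker, a job labeled linux is ignored by oldMatch but accepted by newMatch, while a job labeled windows is rejected by both.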

Before this fix, stopped instances only started for jobs explicitly tied to
the specific node name.

@mikecirioli changed the title from "possible fix" to "possible fix for JENKINS-76200" on Oct 14, 2025

The NoDelayProvisionerStrategy was counting offline STOPPED EC2 instances
as "available capacity", preventing provisioning from being triggered when
jobs were queued. This caused STOPPED instances to remain stopped forever,
with jobs waiting indefinitely.

Root cause:
- countProvisionedButNotExecutingNodes() counted ALL offline nodes
- STOPPED instances were included in available capacity
- When capacity >= demand, provisioning was skipped
- provisionOndemand() was never called to start the stopped instances

Fix:
- Check AWS instance state for offline nodes
- Exclude STOPPED/STOPPING instances from capacity count
- Only count instances that will come online (PENDING/RUNNING)
- Fail-safe: if state check fails, count the instance to avoid over-provisioning
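
The counting rule in the fix can be sketched as below. This is an illustrative model only: the State enum, the StateLookup interface, and the method name are assumptions, and the real countProvisionedButNotExecutingNodes() works on Jenkins node objects rather than bare instance IDs.

```java
import java.util.List;

/** Hypothetical model of the capacity count; not the plugin's actual code. */
final class CapacitySketch {
    /** Simplified stand-in for the instance states named in the commit message. */
    enum State { PENDING, RUNNING, STOPPING, STOPPED }

    /** Assumed seam for the AWS state query; it may fail. */
    interface StateLookup {
        State stateOf(String instanceId) throws Exception;
    }

    /** Counts offline nodes that are expected to come online by themselves. */
    static int countAvailableOfflineNodes(List<String> offlineInstanceIds, StateLookup lookup) {
        int count = 0;
        for (String id : offlineInstanceIds) {
            try {
                State state = lookup.stateOf(id);
                if (state == State.PENDING || state == State.RUNNING) {
                    count++; // will come online, so it is real capacity
                }
                // STOPPED and STOPPING are skipped: they are not capacity.
            } catch (Exception e) {
                count++; // fail-safe: count it to avoid over-provisioning
            }
        }
        return count;
    }
}
```

Counting an instance whose state cannot be read errs on the side of under-provisioning, which matches the fail-safe bullet above.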

This preserves the fixes from:
- JENKINS-76151: EC2RetentionStrategy still only reconnects RUNNING instances
- JENKINS-76171: Offline PENDING/RUNNING instances still counted to prevent over-provisioning

Testing:
1. Stop an EC2 instance (via AWS or Jenkins stopOnTerminate)
2. Queue a job requiring that label
3. Verify provisioning is triggered and instance starts in AWS
4. Check logs for "Excluding STOPPED instance {id} from available capacity"

@mikecirioli force-pushed the fix/JENKINS-76200-start-in-reconnect-with-logging branch from 3350f5a to f810159 on October 15, 2025 12:03
@mikecirioli deleted the fix/JENKINS-76200-start-in-reconnect-with-logging branch on October 15, 2025 12:11