Skip to content

Add dangling instance check for Windows#280

Open
scadu wants to merge 10 commits intomasterfrom
feat/elastic_ci_mode_windows_SUP-5127
Open

Add dangling instance check for Windows#280
scadu wants to merge 10 commits intomasterfrom
feat/elastic_ci_mode_windows_SUP-5127

Conversation

@scadu
Copy link
Contributor

@scadu scadu commented Feb 2, 2026

Summary

This PR extends the dangling instance detection in Elastic CI Mode to support Windows instances. The scaler now correctly detects Windows platform and uses PowerShell to check agent service status.

Changes

  • For dangling instance check, replace EC2 instance termination with marking EC2 instance in ASG
  • Platform detection:
    • Use strings.EqualFold for case-insensitive comparison (EC2 API returns "windows" lowercase, SDK constant is "Windows")
    • AWS-RunPowerShellScript for Windows, AWS-RunShellScript for Linux
    • Use nssm status buildkite-agent to detect service state
  • IAM: add autoscaling:SetInstanceHealth and AWS-RunPowerShellScript to the Lambda role
  • Add tests for platform detection, check command generation, and SSM document selection

Testing

To test dangling instance mechanism, I did:

  1. Rename terminate-instance.ps1 to terminate-instance.ps1.bak, so it's not triggered via systemd/nssm post-exit event
  2. Stopped the buildkite-agent service and let the Agent Scaler do its job

Sample logs:

[Elastic CI Mode] Detected platform: windows
[Elastic CI Mode] Performing detailed SSM check for 1 candidate instance(s): [i-XXXXXXXXXXX]
[Elastic CI Mode] 🧟 Found dangling instance i-XXXXXXXXXXX - buildkite-agent service is not running or check command failed
[Elastic CI Mode] Service status for i-XXXXXXXXXXX: NOT_RUNNING
[Elastic CI Mode] Successfully marked instance i-XXXXXXXXXXX as unhealthy
[Elastic CI Mode] Dangling instance check complete for ASG XXXXXXXXXXX-AgentAutoScaleGroup-XXXXXXXXXXX. Considered: 1, Actually Sent for SSM Check: 1, Marked as Unhealthy this run: 1

scadu added 4 commits February 2, 2026 12:16
Reduce log clutter, move Elastic CI Mode "banner" to NewScaler(), so
it's printed only once instead of each iteration (could be 5-6 times
before the 2 minutes default timeout is hit).
@scadu scadu requested a review from a team February 2, 2026 13:44
@scadu scadu marked this pull request as ready for review February 2, 2026 15:17
@scadu scadu requested a review from a team as a code owner February 2, 2026 15:17
@scadu scadu changed the title Add dangling instance check for Windows instances Add dangling instance check for Windows Feb 2, 2026
Copy link
Contributor

@petetomasik petetomasik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments/questions added.

@scadu scadu force-pushed the feat/elastic_ci_mode_windows_SUP-5127 branch from ea65159 to a7ce02f Compare February 5, 2026 11:38
@scadu scadu marked this pull request as draft February 5, 2026 14:12
@scadu scadu force-pushed the feat/elastic_ci_mode_windows_SUP-5127 branch from 18a9e14 to 95fc49c Compare February 5, 2026 15:05
scadu added a commit that referenced this pull request Feb 5, 2026
@scadu scadu mentioned this pull request Feb 5, 2026
@scadu scadu force-pushed the feat/elastic_ci_mode_windows_SUP-5127 branch from 95fc49c to aa524d1 Compare February 5, 2026 15:36
scadu added a commit that referenced this pull request Feb 5, 2026
Addresses #280 (comment)

Fixes missing marker during initial implementation #212 where `SIGTERM` could be sent twice to the same instance, resulting in stopping agent process immediately, without giving ongoing jobs a chance to finish.
@scadu scadu marked this pull request as ready for review February 6, 2026 09:59
@scadu scadu requested a review from petetomasik February 6, 2026 09:59
@scadu scadu marked this pull request as draft February 6, 2026 10:03
@scadu scadu force-pushed the feat/elastic_ci_mode_windows_SUP-5127 branch from f856db8 to 5013210 Compare February 6, 2026 10:43
@scadu scadu marked this pull request as ready for review February 6, 2026 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants