Skip to content

Conversation

@ycombinator
Copy link
Contributor

What does this PR do?

This PR adds a new Buildkite pipeline and associated script to clean up any lingering EC2 VMs created by FIPS integration tests. The pipeline is executed every 4 hours.

Why is it important?

Elastic Agent CI runs FIPS integration tests. These integration tests create FIPS-compliant VMs in EC2 where the tests are actually executed. Sometimes these CI builds fail without cleaning up said VMs.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@mergify
Copy link
Contributor

mergify bot commented Sep 17, 2025

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

Comment on lines +5 to +7
export ACCOUNT_KEY_SECRET=$(vault kv get -field=client_email $VAULT_PATH)
export ACCOUNT_SECRET=$(vault kv get -field=private_key $VAULT_PATH)
export ACCOUNT_PROJECT_SECRET=$(vault kv get -field=project_id $VAULT_PATH)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if these are the correct fields in Vault for EC2? I copied these from the analogous gce-cleanup.sh file. @v1v?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to use a different approach here, and avoid accessing vault secrets at runtime - they are not masked.

The recommended approach is to use https://github.com/elastic/vault-secrets-buildkite-plugin/

However, I think we could use OIDC and avoid the situation of using static secrets:

@ycombinator ycombinator marked this pull request as ready for review September 22, 2025 19:00
@ycombinator ycombinator requested a review from a team as a code owner September 22, 2025 19:00
@ycombinator ycombinator requested a review from v1v September 22, 2025 19:00
everyone:
access_level: BUILD_AND_READ

---
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally suggest to create this catalog-info change in a first PR. Why? Buildkite integration with Terrazzo cannot be tested unless is merged.

So you could create a PR afterwards and point to your branch for testing purposes using the BK UI and validate changes work as expected during your PR development.

Comment on lines +472 to +475
filter_enabled: true
# required by "build_pull_requests: true" when used with buildkite-pr-bot
filter_condition: >-
build.pull_request.id == null || (build.creator.name == 'elasticmachine' && build.pull_request.id != null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no BK PR bot, I think we can remove this section

description: Clean up stale EC2 instances
links:
- title: Pipeline
url: https://buildkite.com/elastic/elastic-agent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
url: https://buildkite.com/elastic/elastic-agent
url: https://buildkite.com/elastic/elastic-agent-ec2-cleanup

Copy link
Member

@v1v v1v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current implementation in the ingest-dev project could be used too. So we won't need this PR but only the changes from .buildkite/misc/ec2-cleanup.yml.

@ycombinator
Copy link
Contributor Author

I think the current implementation in the ingest-dev project could be used too. So we won't need this PR but only the changes from .buildkite/misc/ec2-cleanup.yml.

Did you mean the ingest-dev repo? If so, what implementation are you referring to? I'm not finding anything that uses cloud-reaper under https://github.com/elastic/ingest-dev/tree/main/.buildkite.

@v1v
Copy link
Member

v1v commented Sep 22, 2025

I actually got confused with the ESS cleanup and https://github.com/elastic/elastic-agent/blob/main/.buildkite/pipeline.elastic-agent-gce-cleanup.yml. Somehow, I thought it was in the private repository.

I think using pipeline.elastic-agent-gce-cleanup.yml could be enough, it also runs every 4 hours, what do you think?

@ycombinator
Copy link
Contributor Author

I think using pipeline.elastic-agent-gce-cleanup.yml could be enough, it also runs every 4 hours, what do you think?

The FIPS VMs are spun up in EC2, so how could we use the GCE cleanup script?

@v1v
Copy link
Member

v1v commented Sep 22, 2025

Good point, the naming is not meaningful in that case - I'm certainly getting blind after the long day, hence my bad with my suggestions 🙇


Although, can you elaborate on what you mean by 'spun up in EC2'? IIUC, those VMs are managed by the BK provisioner Gobld, hence we don't need to clean them up - they are destroyed automatically after the step finished (regardless of the success)

IIRC, the GCE cleanup was for the ogc when running ITs, but IIRC, that's not the case anymore

@ycombinator
Copy link
Contributor Author

Although can you elaborate what you mean with spun up in EC2? IIUC, those VMs are managed by the BK provisioner Gobld

I linked to the line in my comment but basically the buildkite step has agents.provider: "aws". Doesn't this mean the VM used that step is provisioned in EC2?

IIRC, the GCE cleanup was for the ogc when running ITs, but IIRC, that's not the case anymore

Yeah, this was my initial thought too. The VMs should now be managed by Buildkite. However, I found that GCE VM cleanup pipeline is finding and cleaning up VMs some times (example). So it would seem something is still creating the OGC VMs and leaving them orphaned?

@ycombinator
Copy link
Contributor Author

Although, can you elaborate on what you mean by 'spun up in EC2'? IIUC, those VMs are managed by the BK provisioner Gobld, hence we don't need to clean them up - they are destroyed automatically after the step finished (regardless of the success)

Okay, I see you edited your comment to add the note about automatic cleanup regardless of success of the step. In that case, it seems like we don't need this PR after all, which is great!

I do think we need to look into why the GCE cleanup pipeline is finding VMs sometimes but we can do that separately from this PR.

Given all of the above, I will close this PR without merging, unless there are any objections.

@v1v
Copy link
Member

v1v commented Sep 22, 2025

Given all of the above, I will close this PR without merging, unless there are any objections.

Thanks Shaunak for walking me through the specifics here 🙇

it would seem something is still creating the OGC VMs and leaving them orphaned?

I see names that are not elastic-agent but other things, name=ogc-linux-amd64-ubuntu-2404-fleet-endpoint-security-d2ce state=running ):

Maybe, developers run those ogc commands locally. I have not investigated much the state of the art for ogc in the ingest projects though, but I agree we can do it in another PR or GH issues discussion.

Thanks!

@elastic-sonarqube
Copy link

@elasticmachine
Copy link
Contributor

💔 Build Failed

Failed CI Steps

cc @ycombinator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants