Cleanup EC2 VMs created by FIPS integration tests #10002
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
```bash
export ACCOUNT_KEY_SECRET=$(vault kv get -field=client_email $VAULT_PATH)
export ACCOUNT_SECRET=$(vault kv get -field=private_key $VAULT_PATH)
export ACCOUNT_PROJECT_SECRET=$(vault kv get -field=project_id $VAULT_PATH)
```
Not sure if these are the correct fields in Vault for EC2? I copied these from the analogous gce-cleanup.sh file. @v1v?
I think we need to use a different approach here, and avoid accessing vault secrets at runtime - they are not masked.
The recommended approach is to use https://github.com/elastic/vault-secrets-buildkite-plugin/
However, I think we could use OIDC and avoid static secrets altogether (see the sketch after this list):
- https://github.com/elastic/oblt-aws-auth-buildkite-plugin
- plus the OIDC configuration in the specific AWS account
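A minimal sketch of what the cleanup step could look like under that approach, assuming an auth plugin (or the Vault plugin) has already placed short-lived AWS credentials in the step environment; the environment-variable handoff and the sanity-check call are illustrative and not taken from this PR:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a Buildkite auth plugin (e.g. oblt-aws-auth) has already exported
# temporary credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
# AWS_SESSION_TOKEN) into this step's environment, so the script needs no
# `vault kv get` calls and no static secrets.

# Sanity-check that the assumed role is active before doing any cleanup work.
aws sts get-caller-identity
```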
```yaml
everyone:
  access_level: BUILD_AND_READ
---
```
I normally suggest creating this catalog-info change in a first PR. Why? The Buildkite integration with Terrazzo cannot be tested unless it is merged.
You could then create a follow-up PR and point the pipeline to your branch for testing purposes using the BK UI, validating that the changes work as expected during your PR development.
```yaml
filter_enabled: true
# required by "build_pull_requests: true" when used with buildkite-pr-bot
filter_condition: >-
  build.pull_request.id == null || (build.creator.name == 'elasticmachine' && build.pull_request.id != null)
```
If no BK PR bot, I think we can remove this section
```yaml
description: Clean up stale EC2 instances
links:
  - title: Pipeline
    url: https://buildkite.com/elastic/elastic-agent
```
Suggested change:

```diff
-    url: https://buildkite.com/elastic/elastic-agent
+    url: https://buildkite.com/elastic/elastic-agent-ec2-cleanup
```
v1v left a comment:
I think the current implementation in the ingest-dev project could be used too. So we won't need this PR but only the changes from .buildkite/misc/ec2-cleanup.yml.
Did you mean the …?
I actually got confused between the ESS cleanup and https://github.com/elastic/elastic-agent/blob/main/.buildkite/pipeline.elastic-agent-gce-cleanup.yml. Somehow, I thought it was in the private repository. I think using pipeline.elastic-agent-gce-cleanup.yml could be enough; it also runs every 4 hours. What do you think?
The FIPS VMs are spun up in EC2, so how could we use the GCE cleanup script?
Good point, the naming is not meaningful in that case - I'm certainly going blind after the long day, hence my bad suggestions 🙇 Although, can you elaborate on what you mean by 'spun up in EC2'? IIUC, those VMs are managed by the BK provisioner Gobld, hence we don't need to clean them up - they are destroyed automatically after the step finishes (regardless of success). IIRC, the GCE cleanup was for the …
I linked to the line in my comment but basically the buildkite step has …
Yeah, this was my initial thought too. The VMs should now be managed by Buildkite. However, I found that the GCE VM cleanup pipeline is finding and cleaning up VMs sometimes (example). So it would seem something is still creating the OGC VMs and leaving them orphaned?
Okay, I see you edited your comment to add the note about automatic cleanup regardless of success of the step. In that case, it seems like we don't need this PR after all, which is great! I do think we need to look into why the GCE cleanup pipeline is finding VMs sometimes but we can do that separately from this PR. Given all of the above, I will close this PR without merging, unless there are any objections.
Thanks Shaunak for walking me through the specifics here 🙇
I see names that are not … Maybe developers run those ogc commands locally. I have not investigated much the state of the art for … Thanks!
💔 Build Failed
Failed CI Steps

cc @ycombinator





What does this PR do?
This PR adds a new Buildkite pipeline and associated script to clean up any lingering EC2 VMs created by FIPS integration tests. The pipeline is executed every 4 hours.
Why is it important?
Elastic Agent CI runs FIPS integration tests. These integration tests create FIPS-compliant VMs in EC2 where the tests are actually executed. Sometimes these CI builds fail without cleaning up said VMs.
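For illustration, here is a minimal sketch of the kind of cleanup logic such a script could implement, assuming the FIPS test VMs can be identified by an EC2 tag; the tag key/value, the age threshold, and the script itself are hypothetical placeholders, not the actual implementation in this PR:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical tag and age threshold used to identify stale FIPS test VMs;
# the real pipeline may use different values.
TAG_KEY="purpose"
TAG_VALUE="fips-integration-tests"
MAX_AGE_HOURS=4

# Compute the cutoff timestamp (GNU date, as on Linux CI agents).
cutoff=$(date -u -d "${MAX_AGE_HOURS} hours ago" +%Y-%m-%dT%H:%M:%S)

# List running instances carrying the expected tag, with their launch times.
aws ec2 describe-instances \
  --filters "Name=tag:${TAG_KEY},Values=${TAG_VALUE}" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId,LaunchTime]' \
  --output text |
while read -r instance_id launch_time; do
  # ISO-8601 UTC timestamps compare lexicographically, so a string comparison
  # is enough to find instances launched before the cutoff.
  if [[ "${launch_time}" < "${cutoff}" ]]; then
    echo "Terminating stale instance ${instance_id} (launched ${launch_time})"
    aws ec2 terminate-instances --instance-ids "${instance_id}"
  fi
done
```

In practice the Buildkite pipeline would run such a script on a schedule (every 4 hours, per the description above), with AWS credentials supplied by one of the mechanisms discussed in the review.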
Checklist
- I have added an entry in ./changelog/fragments using the changelog tool

Disruptive User Impact
How to test this PR locally
Related issues
Questions to ask yourself