-
Notifications
You must be signed in to change notification settings - Fork 3
Description
Context
The command hostmgr vm clean looks at the vm-usage file—that reports all starts of a new VM (logged as VM name + date)—to find all the VMs in that list that have not been launched in >90 days.
This command is run as part of the pre-exit hook of our buildkite-agents running on each macOS host, so that the cleanup of old local VMs is done regularly, thus freeing up space
The current implementation has a couple of issues though
The small issues around vm-usage
vm cleanlogs the same VMs over and over even if they have already been cleanedvm cleanlogic doesn't detect VMs that have never been used yet (though admittedly if that's the case it's unlikely that the vm would have been downloaded in the first place…)- cutoff date is not customizable
- We might want to log more data in
vm-usage - We might want to truncate old entries from
vm-usage
All of those could be solved at once using a single PR with little refactoring around VMUsageTracker+VMClean
vm clean logs already-cleaned VMs every time
Because the current implementation of vm clean looks at the vm-usage file only, and that file is not cleaned up / truncated, this means that even once old VMs have been deleted by a call to vm clean, the subsequent calls will still find those old VMs in the list from the vm-usage file, and log again that this VM will be deleted.
In practice, the call to try await vmManager.removeVM(name: image.vmName) does nothing if the VM we're trying to remove doesn't exist (i.e. was already removed before), but the log about that VM being removed is still printed in all cases.
That makes it look like every CI build tries to remove the same VMs over and over, as if it didn't succeed the previous time. While in practice it did, but it's just still finding the same VMs in the vm-usage file even if they have already been removed.
vm clean doesn't detect VMs that have never been used
The way vm clean is implemented today is that it looks at the vm-usage file first, and use that as the source to decide what VMs to delete, instead of starting from the list of VMs present locally and removing the ones not listed in vm-usage in the last 90 days. This means that a VM present locally but that has never been launched, and thus is not listed in vm-usage, will never be cleaned by vm clean.
This is an edge case that will likely never happen in production, because if a VM exists on one of our macOS host machines, it's likely only because it has been requested (and thus launched) by at least one build, otherwise it would never have been fetched (i.e. downloaded from S3—or our NAS mirroring our S3 bucket—into the machine) in the first place. But it still makes the logic inconsistent with the description of what the command is supposed to do in the first place.
The cutoff date is not customizable
Not that big a deal, but it feels odd for the "90 days cutoff date" of vm clean to be hardcoded. It'd probably make sense to make it configurable via hostmgr.json, even if 90 days remains the default value if not specified.
vm-usage could benefit from recording the CI pipeline+build
Currently vm-usage only records the VM name and launch date. This makes sense for its current purposes.
That being said, it could be interesting to also log the value of the BUILDKITE_SLUG and BUILDKITE_BUILD_NUMBER env vars if they are available, so that it'd be easy to check from that log file which Buildkite projects/repo are still requesting which VM.
For example, if we notice that an old xcode-15.1 VM is still present on a host and never cleaned, while we thought we already migrated all the repos to newer VMs… it could be cumbersome to look at all our known repos one by one to find which one might still use that old IMAGE_ID. But if we added the BUILDKITE_SLUG to the vm-usage log file, we'd be able to grep xcode-15.1 /opt/ci/hostmgr/state/vm-usage | cut -f1,3 | sort | uniq and immediately see which repos were still requesting that old VM.
And adding the BUILDKITE_BUILD_NUMBER could allow us to easily find the build that requested that VM in Buildkite (to see which commit/PR it was associated with, or see if it was a scheduled build rather than a build triggered via API or PR, investigate its logs in more details…)
vm-usage file is never cleaned or truncated
This is debatable if that's a problem or not.
- Over time, the more VMs are launched on an agent, the bigger the
vm-usagefile will get. So far this hasn't become a problem though, as that file is still a reasonable size despitehostmgrhaving started a lot of VM on those machines (example: on 001 that file has 8K lines, totaling 288KB) - On the other end, it could be nice to keep the whole history, especially if we implement the previous point about adding more info in that log file. Though adding those info means the file will grow faster as well.
Proposed solution
-
VMUsageTracker- Make
VMUsageTrackerrecord the values ofBUILDKITE_SLUGandBUILDKITE_BUILD_NUMBERif they are defined. - During parsing, ensure that it wouldn't be an error for the line not to have those 3rd and 4th column present though (it might be ok to ignore those 3rd and 4th column while parsing if the only intend for those would be for a manual
grep … | cut -f1,3 | sort | uniq; though it'd be nice to consider improvinghostmgr vm statscommand in the future so it can print more detailed information likevm stats --buildkite-slug --vm=xcode-15.1could print all the buildkite-slug values for a given VM, etc). - Add a method to remove old entries in the log file that are older than N days, to prevent the log file from growing too much. Allow the threshold (N days) to be configurable, though only allow values of N that are greater the cutoff duration for
vm cleanso that we are guaranteed to keep enough history forvm cleanto work as expected
- Make
-
VMCleancommand- Reimplement the method so that it calls
VMManager.list()to get the list of local VMs first, and then removes the ones from that list that are also invmManager.getVMImages(unusedSince: cutoff). While at it, it would probably be worth moving thatgetVMImages(unusedSince:)method fromVMManagertoVMUsageTrackerto make it clearer that it operates on thevm-usagetracker file and not the list of local VMs thatVMManageris all about manipulating. - Allow the cutoff date to be configured via the Configuration file
- Reimplement the method so that it calls