Skip to content

Improve/fix implementation of hostmgr vm clean and vm-usage / VMUsageTracker #107

@AliSoftware

Description

@AliSoftware

Context

The command hostmgr vm clean looks at the vm-usage file—that reports all starts of a new VM (logged as VM name + date)—to find all the VMs in that list that have not been launched in >90 days.

This command is run as part of the pre-exit hook of our buildkite-agents running on each macOS host, so that the cleanup of old local VMs is done regularly, thus freeing up space

The current implementation has a couple of issues though

The small issues around vm-usage

  • vm clean logs the same VMs over and over even if they have already been cleaned
  • vm clean logic doesn't detect VMs that have never been used yet (though admittedly if that's the case it's unlikely that the vm would have been downloaded in the first place…)
  • cutoff date is not customizable
  • We might want to log more data in vm-usage
  • We might want to truncate old entries from vm-usage

All of those could be solved at once using a single PR with little refactoring around VMUsageTracker+VMClean

vm clean logs already-cleaned VMs every time

Because the current implementation of vm clean looks at the vm-usage file only, and that file is not cleaned up / truncated, this means that even once old VMs have been deleted by a call to vm clean, the subsequent calls will still find those old VMs in the list from the vm-usage file, and log again that this VM will be deleted.

In practice, the call to try await vmManager.removeVM(name: image.vmName) does nothing if the VM we're trying to remove doesn't exist (i.e. was already removed before), but the log about that VM being removed is still printed in all cases.
That makes it look like every CI build tries to remove the same VMs over and over, as if it didn't succeed the previous time. While in practice it did, but it's just still finding the same VMs in the vm-usage file even if they have already been removed.

vm clean doesn't detect VMs that have never been used

The way vm clean is implemented today is that it looks at the vm-usage file first, and use that as the source to decide what VMs to delete, instead of starting from the list of VMs present locally and removing the ones not listed in vm-usage in the last 90 days. This means that a VM present locally but that has never been launched, and thus is not listed in vm-usage, will never be cleaned by vm clean.

This is an edge case that will likely never happen in production, because if a VM exists on one of our macOS host machines, it's likely only because it has been requested (and thus launched) by at least one build, otherwise it would never have been fetched (i.e. downloaded from S3—or our NAS mirroring our S3 bucket—into the machine) in the first place. But it still makes the logic inconsistent with the description of what the command is supposed to do in the first place.

The cutoff date is not customizable

Not that big a deal, but it feels odd for the "90 days cutoff date" of vm clean to be hardcoded. It'd probably make sense to make it configurable via hostmgr.json, even if 90 days remains the default value if not specified.

vm-usage could benefit from recording the CI pipeline+build

Currently vm-usage only records the VM name and launch date. This makes sense for its current purposes.
That being said, it could be interesting to also log the value of the BUILDKITE_SLUG and BUILDKITE_BUILD_NUMBER env vars if they are available, so that it'd be easy to check from that log file which Buildkite projects/repo are still requesting which VM.

For example, if we notice that an old xcode-15.1 VM is still present on a host and never cleaned, while we thought we already migrated all the repos to newer VMs… it could be cumbersome to look at all our known repos one by one to find which one might still use that old IMAGE_ID. But if we added the BUILDKITE_SLUG to the vm-usage log file, we'd be able to grep xcode-15.1 /opt/ci/hostmgr/state/vm-usage | cut -f1,3 | sort | uniq and immediately see which repos were still requesting that old VM.
And adding the BUILDKITE_BUILD_NUMBER could allow us to easily find the build that requested that VM in Buildkite (to see which commit/PR it was associated with, or see if it was a scheduled build rather than a build triggered via API or PR, investigate its logs in more details…)

vm-usage file is never cleaned or truncated

This is debatable if that's a problem or not.

  • Over time, the more VMs are launched on an agent, the bigger the vm-usage file will get. So far this hasn't become a problem though, as that file is still a reasonable size despite hostmgr having started a lot of VM on those machines (example: on 001 that file has 8K lines, totaling 288KB)
  • On the other end, it could be nice to keep the whole history, especially if we implement the previous point about adding more info in that log file. Though adding those info means the file will grow faster as well.

Proposed solution

  • VMUsageTracker

    • Make VMUsageTracker record the values of BUILDKITE_SLUG and BUILDKITE_BUILD_NUMBER if they are defined.
    • During parsing, ensure that it wouldn't be an error for the line not to have those 3rd and 4th column present though (it might be ok to ignore those 3rd and 4th column while parsing if the only intend for those would be for a manual grep … | cut -f1,3 | sort | uniq; though it'd be nice to consider improving hostmgr vm stats command in the future so it can print more detailed information like vm stats --buildkite-slug --vm=xcode-15.1 could print all the buildkite-slug values for a given VM, etc).
    • Add a method to remove old entries in the log file that are older than N days, to prevent the log file from growing too much. Allow the threshold (N days) to be configurable, though only allow values of N that are greater the cutoff duration for vm clean so that we are guaranteed to keep enough history for vm clean to work as expected
  • VMClean command

    • Reimplement the method so that it calls VMManager.list() to get the list of local VMs first, and then removes the ones from that list that are also in vmManager.getVMImages(unusedSince: cutoff). While at it, it would probably be worth moving that getVMImages(unusedSince:) method from VMManager to VMUsageTracker to make it clearer that it operates on the vm-usage tracker file and not the list of local VMs that VMManager is all about manipulating.
    • Allow the cutoff date to be configured via the Configuration file

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingenhancementNew feature or requesthostmgr

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions