The cluster is managed using Terraform, with the main configuration in `main.tf`.
Terraform is a tool to automate infrastructure deployment. The basic usage is to
change this configuration and then run `terraform apply` to make the required
changes.
Terraform won't recreate the whole cluster from scratch every time; instead,
it tries to apply only the new changes. To do so, Terraform needs a state.
If you apply changes without this state, you might break the cluster.
The current configuration stores its state in a GCP bucket.
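As a quick sanity check that your local checkout is wired up to this shared
state (after the `terraform init` step described below), you can list the
resources Terraform is tracking; this is read-only and safe:

```bash
# List every resource recorded in the shared state. If this errors out or
# comes back empty, you are probably not talking to the GCS-backed state.
terraform state list
```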
The GCP web console is the easiest way to get a quick look at the infra.
IMPORTANT: cluster state is managed with terraform. Please DO NOT change shapes, scaling, or other settings using the cloud console. Any change not done through terraform will at best be overridden by terraform, and in the worst case cause an inconsistent state.
The main part you want to look into is Menu > Kubernetes Engine > Clusters.
Currently, we have 3 clusters:
- llvm-premerge-checks: the cluster hosting BuildKite Linux runners.
- llvm-premerge-cluster-us-central: the first cluster for GCP hosted runners.
- llvm-premerge-cluster-us-west: the second cluster for GCP hosted runners.
llvm-premerge-checks is part of the old Buildkite
infrastructure. For the new infrastructure, we have two clusters,
llvm-premerge-cluster-us-central and llvm-premerge-cluster-us-west, which host
the GCP runners and together form a high-availability setup. Work is load
balanced between them, and if one fails, the other picks up the work. This
also enables seamless migrations and upgrades.
To add a VM to a cluster, the VM has to come from a pool. A pool is
a group of nodes within a cluster that all have the same configuration.
For example, a pool can say it contains at most 10 nodes, each using the
c2d-highcpu-32 configuration (32 cores, 64GB RAM).
In addition, a pool can autoscale (see the GCP docs).
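If you want to check a pool's configuration without going through the web
console, a read-only gcloud call works as well. A minimal sketch, assuming the
cluster and zone names used elsewhere in this document:

```bash
# Show a node pool's machine type and autoscaling settings (read-only;
# the pool, cluster, and zone names are examples, adjust as needed).
gcloud container node-pools describe llvm-premerge-linux \
  --cluster=llvm-premerge-cluster-us-central --zone=us-central1-a
```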
If you click on llvm-premerge-cluster-us-central and go to the Nodes tab, you
will see 4 node pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows-2022
- llvm-premerge-libcxx
Definitions for each pool can be found in the Architecture overview.
If you click on a pool, for example llvm-premerge-linux, you will see one
instance group, and maybe several nodes.
Each created node must be attached to an instance group, which is used to manage a group of instances. Because we use autoscaling and have a basic cluster setup, we have a single instance group per pool.
Then, we have the nodes. If you are looking at the panel during off hours, you might see no nodes at all: when no presubmit is running, no VM is up. If you are looking at the panel at peak time, you should see 8 instances (today, autoscaling is capped at 8 instances).
If you click on a node, you'll see the CPU usage and memory usage, and you can access the logs for each instance.
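The same information is also available from the command line once you have
credentials for the cluster; a hedged sketch, again assuming the us-central
cluster and zone:

```bash
# Fetch kubeconfig credentials for the cluster (read-only inspection;
# the cluster name and zone are examples from this document).
gcloud container clusters get-credentials llvm-premerge-cluster-us-central \
  --zone=us-central1-a
# List the nodes currently up, then their CPU and memory usage.
kubectl get nodes
kubectl top nodes
```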
As long as you don't click on actions like Cordon, Edit, or Delete,
navigating the GCP panel should not cause any harm, so feel free to look
around to familiarize yourself with the interface.
- install terraform (https://developer.hashicorp.com/terraform/install?product_intent=terraform)
- get the GCP tokens:
  ```bash
  gcloud auth application-default login
  ```
- initialize terraform:
  ```bash
  terraform init
  ```
To apply any changes to the cluster:
- set up the cluster:
  ```bash
  terraform apply
  ```
- terraform will list the proposed changes.
- enter `yes` when prompted.
Setting the cluster up for the first time is more involved, as there are certain
resources where terraform is unable to handle explicit dependencies. This means
that we have to set up the GKE cluster before we set up any of the Kubernetes
resources, as otherwise the Terraform Kubernetes provider will error out. This
needs to be done for both clusters before running the standard
`terraform apply`:

```bash
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows_2022
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_libcxx
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows_2022
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_libcxx
```

Once the node pools are up in both clusters, run the standard apply to create
everything else:

```bash
terraform apply
```
Updating and resetting the Github Actions Runner Controller (ARC) within the cluster involves largely the same process, but some special considerations need to be made around how ARC interacts with kubernetes. The process involves uninstalling the runner scale set charts, deleting the namespaces to ensure everything is properly cleaned up, optionally bumping the version number if this is a version upgrade, and then reinstalling the charts to get the cluster back to accepting production jobs.
It is important not to just blindly delete controller pods or namespaces, as this (at least empirically) can disrupt the state and custom resources that ARC manages, requiring a costly full uninstallation and reinstallation of at least a runner scale set.
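If you want to see what ARC is currently managing before touching anything,
listing its custom resources is a safe, read-only operation. The resource kinds
below are the ones recent gha-runner-scale-set releases install; this is an
assumption about the ARC version in use, so adjust if yours differs:

```bash
# List the custom resources ARC manages across all namespaces
# (read-only; resource kinds assume a recent gha-runner-scale-set ARC).
kubectl get autoscalingrunnersets,ephemeralrunnersets,ephemeralrunners -A
```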
When upgrading/resetting the cluster, jobs will not be lost; they instead remain
queued on the Github side. Running build jobs will complete after the helm charts
are uninstalled unless they are forcibly killed. Note that best practice dictates
the helm charts should just be uninstalled, rather than also setting maxRunners
to zero beforehand, as the latter can cause ARC to accept some jobs but not
actually execute them, which could prevent failover in an HA cluster
configuration like ours.
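To watch the remaining runner pods drain after the charts are uninstalled,
something like the following should work; the namespace name is an example,
not taken from the configuration:

```bash
# Watch runner pods finish their in-flight jobs (namespace is an example;
# use the runner namespace for the scale set you are draining).
kubectl get pods -n llvm-premerge-linux-runners -w
```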
For the example commands below, we will be modifying the cluster in
us-central1-a. You can replace `module.premerge_cluster_us_central_resources`
with `module.premerge_cluster_us_west_resources` to switch which cluster you
are working on.
To begin, start by uninstalling the helm charts using resource targeting with
`terraform destroy`:

```bash
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_linux
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_windows_2022
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_release
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_next
```

These commands should complete, but if they do not, we are still able to get
things cleaned up. If everything went smoothly, the commands will finish and
leave behind only runner pods that are still in the process of executing jobs.
You will need to wait for those to complete before moving on. If they are
stuck, you will need to manually delete them with `kubectl delete`.
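For example, a sketch of the manual cleanup, with placeholder pod and
namespace names:

```bash
# Find runner pods that are stuck (namespace name is a placeholder).
kubectl get pods -n <namespace>
# Force-delete a stuck pod; note this kills any job it is still running.
kubectl delete pod <pod-name> -n <namespace>
```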
Follow up the previous terraform commands by deleting the kubernetes
namespaces all the resources live in:
```bash
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_windows_2022_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_release_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_next_runners
```

If things go smoothly, these should complete quickly. If they do not complete,
there are most likely dangling resources in the namespaces that need to have
their finalizers removed before the namespaces can be deleted. You can confirm
this by running `kubectl get namespaces`: if a namespace is listed as
`Terminating`, you most likely need to intervene manually. To find the
dangling resources that did not get cleaned up properly, you can run the
following command, making sure to fill in `<namespace>` with the actual
namespace of interest:
```bash
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
```

This will return the stuck resources. Then, for each stuck resource, you can
edit the YAML configuration of the kubernetes object to remove the finalizers:
```bash
kubectl edit <resource name> -n <namespace name>
```

Just deleting the `finalizers` key along with any entries under it should be
sufficient.
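If you prefer a non-interactive route, a merge patch that nulls out the
finalizers should achieve the same thing; this is generic kubectl usage rather
than anything specific to this setup:

```bash
# Remove all finalizers from a stuck resource in one step.
# <resource> is a kind/name pair, e.g. ephemeralrunner/<name>.
kubectl patch <resource> -n <namespace> --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```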
After rerunning the command to find dangling resources, you should see the
edited resource disappear. After doing this for all dangling resources, the
namespace should then be deleted automatically. This can be confirmed by
running `kubectl get namespaces`.
If you are performing these steps as part of an incident response, you can skip to the section Bringing the Cluster Back Up. If you are bumping the version, you still need to uninstall the controller and bump the version number beforehand.
Next, the controller helm chart needs to be uninstalled. If you are performing these steps as part of dealing with an incident, you most likely do not need to perform this step: usually it is sufficient to destroy and recreate the runner scale sets to resolve incidents. Uninstalling the controller is necessary for version upgrades, however.
Start by destroying the helm chart:

```bash
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_controller
```

Then delete the namespace to ensure there are no dangling resources:

```bash
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_controller
```

The next step is necessary only when bumping the version of ARC. It involves
simply updating the `version` field of the `helm_release` objects in `main.tf`.
Make sure to commit the changes and push them to llvm-zorg so that others
working on the terraform configuration have an up-to-date state when they pull
the repository.
To get the cluster back up and accepting production jobs again, simply run
`terraform apply`. It will recreate all the resources previously destroyed and
ensure they are in a state consistent with the terraform IaC definitions.
The section Strategies for Upgrading ARC outlines how ARC should be upgraded and why.