diff --git a/premerge/architecture.md b/premerge/architecture.md
index c77733ad9..ab13e6a74 100644
--- a/premerge/architecture.md
+++ b/premerge/architecture.md
@@ -20,9 +20,9 @@ To balance cost/performance, we keep both types.
 - building & testing LLVM shall be done on self-hosted runners.
 
 LLVM has several flavor of self-hosted runners:
- - libcxx runners.
 - MacOS runners for HLSL managed by Microsoft.
 - GCP windows/linux runners managed by Google.
+ - GCP linux runners set up for libcxx, managed by Google.
 
 This document only focuses on Google's GCP hosted runners.
@@ -47,10 +47,11 @@ Any relevant differences are explicitly enumerated.
 Our runners are hosted on GCP Kubernetes clusters, and use the
 [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
 
-The clusters have 3 pools:
+The clusters have 4 main pools:
 - llvm-premerge-linux
 - llvm-premerge-linux-service
 - llvm-premerge-windows
+ - llvm-premerge-libcxx
 
 **llvm-premerge-linux-service** is a fixed pool, only used to host the
 services required to manage the premerge infra (controller, listeners,
@@ -64,6 +65,11 @@ are `n2d-standard-64` due to quota limitations.
 VMs. Similar to the Linux pool, but this time it runs Windows workflows.
 In the US West cluster, the machines are `n2d-standard-32` due to quota
 limitations.
+**llvm-premerge-libcxx** is an auto-scaling pool with `n2-standard-32` VMs,
+similar to the Linux pool but with smaller machines tailored to the libcxx
+testing workflows. In the US West cluster, the machines are
+`n2d-standard-32` due to quota limitations.
+
 ### Service pool: llvm-premerge-linux-service
 
 This pool runs all the services managing the presubmit infra.
@@ -87,7 +93,7 @@ How a job is run:
 - If the instance is not reused in the next 10 minutes, the autoscaler will
   turn down the instance, freeing resources.
-### Worker pools : llvm-premerge-linux, llvm-premerge-windows
+### Worker pools: llvm-premerge-linux, llvm-premerge-windows, llvm-premerge-libcxx
 
 To make sure each runner pod is scheduled on the correct pool (linux or
 windows, avoiding the service pool), we use labels and taints.
@@ -98,6 +104,7 @@ So if we do not enforce limits, the controller could schedule 2 runners on the
 same instance, forcing containers to share resources.
 
 Those bits are configures in the
-[linux runner configuration](linux_runners_values.yaml) and
-[windows runner configuration](windows_runner_values.yaml).
+[linux runner configuration](linux_runners_values.yaml),
+[windows runner configuration](windows_runner_values.yaml), and
+[libcxx runner configuration](libcxx_runners_values.yaml).
diff --git a/premerge/cluster-management.md b/premerge/cluster-management.md
index ff841a67f..c3bfb330c 100644
--- a/premerge/cluster-management.md
+++ b/premerge/cluster-management.md
@@ -57,6 +57,7 @@ will see 3 node pools:
 - llvm-premerge-linux
 - llvm-premerge-linux-service
 - llvm-premerge-windows
+- llvm-premerge-libcxx
 
 Definitions for each pool are in [Architecture overview](architecture.md).
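To make the labels-and-taints scheme above concrete, a runner values fragment might look like the sketch below. This is a hypothetical illustration only: the key names (`premerge-platform`), values, and resource numbers are assumptions, not the contents of the actual `*_values.yaml` files linked from architecture.md.

```yaml
# Hypothetical excerpt from a runner values file (e.g. libcxx_runners_values.yaml).
# Label/taint keys and resource figures are illustrative assumptions.
template:
  spec:
    nodeSelector:
      premerge-platform: libcxx      # schedule only onto the libcxx pool
    tolerations:
      - key: premerge-platform       # tolerate the taint that keeps other
        operator: Equal              # workloads off these nodes
        value: libcxx
        effect: NoSchedule
    containers:
      - name: runner
        resources:
          requests:                  # request close to a whole node so the
            cpu: "30"                # controller never packs two runners
            memory: "100Gi"          # onto the same instance
```

The taint keeps unrelated pods off the pool, while the near-node-sized resource request is what enforces the one-runner-per-instance limit discussed above.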
@@ -96,9 +97,11 @@ To apply any changes to the cluster:
 terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
 terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
 terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows
+terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_libcxx
 terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
 terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
 terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows
+terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_libcxx
 terraform apply
 ```
 
@@ -145,6 +148,9 @@ on a kubernetes destroy command:
 ```bash
 terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_linux
 terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_windows
+terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx
+terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_release
+terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_next
 ```
 
 These should complete, but if they do not, we are still able to get things
@@ -157,6 +163,9 @@ commands by deleting the kubernetes namespaces all the resources live in:
 ```bash
 terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_linux_runners
 terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_windows_runners
+terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_runners
+terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_release_runners
+terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_next_runners
 ```
 
 If things go smoothly, these should complete quickly. If they do not complete,
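As a rough illustration of what the `llvm_premerge_libcxx` node pool target above refers to, a Terraform definition might look like the sketch below. All names, machine sizes, and autoscaling bounds here are assumptions for illustration; the authoritative definition lives in the premerge Terraform modules, not in this document.

```hcl
# Hypothetical sketch of a libcxx node pool resource; field values are
# illustrative assumptions, not the actual premerge configuration.
resource "google_container_node_pool" "llvm_premerge_libcxx" {
  name    = "llvm-premerge-libcxx"
  cluster = google_container_cluster.llvm_premerge.id

  autoscaling {
    min_node_count = 0  # scale to zero when no libcxx jobs are queued
    max_node_count = 8
  }

  node_config {
    machine_type = "n2-standard-32"

    # Taint matching the tolerations in the libcxx runner values file,
    # so only libcxx runner pods land on this pool.
    taint {
      key    = "premerge-platform"
      value  = "libcxx"
      effect = "NO_SCHEDULE"
    }

    labels = {
      premerge-platform = "libcxx"
    }
  }
}
```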