17 changes: 12 additions & 5 deletions premerge/architecture.md
@@ -20,9 +20,9 @@ To balance cost/performance, we keep both types.
- building & testing LLVM shall be done on self-hosted runners.

LLVM has several flavors of self-hosted runners:
- libcxx runners.
- macOS runners for HLSL managed by Microsoft.
- GCP Windows/Linux runners managed by Google.
- GCP Linux runners set up for libcxx, managed by Google.

This document only focuses on Google's GCP hosted runners.

@@ -47,10 +47,11 @@ Any relevant differences are explicitly enumerated.

Our runners are hosted on GCP Kubernetes clusters, and use the
[Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
The clusters have 3 pools:
The clusters have 4 main pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows
- llvm-premerge-libcxx

**llvm-premerge-linux-service** is a fixed pool, only used to host the
services required to manage the premerge infra (controller, listeners,
@@ -64,6 +65,11 @@ are `n2d-standard-64` due to quota limitations.
VMs. Similar to the Linux pool, but this time it runs Windows workflows. In the
US West cluster, the machines are `n2d-standard-32` due to quota limitations.

**llvm-premerge-libcxx** is an auto-scaling pool of `n2-standard-32`
VMs. It is similar to the Linux pool but with smaller machines tailored
to the libcxx testing workflows. In the US West cluster, the machines are
`n2d-standard-32` due to quota limitations.

### Service pool: llvm-premerge-linux-service

This pool runs all the services managing the presubmit infra.
@@ -87,7 +93,7 @@ How a job is run:
- If the instance is not reused in the next 10 minutes, the autoscaler
will turn down the instance, freeing resources.

### Worker pools : llvm-premerge-linux, llvm-premerge-windows
### Worker pools: llvm-premerge-linux, llvm-premerge-windows, llvm-premerge-libcxx

To make sure each runner pod is scheduled on the correct pool (Linux,
Windows, or libcxx, avoiding the service pool), we use labels and taints.
@@ -98,6 +104,7 @@ So if we do not enforce limits, the controller could schedule 2 runners on
the same instance, forcing containers to share resources.

Those bits are configured in the
[linux runner configuration](linux_runners_values.yaml) and
[windows runner configuration](windows_runner_values.yaml).
[linux runner configuration](linux_runners_values.yaml),
[windows runner configuration](windows_runner_values.yaml), and
[libcxx runner configuration](libcxx_runners_values.yaml).
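As a rough sketch of what such a values file expresses (field names follow the
ARC `gha-runner-scale-set` Helm chart; the label key, taint, and resource
figures below are illustrative assumptions, not the actual contents of the
linked files):

```yaml
# Hypothetical excerpt from a runner values file such as linux_runners_values.yaml.
# nodeSelector pins runner pods to the matching node pool, the toleration lets
# them pass that pool's taint, and the resource request is sized so the
# scheduler cannot place two runner pods on the same VM.
template:
  spec:
    nodeSelector:
      premerge-platform: linux
    tolerations:
      - key: premerge-platform
        operator: Equal
        value: linux
        effect: NoSchedule
    containers:
      - name: runner
        resources:
          requests:
            cpu: "55"
            memory: "200Gi"
```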

9 changes: 9 additions & 0 deletions premerge/cluster-management.md
@@ -57,6 +57,7 @@ will see 4 node pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows
- llvm-premerge-libcxx

Definitions for each pool are in [Architecture overview](architecture.md).

@@ -96,9 +97,11 @@ To apply any changes to the cluster:
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows
terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_libcxx
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows
terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_libcxx
terraform apply
```
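Each of those `-target` flags points at a `google_container_node_pool`
resource in the Terraform configuration. A minimal sketch of what one such
pool definition might look like (resource names, sizes, autoscaling bounds,
and the taint key are illustrative assumptions, not the repository's actual
config):

```hcl
resource "google_container_node_pool" "llvm_premerge_libcxx" {
  name    = "llvm-premerge-libcxx"
  cluster = google_container_cluster.llvm_premerge.id

  # Scale to zero when no libcxx workflows are queued.
  autoscaling {
    min_node_count = 0
    max_node_count = 8
  }

  node_config {
    machine_type = "n2-standard-32"

    # The taint keeps unrelated pods off this pool; runner pods carry a
    # matching toleration plus a nodeSelector on the label below.
    taint {
      key    = "premerge-platform"
      value  = "libcxx"
      effect = "NO_SCHEDULE"
    }
    labels = {
      "premerge-platform" = "libcxx"
    }
  }
}
```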

@@ -145,6 +148,9 @@ on a kubernetes destroy command:
```bash
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_linux
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_windows
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_release
terraform destroy -target module.premerge_cluster_us_central_resources.helm_release.github_actions_runner_set_libcxx_next
```

These should complete, but if they do not, we are still able to get things
@@ -157,6 +163,9 @@ commands by deleting the kubernetes namespaces all the resources live in:
```bash
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_linux_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_windows_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_release_runners
terraform destroy -target module.premerge_cluster_us_central_resources.kubernetes_namespace.llvm_premerge_libcxx_next_runners
```

If things go smoothly, these should complete quickly. If they do not complete,