Commit 97c01e2

[CI] Update the documentation based on recent changes
The biggest change was the move to a HA setup. This commit mostly consists of an entire pass over the docs to find any inconsistencies/outdated issues though.

Reviewers: Keenuts, lnihlen, dschuff, gburgessiv, cmtice
Reviewed By: dschuff
Pull Request: #446
1 parent db7569a commit 97c01e2

3 files changed: +44 -23 lines changed

premerge/architecture.md

Lines changed: 12 additions & 5 deletions
````diff
@@ -41,8 +41,13 @@ Our self hosted runners come in two flavors:
 
 ## GCP runners - Architecture overview
 
-Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
-The cluster has 3 pools:
+We have two clusters to compose a high availability setup. The description
+below describes an individual cluster, but they are largely identical.
+Any relevant differences are explicitly enumerated.
+
+Our runners are hosted on GCP Kubernetes clusters, and use the
+[Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
+The clusters have 3 pools:
 - llvm-premerge-linux
 - llvm-premerge-linux-service
 - llvm-premerge-windows
@@ -52,10 +57,12 @@ services required to manage the premerge infra (controller, listeners,
 monitoring). Today, this pool has three `e2-highcpu-4` machine.
 
 **llvm-premerge-linux** is a auto-scaling pool with large `n2-standard-64`
-VMs. This pool runs the Linux workflows.
+VMs. This pool runs the Linux workflows. In the US West cluster, the machines
+are `n2d-standard-64` due to quota limitations.
 
-**llvm-premerge-windows** is a auto-scaling pool with large `n2-standard-64`
-VMs. Similar to the Linux pool, but this time it runs Windows workflows.
+**llvm-premerge-windows** is a auto-scaling pool with large `n2-standard-32`
+VMs. Similar to the Linux pool, but this time it runs Windows workflows. In the
+US West cluster, the machines are `n2d-standard-32` due to quota limitations.
 
 ### Service pool: llvm-premerge-linux-service
 
````
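For orientation, the pools described above can be inspected directly on a cluster. A minimal sketch, assuming the `us-west` cluster is zonal in `us-west1-a` and using a placeholder project id (neither is stated in this commit):

```bash
# Fetch kubectl credentials for one of the GCP clusters, then list its nodes
# grouped by node pool. Cluster location and project id are assumptions.
gcloud container clusters get-credentials llvm-premerge-cluster-us-west \
    --zone us-west1-a --project <llvm-premerge-project>
kubectl get nodes --label-columns cloud.google.com/gke-nodepool
```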

premerge/cluster-management.md

Lines changed: 31 additions & 17 deletions
````diff
@@ -31,15 +31,19 @@ worst case cause an inconsistent state.
 
 The main part you want too look into is `Menu > Kubernetes Engine > Clusters`.
 
-Currently, we have 3 clusters:
+Currently, we have 4 clusters:
 - `llvm-premerge-checks`: the cluster hosting BuildKite Linux runners.
 - `windows-cluster`: the cluster hosting BuildKite Windows runners.
-- `llvm-premerge-prototype`: the cluster for those GCP hoster runners.
+- `llvm-premerge-prototype`: The first cluster for GCP hosted runners.
+- `llvm-premerge-cluster-us-west`: The second cluster for GCP hosted runners.
 
-Yes, it's called `prototype`, but that's the production cluster.
-We should rename it at some point.
+Yes, one is called `prototype`, but that's the production cluster.
+We should rename it at some point. We have two clusters for GCP hosted runners
+to form a high availability setup. They both load balance, and if one fails
+then the other will pick up the work. This also enables seamless migrations
+and upgrades.
 
-To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is
+To add a VM to a cluster, the VM has to come from a `pool`. A `pool` is
 a group of nodes within a cluster that all have the same configuration.
 
 For example:
````
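As a CLI counterpart to the console path mentioned in this hunk (`Menu > Kubernetes Engine > Clusters`), the same listing can be retrieved with `gcloud`; the project id below is a placeholder:

```bash
# List the GKE clusters in the project; expect llvm-premerge-checks,
# windows-cluster, llvm-premerge-prototype and llvm-premerge-cluster-us-west.
gcloud container clusters list --project <llvm-premerge-project>
```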
````diff
@@ -88,16 +92,21 @@ To apply any changes to the cluster:
 ## Setting the cluster up for the first time
 
 ```
-terraform apply -target google_container_node_pool.llvm_premerge_linux_service
-terraform apply -target google_container_node_pool.llvm_premerge_linux
-terraform apply -target google_container_node_pool.llvm_premerge_windows
+terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
+terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
+terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows
+terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
+terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
+terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows
 terraform apply
 ```
 
 Setting the cluster up for the first time is more involved as there are certain
 resources where terraform is unable to handle explicit dependencies. This means
 that we have to set up the GKE cluster before we setup any of the Kubernetes
-resources as otherwise the Terraform Kubernetes provider will error out.
+resources as otherwise the Terraform Kubernetes provider will error out. This
+needs to be done for both clusters before running the standard
+`terraform apply`.
 
 ## Upgrading/Resetting Github ARC
 
````
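As a sanity check after the targeted applies (not part of the documented procedure, just a sketch), the Terraform state can be inspected to confirm that node pools exist under both cluster modules before the final untargeted apply:

```bash
# Expect six node pool resources: three under module.premerge_cluster_us_central
# and three under module.premerge_cluster_us_west.
terraform state list | grep google_container_node_pool
```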

````diff
@@ -119,16 +128,21 @@ queued on the Github side. Running build jobs will complete after the helm chart
 are uninstalled unless they are forcibly killed. Note that best practice dictates
 the helm charts should just be uninstalled rather than also setting `maxRunners`
 to zero beforehand as that can cause ARC to accept some jobs but not actually
-execute them which could prevent failover in HA cluster configurations.
+execute them which could prevent failover in a HA cluster configuration like
+ours.
 
 ### Uninstalling the Helm Charts
 
+For the example commands below we will be modifying the cluster in
+`us-central1-a`. You can replace `module.premerge_cluster_us_central` with
+`module.premerge_cluster_us_west` to switch which cluster you are working on.
+
 To begin, start by uninstalling the helm charts by using resource targetting
 on a kubernetes destroy command:
 
 ```bash
-terraform destroy -target helm_release.github_actions_runner_set_linux
-terraform destroy -target helm_release.github_actions_runner_set_windows
+terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_linux
+terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_windows
 ```
 
 These should complete, but if they do not, we are still able to get things
````
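Once the runner-set charts are destroyed, a hedged way to watch the runners drain; the Kubernetes namespace names are assumed to match the Terraform resource names above:

```bash
# Runner pods should terminate once the helm releases are gone.
kubectl get pods -n llvm-premerge-linux-runners
kubectl get pods -n llvm-premerge-windows-runners
```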
````diff
@@ -139,8 +153,8 @@ manually delete them with `kubectl delete`. Follow up the previous terraform
 commands by deleting the kubernetes namespaces all the resources live in:
 
 ```bash
-terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
-terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
+terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_linux_runners
+terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_windows_runners
 ```
 
 If things go smoothly, these should complete quickly. If they do not complete,
````
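If the namespace destroys hang, the hunk above points at `kubectl delete` for stuck resources. A minimal sketch, with the namespace taken from the resource names above and the pod name as a placeholder:

```bash
# See what is still holding the namespace open, then remove it by hand.
kubectl get all -n llvm-premerge-linux-runners
kubectl delete pod <stuck-pod> -n llvm-premerge-linux-runners --grace-period=0 --force
```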
````diff
@@ -184,17 +198,17 @@ version upgrades however.
 
 Start by destroying the helm chart:
 ```bash
-terraform destroy -target helm_release.github_actions_runner_controller
+terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_controller
 ```
 
 Then delete the namespace to ensure there are no dangling resources
 ```bash
-terraform destroy -target kubernetes_namespace.llvm_premerge_controller
+terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_controller
 ```
 
 ### Bumping the Version Number
 
-This is not necessary only for bumping the version of ARC. This involves simply
+This is necessary only for bumping the version of ARC. This involves simply
 updating the version field for the `helm_release` objects in `main.tf`. Make sure
 to commit the changes and push them to `llvm-zorg` to ensure others working on
 the terraform configuration have an up to date state when they pull the repository.
````
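Reinstalling after the version bump is not spelled out in this hunk; a plausible sketch, mirroring the module paths used in the destroy commands above, is a targeted apply followed by a full apply:

```bash
# Recreate the controller helm release (and its namespace dependency) on the
# torn-down cluster, then reconcile everything else.
terraform apply -target module.premerge_cluster_us_central.helm_release.github_actions_runner_controller
terraform apply
```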

premerge/monitoring.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -3,7 +3,7 @@
 Presubmit monitoring is provided by Grafana.
 The dashboard link is [https://llvm.grafana.net/dashboards](https://llvm.grafana.net/dashboards).
 
-Grafana pulls its data from 2 sources: the GCP Kubernetes cluster & GitHub.
+Grafana pulls its data from 2 sources: the GCP Kubernetes clusters & GitHub.
 Grafana instance access is restricted, but there is a publicly visible dashboard:
 - [Public dashboard](https://llvm.grafana.net/public-dashboards/21c6e0a7cdd14651a90e118df46be4cc)
 
````
