@@ -31,15 +31,19 @@ worst case cause an inconsistent state.
The main part you want to look into is `Menu > Kubernetes Engine > Clusters`.

- Currently, we have 3 clusters:
+ Currently, we have 4 clusters:

- `llvm-premerge-checks`: the cluster hosting BuildKite Linux runners.
- `windows-cluster`: the cluster hosting BuildKite Windows runners.
- - `llvm-premerge-prototype`: the cluster for those GCP hoster runners.
+ - `llvm-premerge-prototype`: the first cluster for GCP-hosted runners.
+ - `llvm-premerge-cluster-us-west`: the second cluster for GCP-hosted runners.

- Yes, it's called `prototype`, but that's the production cluster.
- We should rename it at some point.
+ Yes, one is called `prototype`, but that's the production cluster.
+ We should rename it at some point. We have two clusters for GCP-hosted runners
+ to form a high-availability setup. Load is balanced across both, and if one
+ fails, the other picks up the work. This also enables seamless migrations and
+ upgrades.
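
If you would rather confirm this from a terminal than from the Cloud Console,
a minimal sketch (assuming `gcloud` is installed and authenticated against the
premerge GCP project) is:

```bash
# List the GKE clusters in the active project; the four clusters described
# above should show up here along with their locations and node counts.
gcloud container clusters list
```
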
- To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is
+ To add a VM to a cluster, the VM has to come from a `pool`. A `pool` is
a group of nodes within a cluster that all have the same configuration.

For example:
@@ -88,16 +92,21 @@ To apply any changes to the cluster:
## Setting the cluster up for the first time

```
- terraform apply -target google_container_node_pool.llvm_premerge_linux_service
- terraform apply -target google_container_node_pool.llvm_premerge_linux
- terraform apply -target google_container_node_pool.llvm_premerge_windows
+ terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux_service
+ terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_linux
+ terraform apply -target module.premerge_cluster_us_central.google_container_node_pool.llvm_premerge_windows
+ terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux_service
+ terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_linux
+ terraform apply -target module.premerge_cluster_us_west.google_container_node_pool.llvm_premerge_windows

terraform apply
```

Setting the cluster up for the first time is more involved as there are certain
resources where terraform is unable to handle explicit dependencies. This means
that we have to set up the GKE cluster before we set up any of the Kubernetes
- resources as otherwise the Terraform Kubernetes provider will error out.
+ resources as otherwise the Terraform Kubernetes provider will error out. This
+ needs to be done for both clusters before running the standard
+ `terraform apply`.

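Before running the targeted applies above it can be worth a quick pre-flight
check of the configuration. This is only a sketch of one reasonable workflow
(run from the directory containing the terraform configuration), not a
required step:

```bash
# Download providers/modules, check the configuration for syntax errors, and
# preview the planned changes before any targeted apply is run.
terraform init
terraform validate
terraform plan
```
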
## Upgrading/Resetting Github ARC
@@ -119,16 +128,21 @@ queued on the Github side. Running build jobs will complete after the helm chart
are uninstalled unless they are forcibly killed. Note that best practice dictates
the helm charts should just be uninstalled rather than also setting `maxRunners`
to zero beforehand as that can cause ARC to accept some jobs but not actually
- execute them which could prevent failover in HA cluster configurations.
+ execute them which could prevent failover in an HA cluster configuration like
+ ours.
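
One practical way to double check that nothing is still executing before you
uninstall anything is to look at the runner pods directly. This is only a
sketch: it assumes the production cluster `llvm-premerge-prototype` lives in
`us-central1-a`, and the namespace names are guesses based on the terraform
resource names.

```bash
# Grab kubectl credentials for the cluster you are about to modify.
gcloud container clusters get-credentials llvm-premerge-prototype --zone us-central1-a

# Runner pods are created per job, so empty output here should mean no jobs
# are currently executing on this cluster.
kubectl get pods -n llvm-premerge-linux-runners
kubectl get pods -n llvm-premerge-windows-runners
```
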
### Uninstalling the Helm Charts
+ For the example commands below we will be modifying the cluster in
+ `us-central1-a`. You can replace `module.premerge_cluster_us_central` with
+ `module.premerge_cluster_us_west` to switch which cluster you are working on.
+

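As a concrete illustration of that substitution, the Linux runner set destroy
from the next step would look like this when pointed at the us-west cluster:

```bash
terraform destroy -target module.premerge_cluster_us_west.helm_release.github_actions_runner_set_linux
```
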
To begin, uninstall the helm charts by using resource targeting
on a terraform destroy command:

```bash
- terraform destroy -target helm_release.github_actions_runner_set_linux
- terraform destroy -target helm_release.github_actions_runner_set_windows
+ terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_linux
+ terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_set_windows
```

These should complete, but if they do not, we are still able to get things
@@ -139,8 +153,8 @@ manually delete them with `kubectl delete`. Follow up the previous terraform
commands by deleting the kubernetes namespaces all the resources live in:

```bash
- terraform destroy -target kubernetes_namespace.llvm_premerge_linux_runners
- terraform destroy -target kubernetes_namespace.llvm_premerge_windows_runners
+ terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_linux_runners
+ terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_windows_runners
```
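
If you want to confirm that the namespaces and everything inside them are
actually gone before moving on, a quick check (assuming you still have kubectl
credentials for this cluster) is:

```bash
# Namespaces stuck in the Terminating state here usually point at dangling
# resources that will need to be cleaned up manually.
kubectl get namespaces
```
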
If things go smoothly, these should complete quickly. If they do not complete,
@@ -184,17 +198,17 @@ version upgrades however.

Start by destroying the helm chart:

```bash
- terraform destroy -target helm_release.github_actions_runner_controller
+ terraform destroy -target module.premerge_cluster_us_central.helm_release.github_actions_runner_controller
```

Then delete the namespace to ensure there are no dangling resources:

```bash
- terraform destroy -target kubernetes_namespace.llvm_premerge_controller
+ terraform destroy -target module.premerge_cluster_us_central.kubernetes_namespace.llvm_premerge_controller
```

### Bumping the Version Number

- This is not necessary only for bumping the version of ARC. This involves simply
+ This is only necessary when bumping the version of ARC. This involves simply
updating the version field for the `helm_release` objects in `main.tf`. Make sure
to commit the changes and push them to `llvm-zorg` to ensure others working on
the terraform configuration have an up-to-date state when they pull the repository.
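
A minimal sketch of that workflow, run from the directory in `llvm-zorg` that
contains `main.tf` (the grep pattern and commit message are just examples):

```bash
# Locate the pinned chart versions to edit.
grep -n 'version' main.tf

# After editing, record and share the change so other maintainers pick it up.
git add main.tf
git commit -m "Bump ARC helm chart version"
git push
```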