Commit 392bb00

xuzhao9 authored and facebook-github-bot committed
Deploy the H100 runner infra (#24)
Summary: Add H100 runner CI infra

Pull Request resolved: #24

Test Plan: #25
<img width="790" alt="image" src="https://github.com/user-attachments/assets/9980f74f-c47a-4c9f-9f99-e6e3f265c59b">

Reviewed By: FindHao

Differential Revision: D65139600

Pulled By: xuzhao9

fbshipit-source-id: 229fe3a8c98c4450b11bc64e437a7018942d56ad
1 parent 72ddac9 commit 392bb00

3 files changed: +491 -0 lines changed

docker/infra/README.md

+134
@@ -0,0 +1,134 @@
# TritonBench Infra Configuration on Google Cloud Platform

This document defines the specification of the infrastructure used by TorchBench CI.
The infra is a Kubernetes cluster built on top of Google Cloud Platform.

## Step 1: Create the cluster and install the ARC Controller

```
# Get credentials for the cluster so that kubectl can use it
gcloud container clusters get-credentials --location us-central1 tritonbench-h100-cluster

# Install the ARC controller
INSTALLATION_NAME="gcp-h100-runner"
NAMESPACE="arc-systems"
helm install "${INSTALLATION_NAME}" \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller
```

### Maintenance

To uninstall the ARC controller:

```
INSTALLATION_NAME="gcp-h100-runner"
NAMESPACE="arc-systems"
helm uninstall -n "${NAMESPACE}" "${INSTALLATION_NAME}"
```

To inspect the controller installation logs:

```
NAMESPACE="arc-systems"
kubectl get pods -n "${NAMESPACE}"
# get the pod name like gcp-h100-runner-gha-rs-controller-...
kubectl logs -n "${NAMESPACE}" gcp-h100-runner-gha-rs-controller-...
```

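Copying the pod name by hand can be avoided with a small helper. The following is a minimal sketch, not part of the original setup: `find_pod` and `pod_logs` are hypothetical helper names, and the sketch assumes the pod name starts with the prefix shown in the comment above.

```shell
# Print the first pod in a namespace whose name starts with a prefix.
# The output has the form "pod/<name>", which `kubectl logs` accepts directly.
# Usage: find_pod <namespace> <name-prefix>
find_pod() {
  kubectl get pods -n "$1" -o name | grep "^pod/$2" | head -n 1
}

# Show the logs of the first pod matching the prefix, or fail if none exists.
# Usage: pod_logs <namespace> <name-prefix>
pod_logs() {
  pod="$(find_pod "$1" "$2")"
  if [ -z "${pod}" ]; then
    echo "no pod matching '$2' in namespace '$1'" >&2
    return 1
  fi
  kubectl logs -n "$1" "${pod}"
}
```

For example, `pod_logs arc-systems gcp-h100-runner-gha-rs-controller` prints the controller logs without looking up the generated pod suffix first.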
## Step 2: Create secrets and assign them to the namespaces

The secrets need to be added to both `arc-systems` and `arc-runners` namespaces.

```
# Set GitHub App secret
# (shown for arc-runners; repeat with --namespace=arc-systems)
kubectl create secret generic arc-secret \
    --namespace=arc-runners \
    --from-literal=github_app_id="${GITHUB_APP_ID}" \
    --from-literal=github_app_installation_id="${GITHUB_APP_INSTALL_ID}" \
    --from-file=github_app_private_key="${GITHUB_APP_PRIVKEY_FILE}"

# Alternatively, set classic PAT
kubectl create secret generic arc-secret \
    --namespace=arc-runners \
    --from-literal=github_token="<GITHUB_PAT>"
```

To get, delete, or update the secrets:

```
# Get
kubectl get -A secrets
# Delete
kubectl delete secrets -n arc-runners arc-secret
# Update
kubectl edit secrets -n arc-runners arc-secret
```

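Since the secret is needed in both namespaces, creating it can be wrapped in a loop. This is a minimal sketch for the classic PAT variant only. It assumes the token is exported in a hypothetical `GITHUB_PAT` environment variable and that both namespaces already exist. `kubectl create secret` fails on a missing namespace, so `kubectl create namespace arc-runners` may be needed first if the runner scale set has not been installed yet.

```shell
# Create the same PAT-based secret in every namespace given as an argument.
# Assumes GITHUB_PAT (a hypothetical variable name) holds the token.
# Usage: create_arc_secret <namespace>...
create_arc_secret() {
  if [ -z "${GITHUB_PAT:-}" ]; then
    echo "GITHUB_PAT is not set" >&2
    return 1
  fi
  for ns in "$@"; do
    kubectl create secret generic arc-secret \
      --namespace="${ns}" \
      --from-literal=github_token="${GITHUB_PAT}" || return 1
  done
}
```

Calling `create_arc_secret arc-systems arc-runners` then creates `arc-secret` in both namespaces and stops at the first failure.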
## Step 3: Install runner scale set

```
INSTALLATION_NAME="gcp-h100-runner"
NAMESPACE="arc-runners"
GITHUB_SECRET_NAME="arc-secret"
helm install "${INSTALLATION_NAME}" \
    --namespace "${NAMESPACE}" \
    --create-namespace \
    -f values.yaml \
    oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set
```

To upgrade or uninstall the runner scale set:

```
# command to upgrade
helm upgrade --install gcp-h100-runner -n arc-runners -f ./values.yaml oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set

# command to uninstall
helm uninstall -n arc-runners gcp-h100-runner
```

To inspect runner scale set logs:

```
kubectl get pods -n arc-runners
# get arc runner name like gcp-h100-runner-...
# inspect the logs
kubectl logs -n arc-runners gcp-h100-runner-...
```

101+
## Step 4: Install NVIDIA driver on the K8s host machine
102+
103+
```
104+
kubectl apply -f daemonset.yaml
105+
```
106+
107+
When the host machine runs Ubuntu, use the following command to find all available driver versions:
108+
109+
```
110+
gsutil ls gs://ubuntu_nvidia_packages/
111+
112+
# For example:
113+
# gsutil ls gs://ubuntu_nvidia_packages/nvidia-driver-gke_jammy-5.15.0-1048-gke-535.104.05_amd64.deb
114+
```
115+
116+
117+
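The bucket listing contains one package per kernel build, so the same driver version can appear many times. The sketch below reduces the listing to unique driver versions. It assumes every package follows the naming pattern of the single example above (`...-gke-<driver version>_amd64.deb`). `driver_version` and `list_driver_versions` are hypothetical helper names.

```shell
# Extract the driver version from a package name such as
# nvidia-driver-gke_jammy-5.15.0-1048-gke-535.104.05_amd64.deb
# Prints nothing if the name does not match the assumed pattern.
driver_version() {
  basename "$1" | sed -n 's/.*-gke-\([0-9][0-9.]*\)_amd64\.deb$/\1/p'
}

# List the unique driver versions available in the package bucket.
list_driver_versions() {
  gsutil ls gs://ubuntu_nvidia_packages/ | while read -r pkg; do
    driver_version "${pkg}"
  done | sort -u
}
```

Running `list_driver_versions` gives a short list of candidate values for the driver version pinned in `daemonset.yaml`.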
## Troubleshooting

If all the Pods are created but stay in the `Pending` state, one possible cause is that a newer
NVIDIA driver version was published and the old version was deleted, so the driver installer can no longer find it.

To check if the NVIDIA driver is installed correctly:

```
kubectl -n kube-system logs -f daemonset.apps/nvidia-driver-installer -c nvidia-driver-installer
```

If the NVIDIA driver file is not found, find the available versions using

```
gsutil ls gs://ubuntu_nvidia_packages/
```

and update the version in `daemonset.yaml`.
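
Before reading the installer logs, it can help to confirm which Pods are pending and whether the nodes advertise any GPUs. The following is a rough sketch with hypothetical helper names. It assumes the runner Pods request the `nvidia.com/gpu` resource, which cannot be confirmed here because `values.yaml` is not shown.

```shell
# List pods in a namespace that are stuck in the Pending phase.
# Usage: pending_pods <namespace>
pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending -o name
}

# Show how many GPUs each node advertises to the scheduler.
# A GPU column showing <none> may indicate that the driver is not
# installed on that node yet.
gpu_capacity() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}
```

If `pending_pods arc-runners` lists Pods while `gpu_capacity` shows no GPUs, the driver installer logs above are the next place to look.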

docker/infra/daemonset.yaml

+77
@@ -0,0 +1,77 @@
# Copyright 2017 Google Inc. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
  namespace: kube-system
  labels:
    k8s-app: nvidia-driver-installer
spec:
  selector:
    matchLabels:
      k8s-app: nvidia-driver-installer
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: nvidia-driver-installer
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
              - key: cloud.google.com/gke-gpu-driver-version
                operator: DoesNotExist
      tolerations:
      - operator: "Exists"
      volumes:
      - name: dev
        hostPath:
          path: /dev
      - name: boot
        hostPath:
          path: /boot
      - name: root-mount
        hostPath:
          path: /
      initContainers:
      - image: gke-nvidia-installer:fixed
        name: nvidia-driver-installer
        resources:
          requests:
            cpu: 150m
        securityContext:
          privileged: true
        volumeMounts:
        - name: boot
          mountPath: /boot
        - name: dev
          mountPath: /dev
        - name: root-mount
          mountPath: /root
        # env:
        # - name: NVIDIA_DRIVER_VERSION
        # value: latest
        # value: "535.161.07"
      containers:
      - image: "gke.gcr.io/pause:3.8@sha256:880e63f94b145e46f1b1082bb71b85e21f16b99b180b9996407d61240ceb9830"
        name: pause
