
Commit 2a20922

Pre-commit fixes

1 parent 66481c4 commit 2a20922

File tree

9 files changed, +60 −30 lines changed


.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -65,7 +65,7 @@ repos:
       - id: check-json
       - id: check-toml
       - id: check-yaml
-        exclude: ^Deployment/Kubernetes/[^/]+/chart/templates/.+$
+        exclude: ^Deployment/Kubernetes/.+$
       - id: check-shebang-scripts-are-executable
       - id: end-of-file-fixer
         types_or: [c, c++, cuda, proto, textproto, java, python]
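To see what the broadened `check-yaml` exclude changes in practice, here is a minimal sketch (not part of the commit) comparing the old and new patterns against a few illustrative paths; the example paths are assumptions, not an exhaustive list from the repo. The old single-segment `[^/]+` only skipped templates directly under `Deployment/Kubernetes/<dir>/chart/templates/`, which appears to miss the more deeply nested `multinode_helm_chart/chart/templates/` files, while the new pattern skips everything under `Deployment/Kubernetes/`:

```python
import re

# Old and new exclude patterns from the hook configuration above.
OLD = re.compile(r"^Deployment/Kubernetes/[^/]+/chart/templates/.+$")
NEW = re.compile(r"^Deployment/Kubernetes/.+$")

# Illustrative paths (assumptions for demonstration purposes).
paths = [
    "Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/multinode_helm_chart/chart/templates/deployment.yaml",
    "Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/p5-trtllm-cluster-config.yaml",
    "Conceptual_Guide/some_other_config.yaml",
]

for path in paths:
    print(f"{path}")
    print(f"  old excludes: {bool(OLD.match(path))}  new excludes: {bool(NEW.match(path))}")
```

Running this shows the nested Helm template is excluded only by the new pattern, while files outside `Deployment/Kubernetes/` are still checked.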

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/2. Configure_EKS_Cluster.md

Lines changed: 4 additions & 4 deletions
@@ -1,7 +1,7 @@
 # Steps to set up cluster

-In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
-*
+In this guide we will set up the Kubernetes cluster for the deployment of LLMs using Triton Server and TRT-LLM.
+*
 ## 1. Add node label and taint

 As first step we will add node labels and taints
@@ -98,7 +98,7 @@ In you local browser, you should be able to see metrics in `localhost:8080`.

 ## 7. Install Prometheus Adapter

-This allows the Triton metrics collected by Prometheus server to be available to Kuberntes' Horizontal Pod Autoscaler service.
+This allows the Triton metrics collected by Prometheus server to be available to Kubernetes' Horizontal Pod Autoscaler service.

 ```
 helm install -n monitoring prometheus-adapter prometheus-community/prometheus-adapter \
@@ -125,7 +125,7 @@ This generates custom metrics from a formula that uses the Triton metrics collec
 kubectl apply -f triton-metrics_prometheus-rule.yaml
 ```

-At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be availble to Prometheus server now. You can verify by showing all metrics in Prometheus server:
+At this point, all metrics components should have been installed. All metrics including Triton metrics, DCGM metrics, and custom metrics should be available to Prometheus server now. You can verify by showing all metrics in Prometheus server:

 ```
 kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 8080:9090
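If you prefer to script that verification step, here is a small sketch (not from the guide) that lists metric names through the standard Prometheus HTTP API while the port-forward to `localhost:8080` is running; the `nv_` and `DCGM_` prefixes are the usual ones for Triton server and DCGM exporter metrics.

```python
import json
import urllib.request

# Standard Prometheus HTTP API endpoint for listing metric names;
# assumes the port-forward above (localhost:8080 -> prometheus:9090) is active.
URL = "http://localhost:8080/api/v1/label/__name__/values"

with urllib.request.urlopen(URL, timeout=10) as resp:
    names = json.load(resp)["data"]

# Triton server metrics are prefixed with nv_, DCGM exporter metrics with DCGM_.
for name in sorted(names):
    if name.startswith(("nv_", "DCGM_")):
        print(name)
```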

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/3. Deploy_Triton.md

Lines changed: 7 additions & 7 deletions
@@ -87,7 +87,7 @@ trtllm-build --checkpoint_dir ./converted_checkpoint \
     --use_custom_all_reduce disable \ # only disable on non-NVLink machines like g5.12xlarge
     --max_input_len 2048 \
     --max_output_len 2048 \
-    --max_batch_size 4
+    --max_batch_size 4
 ```

 ### c. Prepare the Triton model repository
@@ -108,7 +108,7 @@ python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton
 ```

 > [!Note]
-> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository shoudl be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.
+> Be sure to substitute the correct values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` in the example above. Keep in mind that the tokenizer, the TRT-LLM engines, and the Triton model repository should be in a shared file storage between your nodes. They're required to launch your model in Triton. For example, if using AWS EFS, the values for `<PATH_TO_TOKENIZER>` and `<PATH_TO_ENGINES>` should be respect to the actutal EFS mount path. This is determined by your persistent-volume claim and mount path in chart/templates/deployment.yaml. Make sure that your nodes are able to access these files.

 ## 3. Create `example_values.yaml` file for deployment

@@ -177,7 +177,7 @@ kubectl logs --follow leaderworkerset-sample-0
 You should output something similar to below:

 ```
-I0717 23:01:28.501008 300 server.cc:674]
+I0717 23:01:28.501008 300 server.cc:674]
 +----------------+---------+--------+
 | Model          | Version | Status |
 +----------------+---------+--------+
@@ -187,7 +187,7 @@ I0717 23:01:28.501008 300 server.cc:674]
 | tensorrt_llm   | 1       | READY  |
 +----------------+---------+--------+

-I0717 23:01:28.501073 300 tritonserver.cc:2579]
+I0717 23:01:28.501073 300 tritonserver.cc:2579]
 +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 | Option                           | Value |
 +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -347,9 +347,9 @@ kubectl logs -f $(kubectl get pods | grep launcher | cut -d ' ' -f 1)
 You should output something similar to below (example of 2 x g5.12xlarge):

 ```
-[1,0]<stdout>:# out-of-place in-place
+[1,0]<stdout>:# out-of-place in-place
 [1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
-[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
 [1,0]<stdout>: 8 2 float sum -1[1,0]<stdout>: 99.10 0.00 0.00 0[1,0]<stdout>: 100.6 0.00 0.00 0
 [1,0]<stdout>: 16 4 float sum -1[1,0]<stdout>: 103.4 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
 [1,0]<stdout>: 32 8 float sum -1[1,0]<stdout>: 103.5 0.00 0.00 0[1,0]<stdout>: 102.5 0.00 0.00 0
@@ -429,7 +429,7 @@ genai-perf \
 You should output something similar to below (example of Mixtral 8x7B on 2 x g5.12xlarge):

 ```
-LLM Metrics
+LLM Metrics
 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
 ┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
 ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
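Beyond eyeballing the log and benchmark excerpts above, a quick scripted check against Triton's standard HTTP/REST (KServe v2) API can confirm the deployment is serving; the sketch below is illustrative only, and the service address is an assumption (8000 is Triton's default HTTP port).

```python
import urllib.request

# Hypothetical address of the LoadBalancer service fronting the leader pod.
SERVER = "http://<LOADBALANCER_DNS>:8000"

# /v2/health/ready is part of Triton's standard HTTP API; it returns 200
# only when the server and the models shown as READY above are loaded.
try:
    with urllib.request.urlopen(f"{SERVER}/v2/health/ready", timeout=5) as resp:
        print("ready" if resp.status == 200 else f"unexpected status {resp.status}")
except Exception as err:
    print(f"not ready: {err}")
```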

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ We have 1 pod per node, so the main challenge in deploying models that require m

 1. **LeaderWorkerSet for launching Triton+TRT-LLM on groups of pods:** To launch Triton and TRT-LLM across nodes you use MPI to have one node launch TRT-LLM processes on all the nodes (including itself) that will make up one instance of the model. Doing this requires knowing the hostnames of all involved nodes. Consequently we need to spawn groups of pods and know which model instance group they belong to. To achieve this we use [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws/tree/main), which lets us create "megapods" that consist of a group of pods - one leader pod and a specified number of worker pods - and provides pod labels identifying group membership. We configure the LeaderWorkerSet and launch Triton+TRT-LLM via MPI in [`deployment.yaml`](multinode_helm_chart/chart/templates/deployment.yaml) and [server.py](multinode_helm_chart/containers/server.py).
 2. **Gang Scheduling:** Gang scheduling simply means ensuring all pods that make up a model instance are ready before Triton+TRT-LLM is launched. We show how to use `kubessh` to achieve this in the `wait_for_workers` function of [server.py](multinode_helm_chart/containers/server.py).
-3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in reponse to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
+3. **Autoscaling:** By default the Horizontal Pod Autoscaler (HPA) scales individual pods, but LeaderWorkerSet makes it possible to scale each "megapod". However, since these are GPU workloads we don't want to use cpu and host memory usage for autoscaling. We show how to leverage the metrics Triton Server exposes through Prometheus and set up GPU utilization recording rules in [`triton-metrics_prometheus-rule.yaml`](multinode_helm_chart/triton-metrics_prometheus-rule.yaml). We also demonstrate how to properly set up PodMonitors and an HPA in [`pod-monitor.yaml`](multinode_helm_chart/chart/templates/pod-monitor.yaml) and [`hpa.yaml`](multinode_helm_chart/chart/templates/hpa.yaml) (the key is to only scrape metrics from the leader pods). Instructions for properly setting up Prometheus and exposing GPU metrics are found in [Configure EKS Cluster and Install Dependencies](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md). To enable deployment to dynamically add more nodes in response to HPA, we also setup [Cluster Autoscaler](https://github.com/Wenhan-Tan/EKS_Multinode_Triton_TRTLLM/blob/main/Cluster_Setup_Steps.md#10-install-cluster-autoscaler)
 4. **LoadBalancer Setup:** Although there are multiple pods in each instance of the model, only one pod within each group accepts requests. We show how to correctly set up a LoadBalancer Service to allow external clients to submit requests in [`service.yaml`](multinode_helm_chart/chart/templates/service.yaml)

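To illustrate the "only scrape metrics from the leader pods" point in item 3, here is a sketch (not part of the chart) that selects leader pods with the Kubernetes Python client; the LWS label name and value are an assumption here, so check the labels on your own pods with `kubectl get pods --show-labels`.

```python
from kubernetes import client, config

# Load kubeconfig the same way kubectl does.
config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed LWS convention: leader pods carry worker-index 0.
leaders = v1.list_namespaced_pod(
    namespace="default",
    label_selector="leaderworkerset.sigs.k8s.io/worker-index=0",
)
for pod in leaders.items:
    print(pod.metadata.name, pod.status.pod_ip)
```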

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/multinode_helm_chart/aws-efa-k8s-device-plugin/README.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ helm install efa ./aws-efa-k8s-device-plugin -n kube-system

 # Configuration

-Paramter | Description | Default
+Parameter | Description | Default
 --- | --- | ---
 `image.repository` | EFA image repository | `602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/aws-efa-k8s-device-plugin`
 `image.tag` | EFA image tag | `v0.5.3`
@@ -31,7 +31,7 @@ Paramter | Description | Default
 `nodeSelector` | Node labels for pod assignment | `{}`
 `tolerations` | Optional deployment tolerations | `[]`
 `additionalPodAnnotations` | Pod annotations to apply in addition to the default ones | `{}`
-`additionalPodLabels` | Pod labels to apply in addition to the defualt ones | `{}`
+`additionalPodLabels` | Pod labels to apply in addition to the default ones | `{}`
 `nameOverride` | Override the name of the chart | `""`
 `fullnameOverride` | Override the full name of the chart | `""`
 `imagePullSecrets` | Docker registry pull secret | `[]`

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/multinode_helm_chart/chart/values.schema.json

Lines changed: 1 addition & 1 deletion
@@ -127,7 +127,7 @@
     },
     "required": [
       "image",
-      "triton_model_repo_path"
+      "triton_model_repo_path"
     ],
     "type": "object"
   },
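As a side note, the `required` block above can be exercised locally before running `helm install` (Helm also validates values against `values.schema.json` itself); a minimal sketch follows, where the file paths and the `jsonschema`/`PyYAML` dependencies are assumptions.

```python
import json

import yaml  # PyYAML
from jsonschema import ValidationError, validate

# Assumed relative paths; adjust to where the chart and your values file live.
with open("multinode_helm_chart/chart/values.schema.json") as f:
    schema = json.load(f)
with open("example_values.yaml") as f:
    values = yaml.safe_load(f)

try:
    validate(instance=values, schema=schema)
    print("example_values.yaml satisfies values.schema.json")
except ValidationError as err:
    location = "/".join(str(p) for p in err.path) or "<root>"
    print(f"schema violation at {location}: {err.message}")
```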

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/multinode_helm_chart/containers/README.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@

 # Container Generation

-The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA componenets.
+The files in this folder are intended to be used to create the custom container image for multi-node Triton + TRT-LLM EKS deployment including installation of EFA components.

 Run the following command to create the container image.

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/multinode_helm_chart/containers/server.py

Lines changed: 38 additions & 8 deletions
@@ -26,6 +26,7 @@
 EXIT_SUCCESS = 0
 DELAY_BETWEEN_QUERIES = 2

+
 def die(exit_code: int):
     if exit_code is None:
         exit_code = ERROR_CODE_FATAL
@@ -36,10 +37,17 @@ def die(exit_code: int):

     exit(exit_code)

+
 def parse_arguments():
     parser = argparse.ArgumentParser()
     parser.add_argument("mode", type=str, choices=["leader", "worker"])
-    parser.add_argument("--triton_model_repo_dir", type=str, default=None,required=True,help="Directory that contains Triton Model Repo to be served")
+    parser.add_argument(
+        "--triton_model_repo_dir",
+        type=str,
+        default=None,
+        required=True,
+        help="Directory that contains Triton Model Repo to be served",
+    )
     parser.add_argument("--pp", type=int, default=1, help="Pipeline parallelism.")
     parser.add_argument("--tp", type=int, default=1, help="Tensor parallelism.")
     parser.add_argument("--iso8601", action="count", default=0)
@@ -55,11 +63,19 @@ def parse_arguments():
         type=int,
         help="How many gpus are in each pod/node (We launch one pod per node). Only required in leader mode.",
     )
-    parser.add_argument("--stateful_set_group_key",type=str,default=None,help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode")
-    parser.add_argument("--enable_nsys", action="store_true", help="Enable Triton server profiling")
+    parser.add_argument(
+        "--stateful_set_group_key",
+        type=str,
+        default=None,
+        help="Value of leaderworkerset.sigs.k8s.io/group-key, Leader uses this to gang schedule and its only needed in leader mode",
+    )
+    parser.add_argument(
+        "--enable_nsys", action="store_true", help="Enable Triton server profiling"
+    )

     return parser.parse_args()

+
 def run_command(cmd_args: [str], omit_args: [int] = None):
     command = ""

@@ -75,10 +91,12 @@ def run_command(cmd_args: [str], omit_args: [int] = None):

     return subprocess.call(cmd_args, stderr=sys.stderr, stdout=sys.stdout)

+
 def signal_handler(sig, frame):
     write_output(f"Signal {sig} detected, quitting.")
     exit(EXIT_SUCCESS)

+
 def wait_for_workers(num_total_pod: int, args):
     if num_total_pod is None or num_total_pod <= 0:
         raise RuntimeError("Argument `world_size` must be greater than zero.")
@@ -131,14 +149,19 @@ def wait_for_workers(num_total_pod: int, args):

     return workers

+
 def write_output(message: str):
     print(message, file=sys.stdout, flush=True)

+
 def write_error(message: str):
     print(message, file=sys.stderr, flush=True)

+
 def do_leader(args):
-    write_output(f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node")
+    write_output(
+        f"Server is assuming each node has {args.gpu_per_node} GPUs. To change this, use --gpu_per_node"
+    )

     world_size = args.tp * args.pp

@@ -152,9 +175,11 @@ def do_leader(args):
     workers = wait_for_workers(world_size / args.gpu_per_node, args)

     if len(workers) != (world_size / args.gpu_per_node):
-        write_error(f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}.")
+        write_error(
+            f"fatal: {len(workers)} found, expected {world_size / args.gpu_per_node}."
+        )
         die(ERROR_EXIT_DELAY)
-
+
     workers_with_mpi_slots = [worker + f":{args.gpu_per_node}" for worker in workers]

     if args.enable_nsys:
@@ -241,17 +266,21 @@ def do_leader(args):

     exit(result)

+
 def do_worker(args):
     signal.signal(signal.SIGINT, signal_handler)
     signal.signal(signal.SIGTERM, signal_handler)

     write_output("Worker paused awaiting SIGINT or SIGTERM.")
     signal.pause()

+
 def main():
     write_output("Reporting system information.")
     run_command(["whoami"])
-    run_command(["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"])
+    run_command(
+        ["cgget", "-n", "--values-only", "--variable memory.limit_in_bytes", "/"]
+    )
     run_command(["nvidia-smi"])

     args = parse_arguments()
@@ -275,5 +304,6 @@ def main():
         write_error(f' Supported values are "init" or "exec".')
         die(ERROR_CODE_USAGE)

-if __name__ == '__main__':
+
+if __name__ == "__main__":
     main()
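For readers skimming the reformatted `do_leader` logic, here is a small worked example (illustration only, with assumed parallelism values) of how the gang size passed to `wait_for_workers` falls out of the arguments.

```python
# Illustration only: values chosen for a hypothetical 2-node, 8-GPU-per-node setup.
tp, pp, gpu_per_node = 8, 2, 8

world_size = tp * pp                  # 16 TRT-LLM ranks in total
num_pods = world_size / gpu_per_node  # 2.0 pods -> one leader plus one worker

print(f"world_size={world_size}, pods expected={num_pods}")
```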

Deployment/Kubernetes/EKS_Multinode_Triton_TRTLLM/p5-trtllm-cluster-config.yaml

Lines changed: 5 additions & 5 deletions
@@ -15,19 +15,19 @@ vpc:
     public:
       us-east-1a:
         id: $PLACEHOLDER_SUBNET_PUBLIC_1
-
+
 clusterEndpoints:
   privateAccess: true
   publicAccess: true
-
+
 cloudwatch:
   clusterLogging:
-    enableTypes: ["*"]
+    enableTypes: ["*"]

 iam:
   withOIDC: true

-
+
 managedNodeGroups:
   - name: cpu-node-group
     instanceType: c5.2xlarge
@@ -45,7 +45,7 @@ managedNodeGroups:
         albIngress: true
   - name: gpu-compute-node-group
     instanceType: p5.48xlarge
-    instancePrefix: trtllm-compute-node
+    instancePrefix: trtllm-compute-node
     privateNetworking: true
     efaEnabled: true
     minSize: 0
