# Autoscaling and Load Balancing Generative AI w/ Triton Server and TensorRT-LLM

-Setting up autoscaling and load balancing using Triton Inference Server, TensorRT-LLM or vLLM, and Kubernetes is not difficult,
-but it does require preparation.
+Setting up autoscaling and load balancing for large language models served by Triton Inference Server is not difficult,
+but it does require preparation.

-This guide aims to help you automated acquisition of models from Hugging Face, minimize time spent optimizing models for
-TensorRT, and configuring automatic scaling and load balancing for your models. This guide does not cover Kubernetes'
+This guide outlines the steps to download models from Hugging Face, optimize them for TensorRT, and configure automatic
+scaling and load balancing in Kubernetes. This guide does not cover Kubernetes'
basics, secure ingress/egress from your cluster to external clients, nor cloud provider interfaces or implementations of
Kubernetes.

-The intent of setting up autoscaling is that it provides the ability for LLM based services to automatically and dynamically
-adapt to the current workload intensity.
-As the number of clients generating inference requests for a given Triton Server deployment, load will increase on the server
+When configured properly, autoscaling enables LLM-based services to allocate and deallocate resources automatically based on
+the current load.
+In this tutorial, as the number of clients grows for a given Triton Server deployment, the inference load on the server increases
and the queue-to-compute ratio will eventually cause the horizontal pod autoscaler to increase the number of Triton Server
-instancing handle requests until the desired ratio is achieved.
+instances handling requests until the desired ratio is achieved.
Inversely, decreasing the number of clients will reduce the number of Triton Server instances deployed.
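
For reference, a minimal sketch of the kind of autoscaling policy described above is shown below. It assumes a Deployment named `triton-server` and a custom per-pod metric, hypothetically named `triton:queue_compute:ratio`, already exposed to Kubernetes through a Prometheus adapter; the names, replica bounds, and target value are placeholders rather than this tutorial's exact configuration.

```bash
# Sketch only: create a HorizontalPodAutoscaler that scales a hypothetical
# "triton-server" Deployment on a queue-to-compute ratio custom metric.
# The metric must already be served by a Prometheus adapter in the cluster.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: triton:queue_compute:ratio   # placeholder custom metric name
        target:
          type: AverageValue
          averageValue: 1000m                # scale out when the ratio exceeds ~1.0
EOF
```
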
@@ -73,7 +73,7 @@ Prior to beginning this guide/tutorial you will need a couple of things.

## Cluster Setup

-The following instructions are setting up Horizontal Pod Autoscaling (HPA) for Triton Server in a Kubernetes cluster.
+The following instructions detail how to set up Horizontal Pod Autoscaling (HPA) for Triton Inference Server in a Kubernetes cluster.

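Once the setup described below is complete, the scaling behavior can be observed with ordinary `kubectl` commands; the HPA name and pod label used here are placeholders for whatever names your deployment uses.

```bash
# Watch the autoscaler's observed metric value and desired replica count change under load.
kubectl get hpa triton-server --watch

# In another terminal, watch Triton Server pods being added and removed to match.
kubectl get pods --selector app=triton-server --watch
```
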

### Prerequisites
@@ -123,8 +123,8 @@ capabilities.

#### NVIDIA Device Plugin for Kubernetes

-1. This step is unnecessary if the Device Plugin has already been installed in your cluster.
-   Cloud provider turnkey Kubernetes clusters, such as those from AKS, EKS, and GKE, often have the Device Plugin
-   automatically once a GPU node as been added to the cluster.
+1. This step is not needed if the Device Plugin has already been installed in your cluster.
+   Turnkey Kubernetes offerings from cloud providers, such as AKS, EKS, and GKE, often install the Device Plugin
+   automatically when a GPU node is added to the cluster.

   To check if your cluster requires the NVIDIA Device Plugin for Kubernetes, run the following command and inspect
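
   The remainder of this step falls outside this excerpt. As a rough sketch, the check and a common Helm-based installation look like the following; the repository alias, release name, and namespace are assumptions taken from the Device Plugin project's public instructions, so verify them against that project's documentation.

   ```bash
   # Look for an existing device plugin daemonset anywhere in the cluster.
   kubectl get daemonsets --all-namespaces | grep -i nvidia-device-plugin

   # If none is found, one common way to install the plugin is via its Helm chart.
   helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
   helm repo update
   helm install nvidia-device-plugin nvdp/nvidia-device-plugin --namespace kube-system
   ```
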
@@ -154,9 +154,9 @@ capabilities.

#### NVIDIA GPU Feature Discovery Service

-1. This step is unnecessary if the service has already been installed in your cluster.
+1. This step is not needed if the service has already been installed in your cluster.

-   To check if your cluster requires the NVIDIA Device Plugin for Kubernetes, run the following command and inspect
-   the output for `nvidia-device-plugin-daemonset`.
+   To check if your cluster requires the NVIDIA GPU Feature Discovery Service, run the following command and inspect
+   the output for `gpu-feature-discovery`.

   ```bash
@@ -197,9 +197,9 @@ capabilities.

### Metrics Collection Services

-Your cluster is now up, running, and can even assign GPU resources to containers.
-Next, we have to setup metrics collection for DCGM and Triton Server.
-Metrics provide insight to the Kubernetes Horizontal Pod Autoscaler service and enable it to make autoscaling decisions based
-on the utilization and availability of deployed models.
+Your cluster is now up, running, and can assign GPU resources to containers.
+Next, we have to set up metrics collection for DCGM and Triton Server.
+Metrics services provide utilization and availability data to the Kubernetes Horizontal Pod Autoscaler, which uses that data
+to make autoscaling decisions for deployed models.

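As a quick sanity check (a sketch that assumes a Deployment named `triton-server` and Triton's default metrics port of 8002), the raw Prometheus counters behind those decisions can be inspected directly:

```bash
# Forward Triton Server's metrics port to the local machine (8002 is the default).
kubectl port-forward deployment/triton-server 8002:8002 &

# Queue and compute durations are the raw inputs for a queue-to-compute ratio metric.
curl -s http://localhost:8002/metrics \
  | grep -E 'nv_inference_(queue|compute_infer)_duration_us'
```
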
#### Create a Monitoring Namespace
@@ -247,7 +247,7 @@ Using the following steps, we'll install the Prometheus Stack for Kubernetes Hel
-The best solution for management of GPUs in your cluster is
-[NVIDIA DCGM](https://docs.nvidia.com/data-center-gpu-manager-dcgm)(DCGM).
+The best solution for managing GPUs in your cluster is
+[NVIDIA Data Center GPU Manager](https://docs.nvidia.com/data-center-gpu-manager-dcgm) (DCGM).
However, for this example we do not need the entirety of the DCGM stack.
-Instead, we'll use the steps below to install the just [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) to enable the
+Instead, we'll use the steps below to install just the [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) to enable the
collection of GPU metrics in your cluster.

1. Add the NVIDIA DCGM chart repository to the local cache.
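
   The command for this step sits outside the excerpt. For reference, the DCGM Exporter project documents a Helm-based installation along these lines; the repository alias and release name are arbitrary, the namespace is assumed to be the monitoring namespace created earlier, and the URL should be verified against the DCGM Exporter README.

   ```bash
   # Add and refresh the DCGM Exporter chart repository.
   helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
   helm repo update

   # Install the exporter so GPU metrics are collected in the monitoring namespace.
   helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring
   ```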