Commit dceba28

whoisj and nnshah1 committed Jun 12, 2024

Gen AI Autoscaling Tutorial: improvement based on review

This change includes a number of improvements suggested by @nnshah1.

Co-authored-by: Neelay Shah <neelays@nvidia.com>

1 parent 70d533a · commit dceba28
File tree: 1 file changed, +13 -13 lines

  • Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing

Deployment/Kubernetes/TensorRT-LLM_Autoscaling_and_Load_Balancing/README.md
Lines changed: 13 additions & 13 deletions

```diff
@@ -16,17 +16,17 @@
 
 # Autoscaling and Load Balancing Generative AI w/ Triton Server and TensorRT-LLM
 
-Setting up autoscaling and load balancing using Triton Inference Server, TensorRT-LLM or vLLM, and Kubernetes is not difficult,
-but it does require preparation.
+Setting up autoscaling and load balancing for large language models served by Triton Inference Server is not difficult,
+but does require preparation.
 
-This guide aims to help you automated acquisition of models from Hugging Face, minimize time spent optimizing models for
+This guide outlines the steps to download models from Hugging Face, optimize them for TensorRT, and configure automatic scaling and load balancing in Kubernetes.
 TensorRT, and configuring automatic scaling and load balancing for your models. This guide does not cover Kubernetes'
 basics, secure ingress/egress from your cluster to external clients, nor cloud provider interfaces or implementations of
 Kubernetes.
 
-The intent of setting up autoscaling is that it provides the ability for LLM based services to automatically and dynamically
+When configured properly autoscaling enables LLM based services to allocate and deallocate resources automatically based on the current load.
 adapt to the current workload intensity.
-As the number of clients generating inference requests for a given Triton Server deployment, load will increase on the server
+In this tutorial, as the number of clients grow for a given Triton Server deployment, the inference load on the server increases
 and the queue-to-compute ratio will eventually cause the horizontal pod autoscaler to increase the number of Triton Server
 instancing handle requests until the desired ratio is achieved.
 Inversely, decreasing the number of clients will reduce the number of Triton Server instances deployed.
```
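
The scaling loop described in the revised introduction can be watched from the cluster while a load generator runs against the deployment. A minimal sketch, assuming the Triton Server pods carry an `app=triton` label (an illustrative name, not one taken from the tutorial):

```bash
# Watch the autoscaler's view of its metric; the TARGETS column compares the
# observed value (for example a queue-to-compute ratio) to the configured target.
kubectl get hpa --watch

# In a second terminal, watch Triton Server replicas being added or removed.
kubectl get pods --selector app=triton --watch
```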

```diff
@@ -73,7 +73,7 @@ Prior to beginning this guide/tutorial you will need a couple of things.
 
 ## Cluster Setup
 
-The following instructions are setting up Horizontal Pod Autoscaling (HPA) for Triton Server in a Kubernetes cluster.
+The following instructions detail how to set up Horizontal Pod Autoscaling (HPA) for Triton Inference Server in a Kubernetes cluster.
 
 
 ### Prerequisites
```

```diff
@@ -123,8 +123,8 @@ capabilities.
 
 #### NVIDIA Device Plugin for Kubernetes
 
-1. This step is unnecessary if the Device Plugin has already been installed in your cluster.
-Cloud provider turnkey Kubernetes clusters, such as those from AKS, EKS, and GKE, often have the Device Plugin
+1. This step is not needed if the Device Plugin has already been installed in your cluster.
+Cloud providers with turnkey Kubernetes clusters, such as those from AKS, EKS, and GKE, often install the Device Plugin automatically when a GPU node is added to the cluster.
 automatically once a GPU node as been added to the cluster.
 
 To check if your cluster requires the NVIDIA Device Plugin for Kubernetes, run the following command and inspect
```
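
One way to carry out a check along these lines; the daemonset name and namespace depend on how the plugin was installed, so treat this as a sketch rather than the tutorial's exact command:

```bash
# Look for an existing NVIDIA device plugin daemonset anywhere in the cluster.
kubectl get daemonsets --all-namespaces | grep -i nvidia-device-plugin

# When the plugin is running, GPU nodes report allocatable nvidia.com/gpu resources.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```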

````diff
@@ -154,9 +154,9 @@ capabilities.
 
 #### NVIDIA GPU Feature Discovery Service
 
-1. This step is unnecessary if the service has already been installed in your cluster.
+1. This step is not needed if the service has already been installed in your cluster.
 
-To check if your cluster requires the NVIDIA Device Plugin for Kubernetes, run the following command and inspect
+To check if your cluster requires the NVIDIA GPU Feature Discovery Service, run the following command and inspect
 the output for `nvidia-device-plugin-daemonset`.
 
 ```bash
````
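
GPU Feature Discovery advertises GPU properties as node labels, so its presence can also be inferred from the labels on GPU nodes. A sketch, assuming the commonly used `nvidia.com/gpu.product` label (label names vary with the GFD version):

```bash
# If GPU Feature Discovery is already running, GPU nodes carry labels such as
# nvidia.com/gpu.product describing the installed hardware.
kubectl get nodes --show-labels | tr ',' '\n' | grep 'nvidia.com/gpu.product'
```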

```diff
@@ -197,9 +197,9 @@ capabilities.
 
 ### Metrics Collection Services
 
-Your cluster is now up, running, and can even assign GPU resources to containers.
+Your cluster is now up, running, and can assign GPU resources to containers.
 Next, we have to setup metrics collection for DCGM and Triton Server.
-Metrics provide insight to the Kubernetes Horizontal Pod Autoscaler service and enable it to make autoscaling decisions based
+Metrics services provide utilization and availability data to the Kubernetes Horizontal Pod Autoscaler. The data can then be used to make autoscaling decisions.
 on the utilization and availability of deployed models.
 
 #### Create a Monitoring Namespace
```
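
As a rough sketch of what this part of the setup involves (the `monitoring` namespace name is only a common convention, and the custom metrics API responds only after an adapter such as prometheus-adapter is installed):

```bash
# Create a dedicated namespace for the metrics collection services.
kubectl create namespace monitoring

# Once a metrics adapter is in place, the Horizontal Pod Autoscaler reads Triton
# metrics through the custom metrics API; this should list the available metrics.
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
```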

```diff
@@ -247,7 +247,7 @@ Using the following steps, we'll install the Prometheus Stack for Kubernetes Hel
 The best solution for management of GPUs in your cluster is
 [NVIDIA DCGM](https://docs.nvidia.com/data-center-gpu-manager-dcgm)(DCGM).
 However, for this example we do not need the entirety of the DCGM stack.
-Instead, we'll use the steps below to install the just [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) to enable the
+Instead, we'll use the steps below to install just the [DCGM Exporter](https://github.com/NVIDIA/dcgm-exporter) to enable the
 collection of GPU metrics in your cluster.
 
 1. Add the NVIDIA DCGM chart repository to the local cache.
```
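
For reference, the DCGM Exporter project's own documentation installs the chart roughly as follows; the repository alias and any release options used by the tutorial may differ:

```bash
# Add the DCGM Exporter chart repository and refresh the local chart cache.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

# Install the exporter; it exposes per-GPU utilization metrics for Prometheus to scrape.
helm install --generate-name gpu-helm-charts/dcgm-exporter
```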
