| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_uv | boolean | false | Use UV for faster Python package installation |
| deployment_initialization_timeout | integer | 600 (10 minutes) | The maximum time, in seconds, to wait for app initialization during a build before timing out. Value must be between 60 and 830 |
<Info>
Changes to `python_version` or `docker_base_image_url` trigger full rebuilds since
Check your build logs for these indicators:
- **UV_PIP_INSTALL_STARTED** - UV is successfully being used
- **PIP_INSTALL_STARTED** - Standard pip installation (when `use_uv` is `false`)
<Warning>
While UV is compatible with most packages, some edge cases may cause build
v4/examples/deploy-an-llm-with-tensorrtllm-tritonserver.mdx
title: "Deploy Triton Inference Server and TensorRT-LLM"
description: "Achieve high throughput with Triton Inference Server and the TensorRT-LLM framework"
---
In this tutorial, we'll show you how to deploy Llama 3.2 3B using TensorRT-LLM's PyTorch backend served through Nvidia Triton Inference Server.
The TensorRT + Triton setup delivers **15x higher throughput** with **100% reliability** compared to the baseline (vanilla deployment), while reducing latency by **7-9x** across all percentiles. See the [Performance Analysis](#performance-analysis) section for detailed test methodology and results.
You can view the final implementation [here](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).
## Why TensorRT + Triton?
TensorRT requires you to specify optimization parameters upfront, such as the GPU architecture.
NVIDIA Triton Inference Server streamlines production AI deployment by handling operational concerns that are critical for serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box.
Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning, A/B testing, and can chain multiple models into pipelines.
[Here](https://substackcdn.com/image/fetch/$s_!FEPb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4460ad-0e7e-4545-aee6-274b93dd5959_2300x2304.gif) is a diagram of how Triton works.
Below is how the two work together to handle requests:
1. Client sends text via HTTP/gRPC to Triton
2. Triton queues the request in the scheduler
3. Triton batches incoming requests (waits for more or timeout)
In order to download the model to Cerebrium, you need to be granted access.
## Implementation
All files should be placed in the same project directory.
### Triton Model Configuration
This configuration tells Triton:
- Use Python backend (runs our model.py)
- Automatically batch up to 128 requests together for efficient GPU utilization
- Use dynamic batching with a 100 microsecond queue delay to maximize batch sizes
Triton's Python backend requires implementing a `TritonPythonModel` class with three key methods:
- **`initialize(args)`**: Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.
- **`execute(requests)`**: Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured `max_batch_size`) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns responses.
class TritonPythonModel:
    def initialize(self, args):
        """Initialize TensorRT-LLM with PyTorch backend."""
        # ...
            max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
        )

        self.llm = LLM(
            model=MODEL_DIR,
            build_config=build_config,
            tensor_parallel_size=torch.cuda.device_count(),
        )
        print("✓ Model ready")

    def execute(self, requests):
        """
        Execute inference on batched requests.

        Triton automatically batches requests (up to max_batch_size: 128).
        This function processes the batch that Triton provides.
        """
        try:
            prompts = []
            sampling_params_list = []
            original_prompts = []

            # Extract data from each request in the batch. We need to look through requests:
            # https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
            for request in requests:
                try:
                    # Get input text - handle batched tensor structures
                    # ...
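Only part of `model.py` is shown above, so here is a minimal, self-contained sketch of the `TritonPythonModel` pattern it follows. This is an illustration of the structure, not the tutorial's full implementation: error handling, tokenizer/chat templating, and per-request sampling parameters are omitted, `MODEL_DIR` and the `text_input` tensor name are placeholders, and the exact TensorRT-LLM signatures (`LLM`, `BuildConfig`, `SamplingParams`) may differ by version.

```python
# Minimal sketch of a Triton python_backend model wrapping TensorRT-LLM.
import numpy as np
import triton_python_backend_utils as pb_utils  # available inside Triton's Python backend
from tensorrt_llm import LLM, BuildConfig, SamplingParams

MODEL_DIR = "/models/llama-3.2-3b"  # placeholder path to the downloaded weights


class TritonPythonModel:
    def initialize(self, args):
        # Called once when Triton loads the model.
        build_config = BuildConfig(max_batch_size=128)  # matches config.pbtxt
        self.llm = LLM(model=MODEL_DIR, build_config=build_config)
        self.sampling_params = SamplingParams(max_tokens=256)

    def execute(self, requests):
        # Triton hands over a batch of requests; return one response per request.
        prompts = []
        for request in requests:
            arr = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            prompt = arr.flatten()[0]
            prompts.append(prompt.decode("utf-8") if isinstance(prompt, bytes) else str(prompt))

        outputs = self.llm.generate(prompts, self.sampling_params)

        responses = []
        for output in outputs:
            generated = output.outputs[0].text
            tensor = pb_utils.Tensor(
                "text_output", np.array([generated.encode("utf-8")], dtype=np.object_)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[tensor]))
        return responses

    def finalize(self):
        # Called once when Triton unloads the model.
        del self.llm
```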
The Dockerfile uses Nvidia's official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations.
- `replica_concurrency = 128`: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
- `max_replicas = 5`: Scale up to 5 replicas for peak load
The endpoint returns results in this format:

        "name": "text_output",
        "datatype": "BYTES",
        "shape": [1],
        "data": [
          "Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."
        ]
      }
    ]
  }
The response follows Triton's standard inference protocol format, with the generated text returned in the `data` field of the `text_output` output.
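To make the request/response shape concrete, here is a small client sketch using Python's `requests` library against Triton's KServe-style `/v2/models/<name>/infer` route. The URL, model name, and the `text_input` input name are assumptions; substitute your deployment's endpoint and input name from `config.pbtxt` (the `text_output` name matches the response shown above).

```python
import requests

# Placeholder endpoint and model name -- substitute your deployment's URL and model.
URL = "http://localhost:8000/v2/models/llama/infer"

payload = {
    "inputs": [
        {
            "name": "text_input",   # assumed input name; must match config.pbtxt
            "datatype": "BYTES",
            "shape": [1],
            "data": ["Explain machine learning in one paragraph."],
        }
    ],
    "outputs": [{"name": "text_output"}],
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
body = resp.json()

# The generated text arrives in the "data" field of the "text_output" output.
text_output = next(o for o in body["outputs"] if o["name"] == "text_output")
print(text_output["data"][0])
```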
To validate the performance improvements of TensorRT + Triton, we compared it against a vanilla HuggingFace baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (NVIDIA A10 GPU) and were tested under the same load conditions.
**Vanilla Baseline Setup:**
- Model served directly using HuggingFace Transformers with PyTorch
- Single request processing (no batching)
- Standard FastAPI endpoint
- Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
**TensorRT + Triton Setup:**
- TensorRT-LLM with PyTorch backend
- Triton Inference Server with dynamic batching (max batch size: 128)
- Automatic request queuing and batching
Both deployments were tested with the same load testing parameters to ensure a fair comparison.
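The tutorial's exact load-testing harness isn't shown here, but a comparison like this is typically driven by a small concurrent client such as the sketch below, which fires a fixed number of requests at a fixed concurrency and reports throughput, success rate, and latency percentiles. The request count, concurrency, and endpoint are illustrative assumptions, not the benchmark's actual parameters.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v2/models/llama/infer"  # placeholder endpoint
PAYLOAD = {
    "inputs": [
        {"name": "text_input", "datatype": "BYTES", "shape": [1],
         "data": ["Explain machine learning in one paragraph."]}
    ]
}

NUM_REQUESTS = 500   # illustrative values, not the tutorial's settings
CONCURRENCY = 64


def send_request(_):
    # Returns (success, latency in seconds) for a single inference call.
    start = time.perf_counter()
    try:
        ok = requests.post(URL, json=PAYLOAD, timeout=60).status_code == 200
    except requests.RequestException:
        ok = False
    return ok, time.perf_counter() - start


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(send_request, range(NUM_REQUESTS)))
elapsed = time.perf_counter() - start

latencies = sorted(lat for _, lat in results)
successes = sum(ok for ok, _ in results)
q = statistics.quantiles(latencies, n=100)
p50, p95, p99 = q[49], q[94], q[98]

print(f"throughput: {NUM_REQUESTS / elapsed:.1f} req/s")
print(f"success rate: {successes / NUM_REQUESTS:.1%}")
print(f"p50={p50:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
```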
The TensorRT + Triton setup delivers **15x higher throughput** with **100% reliability** compared to the baseline, while reducing latency by **7-9x** across all percentiles. The baseline's 61.6% success rate and high latency come from processing requests sequentially without batching, leading to GPU underutilization and request timeouts. TensorRT + Triton eliminates these issues by keeping the GPU fully utilized with batched, optimized inference, resulting in 100% success rate and consistent, predictable latency.
These results demonstrate that TensorRT + Triton is not just faster, but also more reliable and cost-effective for production LLM serving at scale.
## Get Started
The complete implementation, including all configuration files and deployment scripts, is available in our [GitHub repository](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).
Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for your own models) with TensorRT-LLM and Triton Inference Server. You'll have a production-ready, high-performance LLM serving endpoint in minutes.