
Commit 10aa55f

Merge pull request #255 from CerebriumAI/wesley/update_docs
fix: Add initialisation timeout param to documentation
2 parents: f3bc875 + 7862dbb

2 files changed: +58, -53 lines

toml-reference/toml-reference.mdx

Lines changed: 13 additions & 12 deletions
@@ -15,17 +15,18 @@ The configuration is organized into the following main sections:
 
 The `[cerebrium.deployment]` section defines core deployment settings.
 
-| Option                | Type     | Default                | Description                                                  |
-| --------------------- | -------- | ---------------------- | ------------------------------------------------------------ |
-| name                  | string   | required               | Desired app name                                             |
-| python_version        | string   | "3.12"                 | Python version to use (3.10, 3.11, 3.12)                     |
-| disable_auth          | boolean  | false                  | Disable default token-based authentication on app endpoints  |
-| include               | string[] | ["*"]                  | Files/patterns to include in deployment                      |
-| exclude               | string[] | [".*"]                 | Files/patterns to exclude from deployment                    |
-| shell_commands        | string[] | []                     | Commands to run at the end of the build                      |
-| pre_build_commands    | string[] | []                     | Commands to run before dependencies install                  |
-| docker_base_image_url | string   | "debian:bookworm-slim" | Base Docker image                                            |
-| use_uv                | boolean  | false                  | Use UV for faster Python package installation                |
+| Option                            | Type     | Default                | Description                                                                                                   |
+| --------------------------------- | -------- | ---------------------- | ------------------------------------------------------------------------------------------------------------ |
+| name                              | string   | required               | Desired app name                                                                                              |
+| python_version                    | string   | "3.12"                 | Python version to use (3.10, 3.11, 3.12)                                                                      |
+| disable_auth                      | boolean  | false                  | Disable default token-based authentication on app endpoints                                                   |
+| include                           | string[] | ["*"]                  | Files/patterns to include in deployment                                                                       |
+| exclude                           | string[] | [".*"]                 | Files/patterns to exclude from deployment                                                                     |
+| shell_commands                    | string[] | []                     | Commands to run at the end of the build                                                                       |
+| pre_build_commands                | string[] | []                     | Commands to run before dependencies install                                                                   |
+| docker_base_image_url             | string   | "debian:bookworm-slim" | Base Docker image                                                                                             |
+| use_uv                            | boolean  | false                  | Use UV for faster Python package installation                                                                 |
+| deployment_initialization_timeout | integer  | 600 (10 minutes)       | The max time to wait for app initialisation during build before timing out. Value must be between 60 and 830  |
 
 <Info>
 Changes to python_version or docker_base_image_url trigger full rebuilds since
@@ -57,7 +58,7 @@ use_uv = true
 Check your build logs for these indicators:
 
 - **UV_PIP_INSTALL_STARTED** - UV is successfully being used
-- **PIP_INSTALL_STARTED** - Standard pip installation (when `use_uv=false`)
+- **PIP_INSTALL_STARTED** - Standard pip installation (when `use_uv` is `false`)
 
 <Warning>
 While UV is compatible with most packages, some edge cases may cause build
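
To see the new option in context, here is a minimal `cerebrium.toml` sketch based only on the table above; the app name is a placeholder, `use_uv` is shown opted in, and the remaining values are the documented defaults.

```toml
[cerebrium.deployment]
name = "my-app"            # placeholder app name
python_version = "3.12"    # default
use_uv = true              # opt in to UV for faster package installs
# Max time (in seconds) the build waits for app initialisation before timing out;
# default 600 (10 minutes), allowed range 60-830.
deployment_initialization_timeout = 600
```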

v4/examples/deploy-an-llm-with-tensorrtllm-tritonserver.mdx

Lines changed: 45 additions & 41 deletions
@@ -3,11 +3,10 @@ title: "Deploy Triton Inference server and TensorRT-LLM"
 description: "Achieve high throughput with Triton Inference Server and the TensorRT-LLM framework"
 ---
 
-In this tutorial, we'll show you how to deploy Llama 3.2 3B using TensorRT-LLM's PyTorch backend served through Nvidia Triton Inference Server. 
+In this tutorial, we'll show you how to deploy Llama 3.2 3B using TensorRT-LLM's PyTorch backend served through Nvidia Triton Inference Server.
 
 The TensorRT + Triton setup delivers **15x higher throughput** with **100% reliability** compared to the baseline (vanilla deployment), while reducing latency by **7-9x** across all percentiles. See the [Performance Analysis](#performance-analysis) section for detailed test methodology and results.
 
-
 You can view the final implementation [here](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).
 
 ## Why TensorRT + Triton?
@@ -22,10 +21,11 @@ TensorRT requires you to specify optimization parameters upfront - GPU architect
 
 NVIDIA Triton Inference Server streamlines production AI deployment by handling operational concerns that are critical for serving models at scale. It provides automatic request batching, health checks, metrics collection, and standardized HTTP/gRPC APIs out of the box.
 
-Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning, A/B testing, and can chain multiple models into pipelines. 
+Triton supports multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX, etc.), offers built-in Prometheus metrics for observability, and integrates seamlessly with Kubernetes for auto-scaling. It also supports model versioning, A/B testing, and can chain multiple models into pipelines.
 [Here](https://substackcdn.com/image/fetch/$s_!FEPb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4460ad-0e7e-4545-aee6-274b93dd5959_2300x2304.gif) is a diagram of how Triton works.
 
 Below is the process of how the two work together in terms of handling requests:
+
 1. Client sends text via HTTP/gRPC to Triton
 2. Triton queues the request in the scheduler
 3. Triton batches incoming requests (waits for more or timeout)
@@ -57,7 +57,7 @@ In order to download the model to Cerebrium, you need to be [granted acces](http
 
 ## Implementation
 
-All files should be placed in the same project directory. 
+All files should be placed in the same project directory.
 
 ### Triton Model Configuration
 
@@ -115,6 +115,7 @@ output [
 ```
 
 This configuration tells Triton:
+
 - Use Python backend (runs our model.py)
 - Automatically batch up to 128 requests together for efficient GPU utilization
 - Use dynamic batching with a 100 microsecond queue delay to maximize batch sizes
@@ -126,7 +127,7 @@ This configuration tells Triton:
 
 Triton's Python backend requires implementing a `TritonPythonModel` class with three key methods:
 
-- **`initialize(args)`**: Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration. 
+- **`initialize(args)`**: Called once when Triton loads the model. This is where you load the tokenizer and initialize TensorRT-LLM with your build configuration.
 
 - **`execute(requests)`**: Called every time Triton has a batch ready. Triton automatically batches incoming requests (up to your configured `max_batch_size`) and passes them here. This method extracts prompts from each request, runs batch inference with TensorRT-LLM, and returns responses.
 
@@ -155,81 +156,81 @@ class TritonPythonModel:
         """Initialize TensorRT-LLM with PyTorch backend."""
         print("Loading tokenizer...")
         self.tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
-
+
         print("Initializing TensorRT-LLM...")
         plugin_config = PluginConfig.from_dict({
             "paged_kv_cache": True,
         })
-
+
         build_config = BuildConfig(
             plugin_config=plugin_config,
             max_input_len=4096,
             max_batch_size=128,  # Matches Triton max_batch_size in config.pbtxt
         )
-
+
         self.llm = LLM(
             model=MODEL_DIR,
             build_config=build_config,
             tensor_parallel_size=torch.cuda.device_count(),
         )
         print("✓ Model ready")
-
+
     def execute(self, requests):
         """
         Execute inference on batched requests.
-
+
         Triton automatically batches requests (up to max_batch_size: 128).
         This function processes the batch that Triton provides.
         """
         try:
             prompts = []
             sampling_params_list = []
             original_prompts = []
-
+
             # Extract data from each request in the batch. We need to look through requests: https://github.com/triton-inference-server/python_backend?tab=readme-ov-file#execute
             for request in requests:
                 try:
                     # Get input text - handle batched tensor structures
                     input_tensor = pb_utils.get_input_tensor_by_name(request, "text_input")
                     text_array = input_tensor.as_numpy()
-
+
                     # Extract text handling different array structures
                     if text_array.ndim == 0:
                         text = text_array.item()
                     elif text_array.dtype == object:
                         text = text_array.flat[0] if text_array.size > 0 else text_array.item()
                     else:
                         text = text_array.flat[0] if text_array.size > 0 else text_array.item()
-
+
                     # Decode if bytes
                     if isinstance(text, bytes):
                         text = text.decode('utf-8')
                     elif isinstance(text, np.str_):
                         text = str(text)
-
+
                     # Get optional parameters with defaults
                     max_tokens = 1024
                     if pb_utils.get_input_tensor_by_name(request, "max_tokens") is not None:
                         max_tokens_array = pb_utils.get_input_tensor_by_name(request, "max_tokens").as_numpy()
                         max_tokens = int(max_tokens_array.item() if max_tokens_array.ndim == 0 else max_tokens_array.flat[0])
-
+
                     temperature = 0.8
                     if pb_utils.get_input_tensor_by_name(request, "temperature") is not None:
                         temp_array = pb_utils.get_input_tensor_by_name(request, "temperature").as_numpy()
                         temperature = float(temp_array.item() if temp_array.ndim == 0 else temp_array.flat[0])
-
+
                     top_p = 0.95
                     if pb_utils.get_input_tensor_by_name(request, "top_p") is not None:
                         top_p_array = pb_utils.get_input_tensor_by_name(request, "top_p").as_numpy()
                         top_p = float(top_p_array.item() if top_p_array.ndim == 0 else top_p_array.flat[0])
-
+
                     # Format prompt using chat template
                     prompt = self.tokenizer.apply_chat_template(
                         [{"role": "user", "content": text}],
                         tokenize=False,
                         add_generation_prompt=True
                     )
-
+
                     prompts.append(prompt)
                     original_prompts.append(prompt)
                     sampling_params_list.append(SamplingParams(
@@ -242,23 +243,23 @@ class TritonPythonModel:
                     prompts.append("")
                     original_prompts.append("")
                     sampling_params_list.append(SamplingParams(max_tokens=1024))
-
+
             # Batch inference
             if not prompts:
                 return []
-
+
             outputs = self.llm.generate(prompts, sampling_params_list)
 
             # Create responses
             responses = []
             for i, output in enumerate(outputs):
                 try:
                     generated_text = output.outputs[0].text
-
+
                     # Strip prompt from output if included
                     if original_prompts[i] and original_prompts[i] in generated_text:
                         generated_text = generated_text.replace(original_prompts[i], "").strip()
-
+
                     responses.append(pb_utils.InferenceResponse(
                         output_tensors=[pb_utils.Tensor(
                             "text_output",
@@ -273,9 +274,9 @@ class TritonPythonModel:
                             np.array([f"Error: {str(e)}".encode('utf-8')], dtype=object)
                         )]
                     ))
-
+
             return responses
-
+
         except Exception as e:
             print(f"Error in execute: {e}", flush=True)
             return [
@@ -287,15 +288,14 @@ class TritonPythonModel:
                 )
                 for _ in requests
             ]
-
+
     def finalize(self):
         """Cleanup on shutdown."""
         if hasattr(self, 'llm'):
             self.llm.shutdown()
             torch.cuda.empty_cache()
 ```
 
-
 ### Model Download Script
 
 To download our model, create `download_model.py`:
@@ -315,15 +315,15 @@ MODEL_DIR = Path("/persistent-storage/models") / MODEL_ID
 def download_model():
     """Download model if not already present."""
     hf_token = os.environ.get("HF_AUTH_TOKEN")
-
+
     if not hf_token:
         print("WARNING: HF_AUTH_TOKEN not set")
         return
-
+
     if MODEL_DIR.exists() and any(MODEL_DIR.iterdir()):
         print("✓ Model already exists")
         return
-
+
     print("Downloading model...")
     login(token=hf_token)
     snapshot_download(
@@ -381,7 +381,7 @@ EXPOSE 8000 8001 8002
 CMD ["tritonserver", "--model-repository=/app/model_repository", "--http-port=8000", "--grpc-port=8001", "--metrics-port=8002"]
 ```
 
-The Dockerfile uses Nvidia's official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations. 
+The Dockerfile uses Nvidia's official Triton container with TensorRT-LLM pre-installed, creates the model repository structure that Triton expects, and copies our application files to the correct locations.
 
 ### Deployment Configuration
 
@@ -419,6 +419,7 @@ dockerfile_path = "./Dockerfile"
 ```
 
 Key configuration details:
+
 - `replica_concurrency = 128`: Each replica can handle up to 128 concurrent requests, matching our Triton batch size
 - `max_replicas = 5`: Scale up to 5 replicas for peak load
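
For orientation, a rough `cerebrium.toml` sketch of the settings this hunk refers to; `dockerfile_path`, `replica_concurrency`, and `max_replicas` and their values come from the diff above, while the app name and the `[cerebrium.scaling]` section name are assumptions for illustration, not taken from the file shown here.

```toml
[cerebrium.deployment]
name = "triton-tensorrt-llm"       # assumed app name for illustration
dockerfile_path = "./Dockerfile"   # build from the custom Triton Dockerfile

[cerebrium.scaling]                # section name assumed
replica_concurrency = 128          # matches Triton's max_batch_size in config.pbtxt
max_replicas = 5                   # scale up to 5 replicas for peak load
```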

@@ -477,7 +478,9 @@ The endpoint returns results in this format:
       "name": "text_output",
       "datatype": "BYTES",
       "shape": [1],
-      "data": ["Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."]
+      "data": [
+        "Machine learning is a subset of artificial intelligence (AI) that involves training algorithms..."
+      ]
     }
   ]
 }
@@ -492,12 +495,14 @@ The response follows Triton's standard inference protocol format with the genera
 To validate the performance improvements of TensorRT + Triton, we compared it against a vanilla HuggingFace baseline serving the same Llama 3.2 3B Instruct model. Both deployments used identical hardware (NVIDIA A10 GPU) and were tested under the same load conditions.
 
 **Vanilla Baseline Setup:**
+
 - Model served directly using HuggingFace Transformers with PyTorch
 - Single request processing (no batching)
 - Standard FastAPI endpoint
 - Same hardware configuration (A10 GPU, 4 CPU cores, 40GB memory)
 
 **TensorRT + Triton Setup:**
+
 - TensorRT-LLM with PyTorch backend
 - Triton Inference Server with dynamic batching (max batch size: 128)
 - Automatic request queuing and batching
@@ -507,21 +512,20 @@ Both deployments were tested with the same load testing parameters to ensure fai
 
 ### Results
 
-| Metric | Vanilla Baseline | TensorRT + Triton | Improvement |
-|--------|------------------|-------------------|-------------|
-| **Requests Per Second (RPS)** | 0.83 | 12.46 | **15x faster** |
-| **Success Rate** | 61.6% | 100.0% | **38.4% increase** |
-| **P50 Latency** | 297.7s | 41.7s | **7.1x faster** |
-| **P99 Latency** | 593.2s | 79.3s | **7.5x faster** |
-| **Average Latency** | 376.2s | 42.4s | **8.9x faster** |
-
+| Metric                        | Vanilla Baseline | TensorRT + Triton | Improvement        |
+| ----------------------------- | ---------------- | ----------------- | ------------------ |
+| **Requests Per Second (RPS)** | 0.83             | 12.46             | **15x faster**     |
+| **Success Rate**              | 61.6%            | 100.0%            | **38.4% increase** |
+| **P50 Latency**               | 297.7s           | 41.7s             | **7.1x faster**    |
+| **P99 Latency**               | 593.2s           | 79.3s             | **7.5x faster**    |
+| **Average Latency**           | 376.2s           | 42.4s             | **8.9x faster**    |
 
 The TensorRT + Triton setup delivers **15x higher throughput** with **100% reliability** compared to the baseline, while reducing latency by **7-9x** across all percentiles. The baseline's 61.6% success rate and high latency come from processing requests sequentially without batching, leading to GPU underutilization and request timeouts. TensorRT + Triton eliminates these issues by keeping the GPU fully utilized with batched, optimized inference, resulting in 100% success rate and consistent, predictable latency.
 
 These results demonstrate that TensorRT + Triton is not just faster, but also more reliable and cost-effective for production LLM serving at scale.
 
 ## Get Started
 
-The complete implementation, including all configuration files and deployment scripts, is available in our [GitHub repository](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt). 
+The complete implementation, including all configuration files and deployment scripts, is available in our [GitHub repository](https://github.com/CerebriumAI/examples/tree/master/5-large-language-models/8-faster-inference-with-triton-tensorrt).
 
-Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for your own models) with TensorRT-LLM and Triton Inference Server. You'll have a production-ready, high-performance LLM serving endpoint in minutes. 
+Clone the repository and follow this tutorial to deploy Llama 3.2 3B (or adapt it for your own models) with TensorRT-LLM and Triton Inference Server. You'll have a production-ready, high-performance LLM serving endpoint in minutes.
