Commit a9cde85

Md files need to have only one heading for rst files to show proper titles (#125)

e.g. Python with HuggingFace <../tutorials/Quick_Deploy/HuggingFaceTransformers/README.md> will not show as "Python with HuggingFace" in user guides if the README has multiple headers.
1 parent 2cb9deb · commit a9cde85

File tree
  • Quick_Deploy/HuggingFaceTransformers

1 file changed: +6 −6 lines changed

Quick_Deploy/HuggingFaceTransformers/README.md

Lines changed: 6 additions & 6 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -176,10 +176,10 @@ Using this technique you should be able to serve any transformer models supported by
 hugging face with Triton.


-# Next Steps
+## Next Steps
 The following sections expand on the base tutorial and provide guidance for future sandboxing.

-## Loading Cached Models
+### Loading Cached Models
 In the previous steps, we downloaded the falcon-7b model from hugging face when we
 launched the Triton server. We can avoid this lengthy download process in subsequent runs
 by loading cached models into Triton. By default, the provided `model.py` files will cache
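
The cached-models passage in the hunk above hinges on mounting the local Hugging Face cache into the serving container. Below is a minimal sketch of what the full `docker run` could look like with that mount in place; the image tag, port mappings, and repository paths are illustrative assumptions (the tutorial's own command is not shown in this diff), and the image is assumed to already contain the tutorial's Python dependencies.

```bash
# Sketch only: image tag, ports, and paths are assumptions, not from this commit.
docker run --gpus all --rm -it \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v ${PWD}/model_repository:/models \
  -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
```

With the cache mounted, a model downloaded on the first run is reused on later launches instead of being re-downloaded.
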
@@ -206,14 +206,14 @@ command from earlier (making sure to replace `${HOME}` with the path to your ass
 -v ${HOME}/.cache/huggingface:/root/.cache/huggingface
 ```

-## Triton Tool Ecosystem
+### Triton Tool Ecosystem
 Deploying models in Triton also comes with the benefit of access to a fully-supported suite
 of deployment analyzers to help you better understand and tailor your systems to fit your
 needs. Triton currently has two options for deployment analysis:
 - [Performance Analyzer](https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton-inference-server-2310/user-guide/docs/user_guide/perf_analyzer.html): An inference performance optimizer.
 - [Model Analyzer](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_analyzer.html) A GPU memory and compute utilization optimizer.

-### Performance Analyzer
+#### Performance Analyzer
 To use the performance analyzer, please remove the persimmon8b model from `model_repository` and restart
 the Triton server using the `docker run` command from above.
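
To make the Performance Analyzer step touched in this hunk concrete, here is a minimal sketch of an invocation from an environment where `perf_analyzer` is installed (for example the Triton SDK container); the model name and sweep values are assumptions based on the tutorial's falcon model, not part of this commit.

```bash
# Sketch only: model name and sweep values are illustrative assumptions.
perf_analyzer -m falcon7b \
  --concurrency-range 1:4 \
  --measurement-interval 10000
```

For models whose inputs are strings, `--input-data` can point at a JSON file of sample prompts so the generated load resembles real requests.
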

@@ -289,7 +289,7 @@ guide.
 For more information regarding dynamic batching in Triton, please see [this](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#dynamic-batcher)
 guide.

-### Model Analyzer
+#### Model Analyzer

 In the performance analyzer section, we used intuition to increase our throughput by changing
 a subset of variables and measuring the difference in performance. However, we only changed
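
The dynamic batching link in the hunk above refers to Triton's server-side request batcher, which is enabled per model in its `config.pbtxt`. A minimal sketch of turning it on is shown below as a shell snippet; the model name and queue delay are illustrative assumptions, not from this commit, and the model's `max_batch_size` must already be greater than zero.

```bash
# Appends a dynamic_batching block to an existing config.pbtxt.
# "falcon7b" and the delay value are assumptions, not from this commit.
cat >> model_repository/falcon7b/config.pbtxt <<'EOF'
dynamic_batching {
  max_queue_delay_microseconds: 100
}
EOF
```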

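For the Model Analyzer section introduced at the end of this hunk, here is a minimal sketch of a profiling run; the paths and model name are illustrative assumptions, not part of this commit.

```bash
# Sketch only: paths and model name are assumptions.
# Sweeps candidate configurations for the model and writes them to the output repository.
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models falcon7b \
  --output-model-repository-path /tmp/model_analyzer_output
```

Model Analyzer then reports which generated configuration best meets the measured latency and throughput, automating the manual sweep described in the Performance Analyzer section.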