Commit 30b08ba
Update README and versions for 21.06 release (#3035)
1 parent 19dd635

1 file changed: README.md (+294 -2)

# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
GRPC protocol that allows remote clients to request inferencing for
any model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an
application.

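For example, a remote client can use the Python client library to make an
HTTP/REST inference request. The snippet below is a minimal sketch rather
than an excerpt from the Triton documentation: the model name
`example_model`, its tensor names, shape, and datatype are placeholders,
and the server is assumed to be listening on the default HTTP port 8000.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton server (8000 is the default HTTP port).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor for a hypothetical model taking a 1x16 FP32 input.
input_data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Ask for a specific output tensor by name.
requested_output = httpclient.InferRequestedOutput("OUTPUT0")

# Send the request and read the result back as a numpy array.
response = client.infer("example_model", inputs=[infer_input],
                        outputs=[requested_output])
print(response.as_numpy("OUTPUT0"))
```

The same request can be made over GRPC with `tritonclient.grpc`, or through
the C API when Triton is linked directly into the application.
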
The current release of the Triton Inference Server is 2.11.0 and
corresponds to the 21.06 release of the tritonserver container on
[NVIDIA GPU Cloud (NGC)](https://ngc.nvidia.com). The branch for this
release is
[r21.06](https://github.com/triton-inference-server/server/tree/r21.06).

## What's New in 2.11.0

* The [Forest Inference Library (FIL)](https://github.com/triton-inference-server/fil_backend)
  backend is added to Triton. The FIL backend allows forest models trained by
  several popular machine learning frameworks (including XGBoost, LightGBM,
  Scikit-Learn, and cuML) to be deployed in Triton.

* The Windows version of Triton now includes the
  [OpenVINO backend](https://github.com/triton-inference-server/openvino_backend).

* The Performance Analyzer (perf_analyzer) now supports testing against the
  Triton C API.

* The Python backend now allows the use of conda to create a unique execution
  environment for your Python model. See
  https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments.
  A sketch of a Python backend model is shown after this list.

* Python models that crash or exit unexpectedly are now automatically restarted
  by Triton.

* Model repositories in S3 storage can now be accessed using HTTPS protocol. See
  https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#s3
  for more information.

* Triton now collects GPU metrics for MIG partitions.

* Passive model instances can now be specified in the model configuration. A
  passive model instance will be loaded and initialized by Triton, but no
  inference requests will be sent to the instance. Passive instances are
  typically used by a custom backend that uses its own mechanisms to distribute
  work to the passive instances. See the ModelInstanceGroup section of
  [model_config.proto](https://github.com/triton-inference-server/common/blob/r21.06/protobuf/model_config.proto)
  for the setting; a configuration sketch is shown after this list.

* NVDLA support is added to the TensorRT backend.

* ONNX Runtime version updated to 1.8.0.

* Windows build documentation simplified and improved.

* Improved detailed and summary reports in Model Analyzer.

* Added an offline mode to Model Analyzer.

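The Python backend referenced above runs models defined by a `model.py` that
implements the `TritonPythonModel` interface. The following is a minimal
sketch, not an official example: the model and tensor names are placeholders
and the computation is trivial.

```python
# model.py -- placed at <model-repository>/<model-name>/1/model.py
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model configuration, name, version, etc.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Exactly one response must be returned per request.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Trivial computation for illustration: double the input.
            output0 = pb_utils.Tensor("OUTPUT0", input0.as_numpy() * 2)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Called when the model is unloaded; release resources here.
        pass
```

Per the python_backend documentation linked above, such a model can be pointed
at a conda-pack'ed environment through an `EXECUTION_ENV_PATH` parameter in its
`config.pbtxt`, and if the Python process crashes Triton restarts it
automatically.
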
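For the passive-instance setting, the sketch below writes a `config.pbtxt`
with one normal and one passive CPU instance group. It is a hedged
illustration: the `passive` field follows the ModelInstanceGroup message in
model_config.proto, while the model name, backend name, and counts are
placeholders and the input/output declarations are omitted.

```python
# Sketch: generate a config.pbtxt declaring a passive instance group.
from pathlib import Path

PASSIVE_CONFIG = """
name: "sharded_custom_model"
backend: "my_custom_backend"   # placeholder custom backend

instance_group [
  {
    count: 1
    kind: KIND_CPU
  },
  {
    count: 2
    kind: KIND_CPU
    # Loaded and initialized by Triton, but no requests are scheduled to it;
    # the custom backend distributes work to these instances itself.
    passive: true
  }
]
"""

out = Path("model_repository/sharded_custom_model/config.pbtxt")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(PASSIVE_CONFIG.lstrip())
```
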
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs (see the
  configuration sketch after this list).

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference; see the configuration sketch after this list.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). Triton
  *ensembles* represent a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

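As referenced in the list above, both concurrent model execution and dynamic
batching are enabled through the model configuration. The snippet below is a
minimal sketch that writes such a `config.pbtxt`; the model name, platform,
instance count, and batch sizes are illustrative assumptions rather than
recommendations, and the input/output declarations are omitted.

```python
# Sketch: emit a config.pbtxt that runs two GPU instances of a model
# concurrently and turns on dynamic batching.
from pathlib import Path

CONFIG = """
name: "example_model"
platform: "tensorrt_plan"
max_batch_size: 16

instance_group [
  {
    count: 2        # two copies of the model execute concurrently
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
"""

config_path = Path("model_repository/example_model/config.pbtxt")
config_path.parent.mkdir(parents=True, exist_ok=True)
config_path.write_text(CONFIG.lstrip())
```
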
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries and Examples](https://github.com/triton-inference-server/client)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

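The layout Triton expects is `<model-repository>/<model-name>/<version>/<model-file>`,
with an optional `config.pbtxt` next to the version directories. The short
sketch below creates such a skeleton for a hypothetical ONNX model; the
repository path and model name are placeholders and the files created are
empty stand-ins for real artifacts.

```python
# Sketch: create a minimal model-repository skeleton for an ONNX model.
from pathlib import Path

repo = Path("model_repository")
version_dir = repo / "densenet_onnx" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# The model file itself lives inside the numeric version directory.
(version_dir / "model.onnx").touch()

# config.pbtxt sits next to the version directories. For some model types
# Triton can derive the configuration automatically; otherwise provide one.
(repo / "densenet_onnx" / "config.pbtxt").touch()

# Then point Triton at the repository, for example:
#   tritonserver --model-repository=$(pwd)/model_repository
```
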
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](https://github.com/triton-inference-server/client) provide
APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also send
HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).

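Complementing the HTTP example earlier in this README, the following is a
minimal sketch of the GRPC flavor of the Python client, used here only for
health checks and model metadata; the model name and the default GRPC port
8001 are assumptions.

```python
import tritonclient.grpc as grpcclient

# Connect to a running Triton server (8001 is the default GRPC port).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Basic liveness and readiness checks.
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())

# Inspect a hypothetical model that the server is serving.
if client.is_model_ready("example_model"):
    print(client.get_model_metadata("example_model"))
```
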
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

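As a quick way to look at those aggregate metrics, the sketch below scrapes
the Prometheus text endpoint with plain `requests`; it assumes the default
metrics port 8002 and filters a couple of metric families by prefix rather
than using a Prometheus client library.

```python
import requests

# Triton serves Prometheus-format metrics; 8002 is the default metrics port.
metrics_text = requests.get("http://localhost:8002/metrics", timeout=5).text

# Rough filter: print successful-inference counts and GPU utilization samples.
for line in metrics_text.splitlines():
    if line.startswith(("nv_inference_request_success", "nv_gpu_utilization")):
        print(line)
```
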
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment there are simple examples of how
to deploy Triton using Kubernetes and Helm, one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).

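When the server is started with explicit model control (see the model
management documentation), the client libraries can drive loading and
unloading directly. A minimal sketch, assuming the server was launched with
`--model-control-mode=explicit` and that the repository contains a
hypothetical `example_model`:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Explicitly load a model from the repository, check it, then unload it.
client.load_model("example_model")
print("ready:", client.is_model_ready("example_model"))
client.unload_model("example_model")
```
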
The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton from previously using version 1.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Repository Agents](docs/repository_agents.md)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it
easy to communicate with the server. There are also a large number of
[example clients](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also develop your
own clients that directly communicate with Triton using [HTTP/REST or
GRPC protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that is not modifying the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.
