Commit b4d6cc2

Update README for 21.04 release

1 parent 1cb1f5f
1 file changed: README.md (+300 / -2 lines)

@@ -30,5 +30,303 @@

# Triton Inference Server

-**NOTE: You are currently on the r21.04 branch which tracks stabilization
-towards the next release. This branch is not usable during stabilization.**

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports HTTP/REST and GRPC
protocols that allow remote clients to request inferencing for any
model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an application.
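
As a quick illustration of the HTTP/REST path, the sketch below sends one
inference request using the Python client library. The server address, the
model name `my_model`, the tensor names `INPUT0`/`OUTPUT0`, and the input shape
are placeholders rather than a shipped example; substitute the values from your
own model configuration.

```python
# Minimal sketch: HTTP/REST inference with the Python client library.
# Assumes a Triton server on localhost:8000 serving a hypothetical model
# "my_model" with one FP32 input "INPUT0" of shape [1, 4] and output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 4).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[inp],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
```

The GRPC client in `tritonclient.grpc` follows the same pattern, by default on
port 8001.
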
## What's New in 2.9.0

* Python backend performance has been improved significantly.

* ONNX Runtime updated to version 1.7.1.

* Triton Server is now available as a GKE Marketplace application. See
  https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app.

* The GRPC client libraries now allow compression to be enabled (see the
  sketch after this list).

* Ragged batching is now supported for TensorFlow models.

* For TensorFlow models represented with SavedModel format, it is now possible
  to choose which graph and signature_def to load. See
  https://github.com/triton-inference-server/tensorflow_backend/tree/r21.04#parameters.

* A Helm chart example has been added for AWS. See
  https://github.com/triton-inference-server/server/tree/master/deploy/aws.

* The Model Control API is enhanced to provide an option when unloading an
  ensemble model. The option allows all contained models to be unloaded as part
  of unloading the ensemble. See
  https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#model-repository-extension.

* Model reloading using the Model Control API previously resulted in the model
  being unavailable for a short period of time. This is now fixed so that the
  model remains available during reloading.

* Latency statistics and metrics for TensorRT models are fixed. Previously the
  sum of the "compute input", "compute infer" and "compute output" times
  correctly reflected the total compute time, but that time could be attributed
  incorrectly across the three components. The attribution is now accurate.

* Error reporting is improved for the Azure, S3 and GCS cloud file system
  support.

* Fixed trace support for ensembles. The models contained within an ensemble
  are now traced correctly.

* Model Analyzer improvements:

  * The summary report now includes GPU power usage.
  * Model Analyzer can find the top N model configurations across multiple
    models.
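
For the GRPC compression note above, the sketch below shows what enabling
compression could look like from the Python GRPC client. The
`compression_algorithm` keyword and the model details are assumptions for
illustration only; consult the client library documentation for the exact
interface your version provides.

```python
# Hedged sketch: enabling compression on a GRPC inference call.
# The model name, tensor name, shape, and the exact keyword argument
# ("compression_algorithm") are assumptions; check the tritonclient docs.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

data = np.zeros((1, 4), dtype=np.float32)
inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# Request gzip compression for this call (assumed values: "gzip", "deflate").
result = client.infer("my_model", inputs=[inp], compression_algorithm="gzip")
print(result.as_numpy("OUTPUT0"))
```
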
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference (see the sketch after this list).

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). A Triton
  *ensemble* represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.
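
The sketch referenced from the dynamic-batching item above issues several
independent requests at once. If the (hypothetical) model's `config.pbtxt`
enables dynamic batching, Triton can merge these requests into larger batches
on the server side with no change to the client code; the names, shape, and
request count here are illustrative only.

```python
# Minimal sketch: concurrent requests that Triton's dynamic batcher may combine.
# Assumes a hypothetical model "my_model" (FP32 input "INPUT0", shape [1, 4]) whose
# config.pbtxt enables dynamic batching, e.g. a "dynamic_batching { ... }" block.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_inputs():
    data = np.random.rand(1, 4).astype(np.float32)
    inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return [inp]

# Issue eight requests without waiting for each to finish; batching happens
# server-side and transparently, so each caller still gets its own response.
pending = [client.async_infer("my_model", make_inputs()) for _ in range(8)]
results = [p.get_result() for p in pending]
print(len(results), "responses received")
```

Server-side batching preferences (such as preferred batch sizes and the maximum
queue delay) live in the model configuration, so throughput tuning does not
require client changes.
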
## Documentation

**The master branch documentation tracks the upcoming,
under-development release and so may not be accurate for the current
release of Triton. See the [r21.03
documentation](https://github.com/triton-inference-server/server/tree/r21.03#documentation)
for the current release.**

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries](docs/client_libraries.md)
- [Client Examples](docs/client_examples.md)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).
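
When following the quickstart, a quick way to confirm that the server came up
and the example model loaded is the client library's health API. A minimal
sketch (the model name is a placeholder):

```python
# Minimal sketch: server and model readiness checks.
# "my_model" is a hypothetical model name; use one from your model repository.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("my_model"))
```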

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md), you will need
to make sure they are loaded correctly by Triton.
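
As a concrete illustration, the sketch below creates a minimal repository
layout and writes a bare-bones `config.pbtxt`. The model name, platform, tensor
names, and dimensions are placeholders; see the [model
repository](docs/model_repository.md) and [model
configuration](docs/model_configuration.md) documentation for the
authoritative layout and fields.

```python
# Minimal sketch of a model repository (hypothetical model "my_model" in ONNX format).
# The serialized model file (e.g. model.onnx) goes into a numbered version directory;
# config.pbtxt sits beside the version directories. Field values are illustrative.
from pathlib import Path

repo = Path("model_repository")              # passed to tritonserver --model-repository
(repo / "my_model" / "1").mkdir(parents=True, exist_ok=True)
# place the serialized model at model_repository/my_model/1/model.onnx

(repo / "my_model" / "config.pbtxt").write_text("""\
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
dynamic_batching { }
""")
```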

After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](docs/client_libraries.md) provide
[APIs](docs/client_libraries.md#client-library-apis) to simplify this
communication. There are also a large number of [client
examples](docs/client_examples.md) that demonstrate how to use the
libraries. You can also send HTTP/REST requests directly to Triton
using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](docs/client_libraries.md).
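
For cases where you talk to Triton without the client libraries, the sketch
below posts a raw JSON request to the KFServing-style
`/v2/models/<name>/infer` endpoint using only the Python standard library. The
model, tensor name, shape, and data are placeholders.

```python
# Minimal sketch: a raw HTTP/REST inference request using the JSON-based protocol.
# Hypothetical model "my_model" with one FP32 input "INPUT0" of shape [1, 4].
import json
import urllib.request

body = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],   # row-major, flattened
        }
    ]
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/my_model/infer",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["outputs"])
```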

Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.
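
As a small illustration of the metrics endpoint, the sketch below scrapes the
Prometheus-format text that Triton serves (by default on port 8002) and prints
the inference-count lines. The metric name filter is just an example; see
[Metrics](docs/metrics.md) for the full list.

```python
# Minimal sketch: read Triton's Prometheus metrics endpoint (default port 8002).
import urllib.request

text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode("utf-8")
for line in text.splitlines():
    # e.g. per-model inference counts; adjust the prefix to the metric you care about
    if line.startswith("nv_inference_count"):
        print(line)
```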

NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment, there are simple examples of how
to deploy Triton using Kubernetes and Helm: one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).
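
When the server is started with explicit model control
(`--model-control-mode=explicit`), loading and unloading can be driven from the
client libraries. A minimal sketch with a placeholder model name:

```python
# Minimal sketch: explicit model loading/unloading via the model control API.
# Requires tritonserver to be started with --model-control-mode=explicit.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model("my_model")                 # hypothetical model name
print("ready:", client.is_model_ready("my_model"))
client.unload_model("my_model")
```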

The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton from version 1.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Repository Agents](docs/repository_agents.md)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](docs/client_libraries.md) that make it easy to communicate with
the server. There are also a large number of [example
clients](docs/client_examples.md) that demonstrate how to use the
libraries. You can also develop your own clients that directly
communicate with Triton using [HTTP/REST or GRPC
protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).
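
As a sketch of the backend API's Python flavor, the file below is the shape of
a `model.py` that the [Python
backend](https://github.com/triton-inference-server/python_backend) loads from
a model's version directory (it is not run standalone). The tensor names and
the doubling logic are placeholders, and the model also needs a `config.pbtxt`
that declares the Python backend and matching inputs/outputs.

```python
# Minimal sketch of a Python-backend model: model_repository/my_model/1/model.py
# Loaded by Triton's Python backend; tensor names here are placeholders and must
# match the model's config.pbtxt.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the serialized model config, instance kind, etc.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy() * 2.0)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        # called when the model is unloaded
        pass
```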

A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  whether you can strip external dependencies and still show the
  problem. The less time we spend reproducing problems, the more time
  we have to fix them.

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.
