Commit 39b9337 (parent: 3f77dae)

Update README for 21.09 release

1 file changed: README.md (+272, −2). This commit removes the r21.09 stabilization note ("**NOTE: You are currently on the r21.09 branch which tracks stabilization towards the next release. This branch is not usable during stabilization.**") and adds the README content below.

# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
GRPC protocol that allows remote clients to request inferencing for
any model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an
application.

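As a quick illustration of the HTTP/REST protocol, the following is a minimal sketch that sends one inference request using the `tritonclient` Python package (installable from PyPI; see the client SDK note below). The model name and tensor names are placeholders for whatever is in your model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton server (HTTP endpoint defaults to port 8000).
client = httpclient.InferenceServerClient(url="localhost:8000")

# "my_model", "INPUT0" and "OUTPUT0" are hypothetical; use the names from
# your model's configuration.
input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

response = client.infer(
    model_name="my_model",
    inputs=[input0],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(response.as_numpy("OUTPUT0"))
```
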
## What's New in 2.14.0

* A full-featured beta version of
  [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend/tree/r21.09#business-logic-scripting-beta)
  has been released (see the sketch following this list).

* A beta version of a basic Java client has been released. See
  <https://github.com/triton-inference-server/client/tree/r21.09/src/java> for a
  list of supported features.

* A stack trace is now printed when Triton crashes to aid in debugging.

* The [Triton Client SDK](https://github.com/triton-inference-server/client/tree/r21.09#download-using-python-package-installer-pip)
  wheel file is now available directly from PyPI for both Ubuntu and Windows.

* The TensorRT backend is now an optional part of Triton, just like all the
  other backends. The [compose utility](docs/compose.md)
  can be used to create a Triton container that does not contain the TensorRT
  backend.

* Model Analyzer can profile with perf_analyzer's C API.

* Model Analyzer can use the CUDA device index, in addition to the GPU UUID, in
  the `--gpus` flag.

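The following is a minimal, hypothetical sketch of how BLS lets a Python model call another model from inside its `execute` function. The model and tensor names are placeholders and error handling is omitted; see the python_backend documentation linked above for the complete API.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Forward this request's "INPUT0" tensor to another model
            # ("segmentation_model" is hypothetical) using BLS.
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            bls_request = pb_utils.InferenceRequest(
                model_name="segmentation_model",
                requested_output_names=["OUTPUT0"],
                inputs=[input0],
            )
            bls_response = bls_request.exec()
            output0 = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses
```
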
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend/tree/r21.09)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/r21.09/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). A Triton
  *ensemble* represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format (see the sketch following this list).

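Because the metrics are exposed in Prometheus text format, they can be scraped with any HTTP client. The sketch below assumes a Triton server running locally with the metrics endpoint on its default port (8002) and simply prints the GPU utilization samples.

```python
import requests

# Triton exposes Prometheus-format metrics on port 8002 by default.
metrics_text = requests.get("http://localhost:8002/metrics").text

# Print only the GPU utilization gauge(s); other metrics cover request
# counts, latencies, queue time, and more (see docs/metrics.md).
for line in metrics_text.splitlines():
    if line.startswith("nv_gpu_utilization"):
        print(line)
```
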
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

* [QuickStart](docs/quickstart.md)
  * [Install](docs/quickstart.md#install-triton-docker-image)
  * [Run](docs/quickstart.md#run-triton)
* [Model Repository](docs/model_repository.md)
* [Model Configuration](docs/model_configuration.md)
* [Model Management](docs/model_management.md)
* [Custom Operations](docs/custom_operations.md)
* [Client Libraries and Examples](https://github.com/triton-inference-server/client)
* [Optimization](docs/optimization.md)
  * [Model Analyzer](docs/model_analyzer.md)
  * [Performance Analyzer](docs/perf_analyzer.md)
* [Metrics](docs/metrics.md)
* [Jetson and JetPack](docs/jetson.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

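For orientation, the sketch below creates the directory layout that a model repository follows: one directory per model, a numeric subdirectory per model version, and an optional `config.pbtxt` at the model level. The model name and file name are hypothetical; see [docs/model_repository.md](docs/model_repository.md) for the authoritative layout rules.

```python
from pathlib import Path

# Hypothetical model repository with a single ONNX model named "densenet_onnx".
repo = Path("model_repository")
version_dir = repo / "densenet_onnx" / "1"      # "1" is the model version
version_dir.mkdir(parents=True, exist_ok=True)

# The serialized model goes inside the version directory
# (copy your real model file here, e.g. model_repository/densenet_onnx/1/model.onnx),
# and an optional configuration sits next to the version directories.
(repo / "densenet_onnx" / "config.pbtxt").touch()
```

When starting `tritonserver`, point it at this directory with the `--model-repository` option.
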
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](https://github.com/triton-inference-server/client) provide
APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also send
HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).

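Beyond inference requests (see the earlier example), the client libraries also expose health and metadata APIs. A minimal sketch, assuming a local server and a hypothetical model name:

```python
import tritonclient.grpc as grpcclient

# GRPC endpoint defaults to port 8001; use tritonclient.http for HTTP/REST.
client = grpcclient.InferenceServerClient(url="localhost:8001")

print(client.is_server_live())                 # liveness check
print(client.is_server_ready())                # readiness check
print(client.is_model_ready("my_model"))       # "my_model" is hypothetical
print(client.get_model_metadata("my_model"))   # input/output names, shapes, datatypes
```
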
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment there are simple examples of how
to deploy Triton using Kubernetes and Helm, one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).

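When the server is running with explicit model control enabled, loading and unloading can be driven from the client libraries. A minimal sketch (the model name is hypothetical, and the server must have been started in explicit model control mode; see [docs/model_management.md](docs/model_management.md)):

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Ask the server to load a model from its repository, use it, then unload it.
client.load_model("my_model")              # "my_model" is hypothetical
assert client.is_model_ready("my_model")
# ... send inference requests ...
client.unload_model("my_model")
```
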
The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton after previously using version 1.

### Developer Documentation

* [Build](docs/build.md)
* [Protocols and APIs](docs/inference_protocols.md)
* [Backends](https://github.com/triton-inference-server/backend)
* [Repository Agents](docs/repository_agents.md)
* [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it
easy to communicate with the server. There are also a large number of
[example clients](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also develop your
own clients that directly communicate with Triton using [HTTP/REST or
GRPC protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

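For the Python case, a backend model is a `model.py` implementing the `TritonPythonModel` interface. The sketch below is a minimal, hypothetical example (tensor names are placeholders) that simply copies its input to its output; the BLS sketch earlier in this README builds on the same interface.

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """Minimal Python backend model that copies INPUT0 to OUTPUT0."""

    def initialize(self, args):
        # args includes the model name, version, and configuration.
        self.model_name = args["model_name"]

    def execute(self, requests):
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output0 = pb_utils.Tensor("OUTPUT0", input0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Called when the model is unloaded; release any resources here.
        pass
```
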
A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that is not modifying the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (<https://stackoverflow.com/help/mcve>)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems, the more time we have to
  fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.
