
Commit a31b118: Update README and versions for 22.04 release (1 parent: d085dac)

1 file changed: README.md (+342, -2 lines)

# Triton Inference Server

Removed (the r22.04 stabilization notice):

**NOTE: You are currently on the r22.04 branch which tracks stabilization
towards the next release. This branch is not usable during stabilization.**

Added (the 22.04 release README content):

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and GRPC
protocol that allows remote clients to request inferencing for any model
being managed by the server. For edge deployments, Triton is available as a
shared library with a C API that allows the full functionality of Triton to
be included directly in an application.
## What's New in 2.21.0

* Users can now specify a
  [customized temp directory](https://github.com/triton-inference-server/server/blob/r22.04/build.py#L1603-L1609)
  with the `--tmp-dir` argument to `build.py` during the container build.

* Users can now
  [send a raw binary request](https://github.com/triton-inference-server/server/blob/r22.04/docs/protocol/extension_binary_data.md#raw-binary-request)
  to eliminate the need to specify the inference header (a minimal request
  sketch follows this list).

* Ensembles now recognize
  [optional inputs](https://github.com/triton-inference-server/common/blob/r22.04/protobuf/model_config.proto#L402-L408).

* Users can now add custom metrics to the existing Triton metrics endpoint in
  their custom backends and applications using the Triton C API. Documentation
  can be found [here](https://github.com/triton-inference-server/server/blob/main/docs/metrics.md#custom-metrics).

* Official support for multiple cloud repositories. Repositories may be hosted
  by the same or by different cloud storage providers; for example, a single
  instance of Triton can load models from two S3 buckets, two GCS buckets and
  two Azure Storage containers.

* The ONNX Runtime backend now
  [uses execution providers when available if autocomplete is enabled](https://github.com/triton-inference-server/onnxruntime_backend/commit/85d30dffd2a724e2e2e26861010abd246c960e09).
  This fixes the previous behavior, where the CPU execution provider was
  always used.

* build.py and compose.py now support the
  [PyTorch](https://github.com/triton-inference-server/server/commit/821cdfef03da2066cda32b71dc63d74c441fd08c)
  and
  [TensorFlow 1](https://github.com/triton-inference-server/server/commit/4f1043aa3a802175cd663842e72bcb2915bcb1e0)
  backends for CPU-only builds.

* Refer to the 22.04 column of the
  [Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
  for container image versions on which the 22.04 inference server container
  is based.
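
The raw binary request extension can be exercised with any HTTP client. Below
is a minimal sketch, not taken from the release notes, assuming the `requests`
Python package, a local Triton at `localhost:8000`, and a hypothetical model
`my_model` with a single input whose metadata Triton can deduce from the byte
size; see the binary data extension document linked above for the exact rules.

```python
# Hypothetical sketch: POST raw binary tensor data with no JSON inference
# header. "my_model" and the tensor contents are placeholders.
import numpy as np
import requests

payload = np.arange(16, dtype=np.float32).tobytes()

resp = requests.post(
    "http://localhost:8000/v2/models/my_model/infer",
    # A zero-length inference header marks this as a raw binary request.
    headers={"Inference-Header-Content-Length": "0"},
    data=payload,
)
resp.raise_for_status()
print(len(resp.content), resp.headers.get("Inference-Header-Content-Length"))
```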
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton can
  manage any number and mix of models (limited by system disk and memory
  resources). Triton supports TensorRT, TensorFlow GraphDef, TensorFlow
  SavedModel, ONNX, PyTorch TorchScript and OpenVINO model formats. Both
  TensorFlow 1.x and TensorFlow 2.x are supported. Triton also supports
  TensorFlow-TensorRT and ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple models
  (or multiple instances of the same model) can run simultaneously on the same
  GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For models
  that support batching, Triton implements multiple scheduling and batching
  algorithms that combine individual inference requests together to improve
  inference throughput. These scheduling and batching decisions are
  transparent to the client requesting inference.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In addition
  to deep-learning frameworks, Triton provides a *backend API* that allows
  Triton to be extended with any model execution logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend) or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent execution,
  dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). Triton *ensembles*
  represent a pipeline of one or more models and the connection of input and
  output tensors between those models. A single inference request to an
  ensemble will trigger the execution of the entire pipeline.

* [HTTP/REST and GRPC inference protocols](docs/inference_protocols.md) based
  on the community developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).
  A minimal Python client sketch follows this list.

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be linked
  directly into your application for edge and other in-process use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server throughput,
  and server latency. The metrics are provided in Prometheus data format.
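
As a concrete illustration of the HTTP/REST protocol and client-library
features above, here is a minimal sketch using the `tritonclient` Python
package (an assumption of this example, installable as `tritonclient[http]`);
the model and tensor names are placeholders that must match the served model's
configuration.

```python
# Hypothetical sketch: a single inference request over HTTP/REST using the
# Python client library. "my_model", "INPUT0" and "OUTPUT0" are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [httpclient.InferInput("INPUT0", [1, 16], "FP32")]
inputs[0].set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0"))
```

A GRPC variant with the same call pattern is available in `tritonclient.grpc`
against the default GRPC port 8001.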
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level overview of the
structure and capabilities of the inference server. There is also an
[FAQ](docs/faq.md). Additional documentation is divided into
[*user*](#user-documentation) and [*developer*](#developer-documentation)
sections. The *user* documentation describes how to use Triton as an inference
solution, including information on how to configure Triton, how to organize
and configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and also how
Triton can be extended with new functionality.
### User Documentation

* [QuickStart](docs/quickstart.md)
  * [Install Triton](docs/quickstart.md#install-triton-docker-image)
  * [Create Model Repository](docs/quickstart.md#create-a-model-repository)
  * [Run Triton](docs/quickstart.md#run-triton)
* [Model Repository](docs/model_repository.md)
  * [Cloud Storage](docs/model_repository.md#model-repository-locations)
  * [File Organization](docs/model_repository.md#model-files)
  * [Model Versioning](docs/model_repository.md#model-versions)
* [Model Configuration](docs/model_configuration.md)
  * [Required Model Configuration](docs/model_configuration.md#minimal-model-configuration)
    * [Maximum Batch Size - Batching and Non-Batching Models](docs/model_configuration.md#maximum-batch-size)
    * [Input and Output Tensors](docs/model_configuration.md#inputs-and-outputs)
      * [Tensor Datatypes](docs/model_configuration.md#datatypes)
      * [Tensor Reshape](docs/model_configuration.md#reshape)
      * [Shape Tensor](docs/model_configuration.md#shape-tensors)
  * [Auto-Generate Required Model Configuration](docs/model_configuration.md#auto-generated-model-configuration)
  * [Version Policy](docs/model_configuration.md#version-policy)
  * [Instance Groups](docs/model_configuration.md#instance-groups)
    * [Specifying Multiple Model Instances](docs/model_configuration.md#multiple-model-instances)
    * [CPU and GPU Instances](docs/model_configuration.md#cpu-model-instance)
    * [Configuring Rate Limiter](docs/model_configuration.md#rate-limiter-configuration)
  * [Optimization Settings](docs/model_configuration.md#optimization_policy)
    * [Framework-Specific Optimization](docs/optimization.md#framework-specific-optimization)
      * [ONNX-TensorRT](docs/optimization.md#onnx-with-tensorrt-optimization)
      * [ONNX-OpenVINO](docs/optimization.md#onnx-with-openvino-optimization)
      * [TensorFlow-TensorRT](docs/optimization.md#tensorflow-with-tensorrt-optimization)
      * [TensorFlow-Mixed-Precision](docs/optimization.md#tensorflow-automatic-fp16-optimization)
    * [NUMA Optimization](docs/optimization.md#numa-optimization)
  * [Scheduling and Batching](docs/model_configuration.md#scheduling-and-batching)
    * [Default Scheduler - Non-Batching](docs/model_configuration.md#default-scheduler)
    * [Dynamic Batcher](docs/model_configuration.md#dynamic-batcher)
      * [How to Configure Dynamic Batcher](docs/model_configuration.md#recommended-configuration-process)
        * [Delayed Batching](docs/model_configuration.md#delayed-batching)
        * [Preferred Batch Size](docs/model_configuration.md#preferred-batch-sizes)
      * [Preserving Request Ordering](docs/model_configuration.md#preserve-ordering)
      * [Priority Levels](docs/model_configuration.md#priority-levels)
      * [Queuing Policies](docs/model_configuration.md#queue-policy)
      * [Ragged Batching](docs/ragged_batching.md)
    * [Sequence Batcher](docs/model_configuration.md#sequence-batcher)
      * [Stateful Models](docs/architecture.md#stateful-models)
      * [Control Inputs](docs/architecture.md#control-inputs)
      * [Implicit State - Stateful Inference Using a Stateless Model](docs/architecture.md#implicit-state-management)
      * [Sequence Scheduling Strategies](docs/architecture.md#scheduling-strateties)
        * [Direct](docs/architecture.md#direct)
        * [Oldest](docs/architecture.md#oldest)
  * [Rate Limiter](docs/rate_limiter.md)
  * [Model Warmup](docs/model_configuration.md#model-warmup)
  * [Inference Request/Response Cache](docs/model_configuration.md#response-cache)
* Model Pipeline
  * [Model Ensemble](docs/architecture.md#ensemble-models)
  * [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
* [Model Management](docs/model_management.md)
  * [Explicit Model Loading and Unloading](docs/model_management.md#model-control-mode-explicit)
  * [Modifying the Model Repository](docs/model_management.md#modifying-the-model-repository)
* [Metrics](docs/metrics.md)
* [Framework Custom Operations](docs/custom_operations.md)
  * [TensorRT](docs/custom_operations.md#tensorrt)
  * [TensorFlow](docs/custom_operations.md#tensorflow)
  * [PyTorch](docs/custom_operations.md#pytorch)
  * [ONNX](docs/custom_operations.md#onnx)
* [Client Libraries and Examples](https://github.com/triton-inference-server/client)
  * [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis)
  * [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis)
  * [Java HTTP Library](https://github.com/triton-inference-server/client/src/java)
  * GRPC Generated Libraries
    * [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go)
    * [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java)
* [Performance Analysis](docs/optimization.md)
  * [Model Analyzer](docs/model_analyzer.md)
  * [Performance Analyzer](docs/perf_analyzer.md)
  * [Inference Request Tracing](docs/trace.md)
* [Jetson and JetPack](docs/jetson.md)
The [quickstart](docs/quickstart.md) walks you through all the steps required
to install and run Triton with an example image classification model and then
use an example client application to perform inferencing using that model. The
quickstart also demonstrates how [Triton supports both GPU systems and
CPU-only systems](docs/quickstart.md#run-triton).
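
Not part of the quickstart itself, but a quick way to confirm that a Triton
started as described there is reachable is a liveness/readiness check. The
sketch below assumes the `tritonclient` Python package, the default HTTP port
8000, and a placeholder model name.

```python
# Hypothetical sketch: check server liveness/readiness and whether a given
# model is loaded. "my_model" is a placeholder.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("my_model"))
```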
The first step in using Triton to serve your models is to place one or more
models into a [model repository](docs/model_repository.md). Optionally,
depending on the type of the model and on what Triton capabilities you want to
enable for the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your model has
[custom operations](docs/custom_operations.md) you will need to make sure they
are loaded correctly by Triton.
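
To make the repository layout concrete, here is a sketch, not taken from the
documentation, that creates the expected directory structure and a minimal
`config.pbtxt`. The model name, tensor names and the ONNX Runtime platform are
illustrative assumptions, and several of these fields can be auto-generated
for some backends.

```python
# Hypothetical sketch of the on-disk model repository layout:
#   model_repository/<model-name>/config.pbtxt
#   model_repository/<model-name>/<version>/<model file>
from pathlib import Path

repo = Path("model_repository")        # passed to tritonserver --model-repository
version_dir = repo / "my_model" / "1"  # numeric version subdirectory
version_dir.mkdir(parents=True, exist_ok=True)

# The framework model file (model.onnx, model.plan, model.pt, ...) is copied
# into the version directory; omitted here.

(repo / "my_model" / "config.pbtxt").write_text(
    'name: "my_model"\n'
    'platform: "onnxruntime_onnx"\n'
    'max_batch_size: 8\n'
    'input [ { name: "INPUT0" data_type: TYPE_FP32 dims: [ 16 ] } ]\n'
    'output [ { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 16 ] } ]\n'
)
```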
After you have your model(s) available in Triton, you will want to send
inference and other requests to Triton from your *client* application. The
[Python and C++ client libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client) that
demonstrate how to use the libraries. You can also send HTTP/REST requests
directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).
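
For the "send HTTP/REST requests directly" path, a request can be as simple as
a JSON POST to the infer endpoint described in the inference protocols
documentation. The sketch below uses the `requests` Python package with
placeholder model and tensor names.

```python
# Hypothetical sketch: a direct JSON inference request without any client
# library. "my_model", "INPUT0" and "OUTPUT0" are placeholders.
import requests

body = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 16], "datatype": "FP32",
         "data": [0.0] * 16}
    ],
    "outputs": [{"name": "OUTPUT0"}],
}

resp = requests.post("http://localhost:8000/v2/models/my_model/infer", json=body)
resp.raise_for_status()
print(resp.json()["outputs"][0])
```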
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides the
[Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization efforts.
Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately for each
model. You can also enable cross-model prioritization using the [rate
limiter](docs/rate_limiter.md), which manages the rate at which requests are
scheduled on model instances. You may also want to consider combining multiple
models and pre/post-processing into a pipeline using
[ensembling](docs/architecture.md#ensemble-models) or [Business Logic
Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting).
A [Prometheus metrics endpoint](docs/metrics.md) allows you to visualize and
monitor aggregate inference metrics.
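
As a small illustration of the metrics endpoint mentioned above, the following
sketch scrapes the Prometheus text format from Triton's metrics port (8002 by
default) and prints the inference-related counters; exact metric names can
vary between releases.

```python
# Hypothetical sketch: read Triton's Prometheus metrics endpoint and print
# the nv_inference_* series.
import requests

text = requests.get("http://localhost:8002/metrics").text
for line in text.splitlines():
    if line.startswith("nv_inference_"):
        print(line)
```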
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use Triton.
As part of your deployment strategy you may want to [explicitly manage what
models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you are
using Kubernetes for deployment there are simple examples of how to deploy
Triton using Kubernetes and Helm: [GCP](deploy/gcp/README.md),
[AWS](deploy/aws/README.md), and [NVIDIA
FleetCommand](deploy/fleetcommand/README.md).
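
When the server runs with explicit model control
(`--model-control-mode=explicit`), models can be loaded and unloaded through
the protocol or the client libraries. Below is a sketch using the
`tritonclient` Python package with a placeholder model name.

```python
# Hypothetical sketch: explicitly load and unload a model on a server started
# with --model-control-mode=explicit. "my_model" is a placeholder.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_model")
print(client.is_model_ready("my_model"))    # expected to be True once loaded

for model in client.get_model_repository_index():
    print(model["name"], model.get("state"))

client.unload_model("my_model")
```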
The [version 1 to version 2 migration information](docs/v1_to_v2.md) is
helpful if you are moving to version 2 of Triton from previously using
version 1.
### Developer Documentation

* [Build](docs/build.md)
* [Protocols and APIs](docs/inference_protocols.md)
* [Backends](https://github.com/triton-inference-server/backend)
* [Repository Agents](docs/repository_agents.md)
* [Test](docs/test.md)
Triton can be [built using Docker](docs/build.md#building-triton-with-docker)
or [built without Docker](docs/build.md#building-triton-without-docker). After
building you should [test Triton](docs/test.md).
It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.
The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it easy to
communicate with the server. There are also a large number of [example
clients](https://github.com/triton-inference-server/client) that demonstrate
how to use the libraries. You can also develop your own clients that directly
communicate with Triton using [HTTP/REST or GRPC
protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked directly
into your application.
A [Triton backend](https://github.com/triton-inference-server/backend) is the
implementation that executes a model. A backend can interface with a deep
learning framework, like PyTorch, TensorFlow, TensorRT or ONNX Runtime; or it
can interface with a data processing framework like
[DALI](https://github.com/triton-inference-server/dali_backend); or you can
extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or [Python](https://github.com/triton-inference-server/python_backend).
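
To give a feel for the Python route, here is a minimal sketch of a `model.py`
for the Python backend; the tensor names are assumptions that must match the
model's `config.pbtxt`, and the full API is described in the python_backend
repository linked above.

```python
# Hypothetical sketch of a Python backend model.py that doubles its input.
# "INPUT0"/"OUTPUT0" are placeholders matching an assumed config.pbtxt.
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the model configuration as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy() * 2)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        pass
```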
A [Triton repository agent](docs/repository_agents.md) extends Triton with new
functionality that operates when a model is loaded or unloaded. You can
introduce your own code to perform authentication, decryption, conversion, or
similar operations when a model is loaded.
## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA
  GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT Inference
  Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT Inference
  Server and Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

* [Deploying NVIDIA Triton at Scale with MIG and
  Kubernetes](https://developer.nvidia.com/blog/deploying-nvidia-triton-at-scale-with-mig-and-kubernetes/).
## Contributing

Contributions to Triton Inference Server are more than welcome. To contribute,
make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client, example or
similar contribution that is not modifying the core of Triton, then you should
file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).
## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this project.
When help with code is needed, follow the process outlined in the Stack
Overflow (<https://stackoverflow.com/help/mcve>) document. Ensure posted
examples are:

* minimal – use as little code as possible that still produces the same
  problem

* complete – provide all parts needed to reproduce the problem. Check if you
  can strip external dependencies and still show the problem. The less time we
  spend on reproducing problems the more time we have to fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not related to
  your request/question.
