
Commit b431e28

Update README and versions for 20.11 release
1 parent 09e386d

File tree: 2 files changed, +255 −7 lines changed

README.md

+255 −2
@@ -30,5 +30,258 @@
 # Triton Inference Server

-**NOTE: You are currently on the r20.11 branch which tracks stabilization
-towards the next release. This branch is not usable during stabilization.**
Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports an HTTP/REST and GRPC protocol that allows remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
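For example, the HTTP/REST protocol can be exercised with any HTTP library. The following is a minimal sketch (not part of this repository) that assumes Triton is listening on its default HTTP port 8000 and is serving a hypothetical model named `simple` with a single FP32 input tensor `INPUT0`:

```python
# Minimal sketch of an HTTP/REST inference request using the v2 protocol.
# The model name "simple" and the tensor name "INPUT0" are placeholders;
# they must match a model in your model repository.
import requests

request_body = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],
        }
    ]
}

# POST to the per-model inference endpoint.
response = requests.post(
    "http://localhost:8000/v2/models/simple/infer", json=request_body
)
response.raise_for_status()

# The JSON response carries the output tensors.
for output in response.json()["outputs"]:
    print(output["name"], output["shape"], output["data"])
```

The GRPC protocol and the C API expose the same inference capabilities.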
## What's New in 2.5.0

* ONNX Runtime backend updated to use ONNX Runtime 1.5.3.

* The PyTorch backend has moved to a dedicated repository, triton-inference-server/pytorch_backend.

* The Caffe2 backend is removed. Caffe2 models are no longer supported.

* Fixed handling of failed model reloads. If a model reload fails, the currently loaded version of the model remains loaded and its availability is uninterrupted.

* Triton Model Analyzer is released in the Triton SDK container and as a pip package available on NVIDIA PyIndex.
## Features

* [Multiple deep-learning frameworks](https://github.com/triton-inference-server/backend). Triton can manage any number and mix of models (limited by system disk and memory resources). Triton supports TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, and PyTorch TorchScript model formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Triton also supports TensorFlow-TensorRT and ONNX-TensorRT integrated models.

* [Concurrent model execution](docs/architecture.md#concurrent-model-execution). Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For models that support batching, Triton implements multiple scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.

* [Extensible backends](https://github.com/triton-inference-server/backend). In addition to deep-learning frameworks, Triton provides a *backend API* that allows Triton to be extended with any model execution logic implemented in [Python](https://github.com/triton-inference-server/python_backend) or [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api), while still benefiting from the CPU and GPU support, concurrent execution, dynamic batching, and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). Triton *ensembles* represent a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.

* [HTTP/REST and GRPC inference protocols](docs/inference_protocols.md) based on the community-developed [KFServing protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* [Metrics](docs/metrics.md) indicating GPU utilization, server throughput, and server latency. The metrics are provided in Prometheus data format.
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level overview of the structure and capabilities of the inference server. There is also an [FAQ](docs/faq.md). Additional documentation is divided into [*user*](#user-documentation) and [*developer*](#developer-documentation) sections. The *user* documentation describes how to use Triton as an inference solution, including information on how to configure Triton, how to organize and configure your models, how to use the C++ and Python clients, etc. The *developer* documentation describes how to build and test Triton and also how Triton can be extended with new functionality.

The Triton [Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html) and [Support Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html) indicate the required versions of the NVIDIA Driver and CUDA, and also describe supported GPUs.
### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries](docs/client_libraries.md)
- [Client Examples](docs/client_examples.md)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)
The [quickstart](docs/quickstart.md) walks you through all the steps required to install and run Triton with an example image classification model and then use an example client application to perform inferencing using that model. The quickstart also demonstrates how [Triton supports both GPU systems and CPU-only systems](docs/quickstart.md#run-triton).
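Once Triton is running, a quick way to confirm that the server is up is to poll the health endpoints exposed by the HTTP/REST protocol. A minimal sketch, assuming the default HTTP port 8000:

```python
# Check that a running Triton server is live and ready (default HTTP port 8000).
import requests

# Returns HTTP 200 when the server is ready to accept inference requests.
ready = requests.get("http://localhost:8000/v2/health/ready")
print("server ready:", ready.status_code == 200)

# Per-model readiness; "densenet_onnx" is the quickstart example model name
# and should be replaced with a model from your own repository.
model_ready = requests.get("http://localhost:8000/v2/models/densenet_onnx/ready")
print("model ready:", model_ready.status_code == 200)
```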
The first step in using Triton to serve your models is to place one or more models into a [model repository](docs/model_repository.md). Optionally, depending on the type of the model and on what Triton capabilities you want to enable for the model, you may need to create a [model configuration](docs/model_configuration.md) for the model. If your model has [custom operations](docs/custom_operations.md) you will need to make sure they are loaded correctly by Triton.
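To make the expected layout concrete, the sketch below builds a minimal repository for a hypothetical ONNX model; the model name, tensor names, data types, and dimensions are placeholders and must match your actual model:

```python
# Sketch: create a minimal model repository for a hypothetical ONNX model.
# Layout:  model_repository/<model-name>/config.pbtxt
#          model_repository/<model-name>/<version>/model.onnx
from pathlib import Path
import shutil

repo = Path("model_repository")
model_dir = repo / "my_model"       # placeholder model name
version_dir = model_dir / "1"       # numeric version subdirectory
version_dir.mkdir(parents=True, exist_ok=True)

# Minimal config.pbtxt; names, types and dims must match the actual model.
config = """
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  { name: "INPUT0" data_type: TYPE_FP32 dims: [ 4 ] }
]
output [
  { name: "OUTPUT0" data_type: TYPE_FP32 dims: [ 4 ] }
]
"""
(model_dir / "config.pbtxt").write_text(config.strip() + "\n")

# Place the trained model file in the version directory.
shutil.copy("my_model.onnx", version_dir / "model.onnx")
```

The server is then pointed at the repository with `tritonserver --model-repository=/path/to/model_repository`.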
After you have your model(s) available in Triton, you will want to send inference and other requests to Triton from your *client* application. The [Python and C++ client libraries](docs/client_libraries.md) provide [APIs](docs/client_libraries.md#client-library-apis) to simplify this communication. There are also a large number of [client examples](docs/client_examples.md) that demonstrate how to use the libraries. You can also send HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or [generate a GRPC client for many other languages](docs/client_libraries.md).
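As an illustration of the client library API, here is a minimal sketch using the Python HTTP client from the `tritonclient` package; the model and tensor names are placeholders:

```python
# Sketch: run an inference request with the Triton Python HTTP client library.
# Assumes the `tritonclient` package is installed; names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor from a NumPy array.
data = np.random.rand(1, 4).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)

# Ask for a specific output tensor and run the request.
outputs = [httpclient.InferRequestedOutput("OUTPUT0")]
result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)

print(result.as_numpy("OUTPUT0"))
```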
Understanding and [optimizing performance](docs/optimization.md) is an important part of deploying your models. The Triton project provides the [Performance Analyzer](docs/perf_analyzer.md) and the [Model Analyzer](docs/model_analyzer.md) to help your optimization efforts. Specifically, you will want to optimize [scheduling and batching](docs/architecture.md#models-and-schedulers) and [model instances](docs/model_configuration.md#instance-groups) appropriately for each model. You may also want to consider [ensembling multiple models and pre/post-processing](docs/architecture.md#ensemble-models) into a pipeline. In some cases you may find [individual inference request trace data](docs/trace.md) useful when optimizing. A [Prometheus metrics endpoint](docs/metrics.md) allows you to visualize and monitor aggregate inference metrics.
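The metrics endpoint is plain HTTP, so it can be scraped by Prometheus or inspected directly. A minimal sketch, assuming the default metrics port 8002:

```python
# Sketch: read Triton's Prometheus metrics endpoint (default port 8002).
import requests

metrics_text = requests.get("http://localhost:8002/metrics").text

# Metrics are returned in the Prometheus text exposition format; print the
# inference-related counters (metric names prefixed with "nv_inference_").
for line in metrics_text.splitlines():
    if line.startswith("nv_inference_"):
        print(line)
```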
NVIDIA publishes a number of [deep learning examples](https://github.com/NVIDIA/DeepLearningExamples) that use Triton.
As part of your deployment strategy you may want to [explicitly manage what models are available by loading and unloading models](docs/model_management.md) from a running Triton server. If you are using Kubernetes for deployment, a simple example of how to [deploy Triton using Kubernetes and Helm](deploy/single_server/README.rst) may be helpful.
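When the server is started with explicit model control (for example, `tritonserver --model-control-mode=explicit`), models can be loaded and unloaded at runtime through the model repository extension of the HTTP/REST protocol. A minimal sketch, using a placeholder model name:

```python
# Sketch: load and unload a model at runtime via the model repository
# extension of the HTTP/REST protocol. Requires the server to be started
# with --model-control-mode=explicit. "my_model" is a placeholder name.
import requests

base = "http://localhost:8000/v2/repository"

# Load (or reload) the model from the model repository.
requests.post(f"{base}/models/my_model/load").raise_for_status()

# List the models known to the repository and their current state.
for entry in requests.post(f"{base}/index").json():
    print(entry)

# Unload the model when it is no longer needed.
requests.post(f"{base}/models/my_model/unload").raise_for_status()
```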
The [version 1 to version 2 migration information](docs/v1_to_v2.md) is helpful if you are moving from version 1 of Triton to version 2.
### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Test](docs/test.md)
Triton can be [built using Docker](docs/build.md#building-triton-with-docker) or [built without Docker](docs/build.md#building-triton-without-docker). After building you should [test Triton](docs/test.md).

Starting with the r20.10 release, it is also possible to [create a Docker image containing a customized Triton](docs/compose.md) that contains only a subset of the backends.
The Triton project also provides [client libraries for Python and C++](docs/client_libraries.md) that make it easy to communicate with the server. There are also a large number of [example clients](docs/client_examples.md) that demonstrate how to use the libraries. You can also develop your own clients that directly communicate with Triton using [HTTP/REST or GRPC protocols](docs/inference_protocols.md). There is also a [C API](docs/inference_protocols.md) that allows Triton to be linked directly into your application.
A [Triton backend](https://github.com/triton-inference-server/backend) is the implementation that executes a model. A backend can interface with a deep learning framework, like PyTorch, TensorFlow, TensorRT or ONNX Runtime; or it can interface with a data processing framework like [DALI](https://github.com/triton-inference-server/dali_backend); or it can be custom [C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api) or [Python](https://github.com/triton-inference-server/python_backend) code for performing any operation. You can even extend Triton by [writing your own backend](https://github.com/triton-inference-server/backend).
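To give a feel for the backend API, the sketch below shows the skeleton of a `model.py` as loaded by the Python backend; the tensor names are placeholders that must match the model configuration, and the identity computation stands in for real model execution logic:

```python
# Sketch of a Python backend model.py (loaded by python_backend).
# Tensor names "INPUT0"/"OUTPUT0" are placeholders matching config.pbtxt.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args contains the model configuration and instance information.
        pass

    def execute(self, requests):
        # Triton passes a batch of requests; return one response per request.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            # Real model-execution logic would go here; identity is a placeholder.
            output0 = pb_utils.Tensor("OUTPUT0", input0)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Called when the model is unloaded.
        pass
```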
## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT Inference Server and Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).
## Contributing

Contributions to Triton Inference Server are more than welcome. To contribute, make a pull request and follow the guidelines outlined in [CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client, example or similar contribution that is not modifying the core of Triton, then you should file a PR in the [contrib repo](https://github.com/triton-inference-server/contrib).
## Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project. When help with code is needed, follow the process outlined in the Stack Overflow (https://stackoverflow.com/help/mcve) document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the same problem

* complete – provide all parts needed to reproduce the problem. Check if you can strip external dependencies and still show the problem. The less time we spend on reproducing problems the more time we have to fix them

* verifiable – test the code you're about to provide to make sure it reproduces the problem. Remove all other problems that are not related to your request/question.

build.py

+0 −5
@@ -57,11 +57,6 @@
 # ORT openvino version
 # )
 TRITON_VERSION_MAP = {
-    '2.2.0': ('20.08', '20.08', '1.4.0', '2020.2'),
-    '2.3.0': ('20.09', '20.09', '1.4.0', '2020.2'),
-    '2.4.0dev': ('20.09', '20.09', '1.4.0', '2020.2'),
-    '2.4.0': ('20.10', '20.10', '1.4.0', '2020.2'),
-    '2.5.0dev': ('20.11dev', '20.10', '1.5.3', '2020.4'),
     '2.5.0': ('20.11', '20.11', '1.5.3', '2020.4')
 }
