
Commit 3159e73

Update README and versions for 20.09 release
1 parent 0808a2a commit 3159e73

3 files changed, 244 insertions(+), 9 deletions(-)

Dockerfile (+4 -4)

@@ -146,8 +146,8 @@ FROM ${TENSORFLOW2_IMAGE} AS tritonserver_tf2
 ############################################################################
 FROM ${BASE_IMAGE} AS tritonserver_build
 
-ARG TRITON_VERSION=2.3.0dev
-ARG TRITON_CONTAINER_VERSION=20.09dev
+ARG TRITON_VERSION=2.3.0
+ARG TRITON_CONTAINER_VERSION=20.09
 
 # libgoogle-glog0v5 is needed by caffe2 libraries.
 # libcurl4-openSSL-dev is needed for GCS
@@ -374,8 +374,8 @@ ENTRYPOINT ["/opt/tritonserver/nvidia_entrypoint.sh"]
 ############################################################################
 FROM ${BASE_IMAGE}
 
-ARG TRITON_VERSION=2.3.0dev
-ARG TRITON_CONTAINER_VERSION=20.09dev
+ARG TRITON_VERSION=2.3.0
+ARG TRITON_CONTAINER_VERSION=20.09
 
 ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
 ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
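The ENV instructions above make the release version visible inside the
container. A minimal sketch of reading those values at runtime, assuming the
stock tritonserver image and that the variables are not overridden::

    import os

    # TRITON_SERVER_VERSION and NVIDIA_TRITON_SERVER_VERSION are set by the
    # ENV instructions in the Dockerfile hunk shown above.
    print("Triton version:", os.environ.get("TRITON_SERVER_VERSION"))            # e.g. 2.3.0
    print("Container version:", os.environ.get("NVIDIA_TRITON_SERVER_VERSION"))  # e.g. 20.09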

README.rst (+239 -4)

@@ -30,13 +30,248 @@
 Triton Inference Server
 =======================
 
-**NOTE: You are currently on the r20.09 branch which tracks
-stabilization towards teh next release. This branch is not usable
-during stabilization.**
-
 .. overview-begin-marker-do-not-remove
 
+Triton Inference Server provides a cloud inferencing solution
+optimized for both CPUs and GPUs. Triton provides an inference service
+via an HTTP/REST or GRPC endpoint, allowing remote clients to request
+inferencing for any model being managed by the server. For edge
+deployments, Triton is also available as a shared library with a C API
+that allows the full functionality of Triton to be included directly
+in an application.
+
+What's New In 2.3.0
+-------------------
+
+* The Python client library is now a pip package available from the NVIDIA
+  PyPI index. See
+  https://github.com/triton-inference-server/server/blob/master/src/clients/python/library/README.md
+  for more information.
+
+* Fix a performance issue with the HTTP/REST protocol and the Python client
+  library that caused reduced performance when outputs were not requested
+  explicitly in an inference request.
+
+* Fix some bugs in the reporting of statistics for ensemble models.
+
+* GRPC updated to version 1.25.0.
+
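A minimal sketch of the pip-packaged Python client described above, assuming
the tritonclient package layout documented in the linked client README (the
exact package and module names for the 2.3.0 wheel may differ), a server
reachable at localhost:8000, and a hypothetical model "my_model" with one FP32
input INPUT0 and one output OUTPUT0::

    # Install from the NVIDIA PyPI index (see the linked README for the
    # authoritative package names):
    #   pip install nvidia-pyindex
    #   pip install tritonclient[http]

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # "my_model", "INPUT0" and "OUTPUT0" are hypothetical names; substitute
    # the names and shapes from your model configuration.
    inp = httpclient.InferInput("INPUT0", [1, 4], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))

    # Requesting the output explicitly avoids the performance issue noted in
    # the 2.3.0 changelog above.
    out = httpclient.InferRequestedOutput("OUTPUT0")

    result = client.infer("my_model", inputs=[inp], outputs=[out])
    print(result.as_numpy("OUTPUT0"))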
+Features
+--------
+
+* `Multiple framework support
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#framework-model-definition>`_. Triton
+  can manage any number and mix of models (limited by system disk and
+  memory resources). Supports TensorRT, TensorFlow GraphDef,
+  TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
+  formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Also
+  supports TensorFlow-TensorRT and ONNX-TensorRT integrated
+  models. Variable-size input and output tensors are allowed if
+  supported by the framework. See `Capabilities
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/capabilities.html#capabilities>`_
+  for information about each framework.
+
+* `Concurrent model execution support
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html#instance-groups>`_. Multiple
+  models (or multiple instances of the same model) can run
+  simultaneously on the same GPU.
+
+* Batching support. For models that support batching, Triton can
+  accept requests for a batch of inputs and respond with the
+  corresponding batch of outputs. Triton also supports multiple
+  `scheduling and batching
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html#scheduling-and-batching>`_
+  algorithms that combine individual inference requests together to
+  improve inference throughput. These scheduling and batching
+  decisions are transparent to the client requesting inference.
+
+* `Custom backend support
+  <https://github.com/triton-inference-server/server/blob/master/docs/backend.rst>`_. Triton
+  allows individual models to be implemented with custom backends
+  instead of by a deep-learning framework. With a custom backend a
+  model can implement any logic desired, while still benefiting from
+  the CPU and GPU support, concurrent execution, dynamic batching and
+  other features provided by Triton.
+
+* `Ensemble support
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
+  ensemble represents a pipeline of one or more models and the
+  connection of input and output tensors between those models. A
+  single inference request to an ensemble will trigger the execution
+  of the entire pipeline.
+
+* Multi-GPU support. Triton can distribute inferencing across all
+  system GPUs.
+
+* Triton provides `multiple modes for model management
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_management.html>`_. These
+  model management modes allow for both implicit and explicit loading
+  and unloading of models without requiring a server restart.
+
+* `Model repositories
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#>`_
+  may reside on a locally accessible file system (e.g. NFS), in Google
+  Cloud Storage or in Amazon S3.
+
+* HTTP/REST and GRPC `inference protocols
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
+  based on the community-developed `KFServing protocol
+  <https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2>`_.
+
+* Readiness and liveness `health endpoints
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
+  suitable for any orchestration or deployment framework, such as
+  Kubernetes.
+
+* `Metrics
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/metrics.html>`_
+  indicating GPU utilization, server throughput, and server
+  latency. The metrics are provided in Prometheus data format.
+
+* `C library interface
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/library_api.html>`_
+  allows the full functionality of Triton to be included directly in
+  an application.
+
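The inference-protocol, health-endpoint and metrics bullets above can be
exercised directly over HTTP. A minimal sketch using the Python requests
package, assuming a server on the default ports (8000 for HTTP/REST, 8002 for
metrics) and a hypothetical model "my_model" with a single INT32 input named
INPUT0::

    import requests

    HTTP = "http://localhost:8000"     # default HTTP/REST port (assumption)
    METRICS = "http://localhost:8002"  # default metrics port (assumption)

    # Liveness/readiness endpoints from the KFServing v2 predict protocol;
    # both return 200 when the server is live and ready, which is what makes
    # them usable as Kubernetes probes.
    assert requests.get(f"{HTTP}/v2/health/live").status_code == 200
    assert requests.get(f"{HTTP}/v2/health/ready").status_code == 200

    # Inference request. The model name, input name, shape and datatype are
    # hypothetical; use the values from your model configuration.
    payload = {
        "inputs": [
            {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32",
             "data": list(range(16))}
        ]
    }
    resp = requests.post(f"{HTTP}/v2/models/my_model/infer", json=payload)
    print(resp.json())

    # Prometheus-format metrics (GPU utilization, throughput, latency, ...).
    print(requests.get(f"{METRICS}/metrics").text[:300])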
 .. overview-end-marker-do-not-remove
 
+The current release of the Triton Inference Server is 2.2.0 and
+corresponds to the 20.08 release of the tensorrtserver container on
+`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
+this release is `r20.08
+<https://github.com/triton-inference-server/server/tree/r20.08>`_.
+
+Backwards Compatibility
+-----------------------
+
+Version 2 of Triton is beta quality, so you should expect some changes
+to the server and client protocols and APIs. Version 2 of Triton does
+not generally maintain backwards compatibility with version 1.
+Specifically, you should take the following items into account when
+transitioning from version 1 to version 2:
+
+* The Triton executables and libraries are in /opt/tritonserver. The
+  Triton executable is /opt/tritonserver/bin/tritonserver.
+
+* Some *tritonserver* command-line arguments are removed, changed or
+  have different default behavior in version 2.
+
+  * --api-version, --http-health-port, --grpc-infer-thread-count,
+    --grpc-stream-infer-thread-count, --allow-poll-model-repository,
+    --allow-model-control and --tf-add-vgpu are removed.
+
+  * The default for --model-control-mode is changed to *none*.
+
+  * --tf-allow-soft-placement and --tf-gpu-memory-fraction are renamed
+    to --backend-config="tensorflow,allow-soft-placement=<true,false>"
+    and --backend-config="tensorflow,gpu-memory-fraction=<float>".
+
+* The HTTP/REST and GRPC protocols, while conceptually similar to
+  version 1, are completely changed in version 2. See the `inference
+  protocols
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
+  section of the documentation for more information.
+
+* Python and C++ client libraries are re-implemented to match the new
+  HTTP/REST and GRPC protocols. The Python client no longer depends on
+  a C++ shared library and so should be usable on any platform that
+  supports Python. See the `client libraries
+  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_library.html>`_
+  section of the documentation for more information.
+
+* The version 2 cmake build requires these changes:
+
+  * The cmake flag names have changed from having a TRTIS prefix to
+    having a TRITON prefix. For example, TRITON_ENABLE_TENSORRT.
+
+  * The build targets are *server*, *client* and *custom-backend* to
+    build the server, client libraries and examples, and custom
+    backend SDK, respectively.
+
+* In the Docker containers the environment variables indicating the
+  Triton version have changed to have a TRITON prefix, for example,
+  TRITON_SERVER_VERSION.
+
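As a concrete illustration of the renamed command-line arguments listed above,
a version-2 server launch might look like the following sketch. The subprocess
wrapper is only there to keep the example in Python; the model repository path
/models and the memory-fraction value are hypothetical::

    import subprocess

    # The version-1 flags --tf-allow-soft-placement and
    # --tf-gpu-memory-fraction are expressed through --backend-config in
    # version 2; --model-control-mode defaults to "none" and is shown only
    # for clarity.
    cmd = [
        "tritonserver",
        "--model-repository=/models",
        "--model-control-mode=none",
        "--backend-config=tensorflow,allow-soft-placement=true",
        "--backend-config=tensorflow,gpu-memory-fraction=0.5",
    ]
    subprocess.run(cmd, check=True)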
+Documentation
+-------------
+
+The User Guide, Developer Guide, and API Reference `documentation for
+the current release
+<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html>`_
+provide guidance on installing, building, and running Triton Inference
+Server.
+
+You can also view the `documentation for the master branch
+<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/index.html>`_
+and for `earlier releases
+<https://docs.nvidia.com/deeplearning/triton-inference-server/archives/index.html>`_.
+
+NVIDIA publishes a number of `deep learning examples that use Triton
+<https://github.com/NVIDIA/DeepLearningExamples>`_.
+
+An `FAQ
+<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/faq.html>`_
+provides answers for frequently asked questions.
+
+READMEs for deployment examples can be found in subdirectories of
+deploy/, for example, `deploy/single_server/README.rst
+<https://github.com/triton-inference-server/server/tree/master/deploy/single_server/README.rst>`_.
+
+The `Release Notes
+<https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html>`_
+and `Support Matrix
+<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
+indicate the required versions of the NVIDIA Driver and CUDA, and also
+describe which GPUs are supported by Triton.
+
+Presentations and Papers
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+* `Maximizing Deep Learning Inference Performance with NVIDIA Model Analyzer <https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/>`_.
+
+* `High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>`_.
+
+* `Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>`_.
+
+* `Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>`_.
+
+* `Maximizing Utilization for Data Center Inference with TensorRT
+  Inference Server
+  <https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.
+
+* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
+  <https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.
+
+* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
+  Inference Server and Kubeflow
+  <https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.
+
+Contributing
+------------
+
+Contributions to Triton Inference Server are more than welcome. To
+contribute, make a pull request and follow the guidelines outlined in
+the `Contributing <CONTRIBUTING.md>`_ document.
+
+Reporting problems, asking questions
+------------------------------------
+
+We appreciate any feedback, questions or bug reports regarding this
+project. When help with code is needed, follow the process outlined in
+the Stack Overflow document (https://stackoverflow.com/help/mcve).
+Ensure posted examples are:
+
+* minimal – use as little code as possible that still produces the
+  same problem
+
+* complete – provide all parts needed to reproduce the problem. Check
+  if you can strip external dependencies and still show the problem. The
+  less time we spend on reproducing problems the more time we have to
+  fix them.
+
+* verifiable – test the code you're about to provide to make sure it
+  reproduces the problem. Remove all other problems that are not
+  related to your request/question.
+
 .. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
    :target: https://opensource.org/licenses/BSD-3-Clause

VERSION (+1 -1)

@@ -1 +1 @@
-2.3.0dev
+2.3.0
