Triton Inference Server
=======================

.. overview-begin-marker-do-not-remove

Triton Inference Server provides a cloud inferencing solution
optimized for both CPUs and GPUs. Triton provides an inference service
via an HTTP/REST or GRPC endpoint, allowing remote clients to request
inferencing for any model being managed by the server. For edge
deployments, Triton is also available as a shared library with a C API
that allows the full functionality of Triton to be included directly
in an application.
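
As a quick illustration of the HTTP/REST path, the following is a
minimal sketch that sends an inference request using the Python client
library described below. The server address, model name, and tensor
names and shapes are placeholders; substitute the values for your own
model::

    import numpy as np
    import tritonclient.http as httpclient

    # Connect to Triton's HTTP/REST endpoint (port 8000 by default).
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Describe the request: one FP32 input tensor and one requested output.
    # The model name and tensor names are placeholders for illustration.
    inputs = [httpclient.InferInput("INPUT0", [1, 16], "FP32")]
    inputs[0].set_data_from_numpy(np.ones((1, 16), dtype=np.float32))
    outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

    # Send the request and read the output tensor back as a numpy array.
    result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
    print(result.as_numpy("OUTPUT0"))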

What's New In 2.3.0
-------------------

* The Python client library is now a pip package available from the NVIDIA
  pypi index. See
  https://github.com/triton-inference-server/server/blob/master/src/clients/python/library/README.md
  for more information, and the sketch after this list for a quick check
  of an installed client.

* Fix a performance issue in the HTTP/REST protocol and the Python client
  library that reduced performance when outputs were not requested
  explicitly in an inference request.

* Fix several bugs in the reporting of statistics for ensemble models.

* GRPC updated to version 1.25.0.
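
As a minimal sketch of the pip-packaged client, the package provides
both HTTP and GRPC modules. The module and class names below follow
the client library README linked above; the server address is a
placeholder::

    import tritonclient.grpc as grpcclient

    # Query the GRPC endpoint (port 8001 by default) for basic server status.
    client = grpcclient.InferenceServerClient(url="localhost:8001")

    print("live:", client.is_server_live())
    print("ready:", client.is_server_ready())
    print(client.get_server_metadata())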

Features
--------

* `Multiple framework support
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#framework-model-definition>`_. Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are supported. Also
  supports TensorFlow-TensorRT and ONNX-TensorRT integrated
  models. Variable-size input and output tensors are allowed if
  supported by the framework. See `Capabilities
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/capabilities.html#capabilities>`_
  for information about each framework.

* `Concurrent model execution support
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html#instance-groups>`_. Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU.

* Batching support. For models that support batching, Triton can
  accept requests for a batch of inputs and respond with the
  corresponding batch of outputs. Triton also supports multiple
  `scheduling and batching
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_configuration.html#scheduling-and-batching>`_
  algorithms that combine individual inference requests together to
  improve inference throughput. These scheduling and batching
  decisions are transparent to the client requesting inference.

* `Custom backend support
  <https://github.com/triton-inference-server/server/blob/master/docs/backend.rst>`_. Triton
  allows individual models to be implemented with custom backends
  instead of by a deep-learning framework. With a custom backend a
  model can implement any logic desired, while still benefiting from
  the CPU and GPU support, concurrent execution, dynamic batching and
  other features provided by Triton.

* `Ensemble support
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
  ensemble represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* Multi-GPU support. Triton can distribute inferencing across all
  system GPUs.

* Triton provides `multiple modes for model management
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_management.html>`_. These
  model management modes allow for both implicit and explicit loading
  and unloading of models without requiring a server restart.

* `Model repositories
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html#>`_
  may reside on a locally accessible file system (e.g. NFS), in Google
  Cloud Storage or in Amazon S3.

* HTTP/REST and GRPC `inference protocols
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
  based on the community-developed `KFServing protocol
  <https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2>`_.

* Readiness and liveness `health endpoints
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
  suitable for any orchestration or deployment framework, such as
  Kubernetes. A small sketch that probes these endpoints follows this
  feature list.

* `Metrics
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/metrics.html>`_
  indicating GPU utilization, server throughput, and server
  latency. The metrics are provided in Prometheus data format.

* `C library interface
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/library_api.html>`_
  allows the full functionality of Triton to be included directly in
  an application.
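
As a small illustration of the health and metrics features above, the
following sketch probes the readiness endpoints defined by the
KFServing v2 protocol and scrapes the Prometheus metrics. It assumes
the default ports (8000 for HTTP/REST, 8002 for metrics), uses the
third-party requests package, and refers to a placeholder model name::

    import requests

    base = "http://localhost:8000"  # HTTP/REST endpoint (default port)

    # Server liveness and readiness endpoints.
    print("live:", requests.get(base + "/v2/health/live").status_code == 200)
    print("ready:", requests.get(base + "/v2/health/ready").status_code == 200)

    # Per-model readiness; "my_model" is a placeholder name.
    r = requests.get(base + "/v2/models/my_model/ready")
    print("model ready:", r.status_code == 200)

    # Prometheus-format metrics are served on a separate port (8002 by default).
    print(requests.get("http://localhost:8002/metrics").text[:500])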

.. overview-end-marker-do-not-remove

The current release of the Triton Inference Server is 2.2.0 and
corresponds to the 20.08 release of the tritonserver container on
`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
this release is `r20.08
<https://github.com/triton-inference-server/server/tree/r20.08>`_.

Backwards Compatibility
-----------------------

Version 2 of Triton is beta quality, so you should expect some changes
to the server and client protocols and APIs. Version 2 of Triton does
not generally maintain backwards compatibility with version 1.
Specifically, you should take the following items into account when
transitioning from version 1 to version 2:

* The Triton executables and libraries are in /opt/tritonserver. The
  Triton executable is /opt/tritonserver/bin/tritonserver.

* Some *tritonserver* command-line arguments are removed, changed, or
  have different default behavior in version 2.

  * --api-version, --http-health-port, --grpc-infer-thread-count,
    --grpc-stream-infer-thread-count, --allow-poll-model-repository,
    --allow-model-control and --tf-add-vgpu are removed.

  * The default for --model-control-mode is changed to *none*. A
    sketch of loading models explicitly through the client API follows
    this list.

  * --tf-allow-soft-placement and --tf-gpu-memory-fraction are renamed
    to --backend-config="tensorflow,allow-soft-placement=<true,false>"
    and --backend-config="tensorflow,gpu-memory-fraction=<float>".

* The HTTP/REST and GRPC protocols, while conceptually similar to
  version 1, are completely changed in version 2. See the `inference
  protocols
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/http_grpc_api.html>`_
  section of the documentation for more information.

* Python and C++ client libraries are re-implemented to match the new
  HTTP/REST and GRPC protocols. The Python client no longer depends on
  a C++ shared library and so should be usable on any platform that
  supports Python. See the `client libraries
  <https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_library.html>`_
  section of the documentation for more information.

* The version 2 cmake build requires these changes:

  * The cmake flag names have changed from having a TRTIS prefix to
    having a TRITON prefix. For example, TRITON_ENABLE_TENSORRT.

  * The build targets are *server*, *client* and *custom-backend* to
    build the server, client libraries and examples, and custom
    backend SDK, respectively.

* In the Docker containers the environment variables indicating the
  Triton version have changed to have a TRITON prefix, for example,
  TRITON_SERVER_VERSION.
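
As an illustration of the model control changes above: when the server
is started with --model-control-mode=explicit, models can be loaded
and unloaded at runtime through the client API. The following is a
minimal sketch; the server address and model name are placeholders::

    import tritonclient.http as httpclient

    # Assumes tritonserver was started with --model-control-mode=explicit
    # and that "my_model" exists in the model repository.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    client.load_model("my_model")               # explicit load
    print(client.is_model_ready("my_model"))    # expect True once loaded
    client.unload_model("my_model")             # explicit unload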

Documentation
-------------

The User Guide, Developer Guide, and API Reference `documentation for
the current release
<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html>`_
provide guidance on installing, building, and running Triton Inference
Server.

You can also view the `documentation for the master branch
<https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/index.html>`_
and for `earlier releases
<https://docs.nvidia.com/deeplearning/triton-inference-server/archives/index.html>`_.

NVIDIA publishes a number of `deep learning examples that use Triton
<https://github.com/NVIDIA/DeepLearningExamples>`_.

An `FAQ
<https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/faq.html>`_
provides answers to frequently asked questions.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, `deploy/single_server/README.rst
<https://github.com/triton-inference-server/server/tree/master/deploy/single_server/README.rst>`_.

The `Release Notes
<https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html>`_
and `Support Matrix
<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by Triton.

Presentations and Papers
^^^^^^^^^^^^^^^^^^^^^^^^

* `Maximizing Deep Learning Inference Performance with NVIDIA Model Analyzer <https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/>`_.

* `High-Performance Inferencing at Scale Using the TensorRT Inference Server <https://developer.nvidia.com/gtc/2020/video/s22418>`_.

* `Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing <https://developer.nvidia.com/gtc/2020/video/s22459>`_.

* `Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU <https://developer.nvidia.com/gtc/2020/video/s21736>`_.

* `Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server
  <https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.

* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
  <https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.

* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and Kubeflow
  <https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.

Contributing
------------

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the `Contributing <CONTRIBUTING.md>`_ document.

Reporting problems, asking questions
------------------------------------

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow document on creating a minimal, complete, verifiable
example (https://stackoverflow.com/help/mcve). Ensure posted examples
are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
   :target: https://opensource.org/licenses/BSD-3-Clause