# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
GRPC protocol that allows remote clients to request inferencing for
any model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an application.

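As a quick illustration of the HTTP/REST protocol, the sketch below
checks server readiness and fetches model metadata with plain Python.
It assumes a locally running server on the default HTTP port 8000 and
a hypothetical model name; neither is prescribed by this README.

```python
# Hedged sketch: query a locally running Triton server over HTTP/REST.
import requests

TRITON_URL = "http://localhost:8000"  # default HTTP port is an assumption

# Liveness/readiness endpoints defined by the KFServing v2 protocol.
ready = requests.get(f"{TRITON_URL}/v2/health/ready")
print("server ready:", ready.status_code == 200)

# Metadata for a hypothetical model named "my_onnx_model".
meta = requests.get(f"{TRITON_URL}/v2/models/my_onnx_model")
print(meta.json())
```
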
## What's New in 2.5.0

* The ONNX Runtime backend is updated to use ONNX Runtime 1.5.3.

* The PyTorch backend is moved to a dedicated repo,
  triton-inference-server/pytorch_backend.

* The Caffe2 backend is removed. Caffe2 models are no longer supported.

* Fixed handling of failed model reloads. If a model reload fails, the
  currently loaded version of the model remains loaded and its
  availability is uninterrupted.

* Triton Model Analyzer is released in the Triton SDK container and as
  a pip package available on the NVIDIA PyIndex.

## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, and PyTorch TorchScript model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching, and other features provided by
  Triton. A minimal Python backend sketch is shown after this list.

* [Model pipelines](docs/architecture.md#ensemble-models). A Triton
  *ensemble* represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

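As promised above, here is a minimal sketch of a Python backend
`model.py`. It follows the `TritonPythonModel` interface described in
the python_backend repository; the tensor names `INPUT0`/`OUTPUT0` and
the identity logic are assumptions used only for illustration.

```python
import json

# Provided by the Triton Python backend at runtime (not a pip package).
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is the model configuration as a JSON string.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Triton may hand the backend several requests at once; answer each.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Identity "model": copy the input tensor to the output tensor.
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        # Called when the model is unloaded; nothing to clean up here.
        pass
```
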
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries](docs/client_libraries.md)
- [Client Examples](docs/client_examples.md)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

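As a concrete illustration of the layout, the hedged sketch below
creates a minimal model repository for a hypothetical ONNX model. The
model name, tensor names, data types, and dimensions are assumptions;
[model repository](docs/model_repository.md) and [model
configuration](docs/model_configuration.md) are the authoritative
references.

```python
# Hedged sketch: build a minimal model repository on disk. The layout is
# <repository>/<model-name>/<version>/<model-file> plus an optional
# config.pbtxt alongside the version directories.
from pathlib import Path

repo = Path("model_repository")
version_dir = repo / "my_onnx_model" / "1"  # version "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Place your serialized model in the version directory, e.g.
#   shutil.copy("exported.onnx", version_dir / "model.onnx")

# Minimal config.pbtxt for the hypothetical model; the dynamic_batching
# stanza enables Triton's dynamic batcher for this model.
config = """\
name: "my_onnx_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
dynamic_batching { }
"""
(repo / "my_onnx_model" / "config.pbtxt").write_text(config)
```
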
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](docs/client_libraries.md) provide
[APIs](docs/client_libraries.md#client-library-apis) to simplify this
communication. There are also a large number of [client
examples](docs/client_examples.md) that demonstrate how to use the
libraries. You can also send HTTP/REST requests directly to Triton
using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](docs/client_libraries.md).

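For instance, a hedged sketch of an inference request using the Python
HTTP client might look like the following. The model name, tensor
names, shape, and data type are assumptions, and the exact client
package name for your release is documented in the [client
libraries](docs/client_libraries.md) documentation.

```python
# Hedged sketch: send one inference request with the Python HTTP client.
import numpy as np
import tritonclient.http as httpclient  # module name may vary by release

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor expected by the (assumed) model configuration.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("INPUT0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)

outputs = [httpclient.InferRequestedOutput("OUTPUT0")]
result = client.infer(model_name="my_onnx_model",
                      inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT0").shape)
```
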
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

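As a small example, the sketch below scrapes the metrics endpoint
directly. It assumes the default metrics port 8002 and filters for one
metric family; the full set of metric names is documented in
[Metrics](docs/metrics.md).

```python
# Hedged sketch: read Triton's Prometheus-format metrics over HTTP.
import requests

metrics_text = requests.get("http://localhost:8002/metrics").text

# Print only the successful-inference-count samples as a sanity check.
for line in metrics_text.splitlines():
    if line.startswith("nv_inference_request_success"):
        print(line)
```
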
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment, a simple example of how to [deploy
Triton using Kubernetes and Helm](deploy/single_server/README.rst) may
be helpful.

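A hedged sketch of explicit loading and unloading through the Python
client follows. It assumes the server was started with explicit model
control enabled (see [model management](docs/model_management.md)) and
uses a hypothetical model name.

```python
# Hedged sketch: explicitly load and unload a model on a running server.
import tritonclient.http as httpclient  # module name may vary by release

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_onnx_model")
print(client.is_model_ready("my_onnx_model"))  # True once loading succeeds

client.unload_model("my_onnx_model")
```
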
The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving from
version 1 to version 2 of Triton.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

Starting with the r20.10 release, it is also possible to [create a
Docker image containing a customized Triton](docs/compose.md) that
contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](docs/client_libraries.md) that make it easy to communicate with
the server. There are also a large number of [example
clients](docs/client_examples.md) that demonstrate how to use the
libraries. You can also develop your own clients that directly
communicate with Triton using [HTTP/REST or GRPC
protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or it can be custom
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or [Python](https://github.com/triton-inference-server/python_backend)
code for performing any operation. You can even extend Triton by
[writing your own
backend](https://github.com/triton-inference-server/backend).

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the [Stack Overflow document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem.
  The less time we spend on reproducing problems the more time we have
  to fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.