# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports HTTP/REST and GRPC
protocols that allow remote clients to request inferencing for any
model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an application.

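As a quick illustration of the HTTP/REST protocol, the hedged sketch below checks server and model health with the Python client library (`pip install tritonclient[http]`). It assumes Triton is listening on the default HTTP port 8000 and that a model named `densenet_onnx` (the quickstart example model) is loaded; adjust both for your deployment.

```python
# Minimal health-check sketch; URL and model name are assumptions.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Health endpoints defined by the KFServing v2 protocol.
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("densenet_onnx"))
```
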
The current release of the Triton Inference Server is 2.11.0 and
corresponds to the 21.06 release of the tritonserver container on
[NVIDIA GPU Cloud (NGC)](https://ngc.nvidia.com). The branch for this
release is
[r21.06](https://github.com/triton-inference-server/server/tree/r21.06).

## What's New in 2.11.0

* The [Forest Inference Library (FIL)](https://github.com/triton-inference-server/fil_backend)
  backend is added to Triton. The FIL backend allows forest models trained by
  several popular machine learning frameworks (including XGBoost, LightGBM,
  Scikit-Learn, and cuML) to be deployed in Triton.

* The Windows version of Triton now includes the
  [OpenVINO backend](https://github.com/triton-inference-server/openvino_backend).

* The Performance Analyzer (perf_analyzer) now supports testing against the
  Triton C API.

* The Python backend now allows the use of conda to create a unique execution
  environment for your Python model. See
  https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments.

* Python models that crash or exit unexpectedly are now automatically restarted
  by Triton.

* Model repositories in S3 storage can now be accessed using the HTTPS protocol.
  See https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#s3
  for more information.

* Triton now collects GPU metrics for MIG partitions.

* Passive model instances can now be specified in the model configuration. A
  passive model instance will be loaded and initialized by Triton, but no
  inference requests will be sent to the instance. Passive instances are
  typically used by a custom backend that uses its own mechanisms to distribute
  work to the passive instances. See the ModelInstanceGroup section of
  [model_config.proto](https://github.com/triton-inference-server/common/blob/r21.06/protobuf/model_config.proto)
  for the setting, and the configuration sketch after this list.

* NVDLA support is added to the TensorRT backend.

* ONNX Runtime version updated to 1.8.0.

* Windows build documentation simplified and improved.

* Improved detailed and summary reports in Model Analyzer.

* Added an offline mode to Model Analyzer.

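To illustrate the passive-instance setting mentioned above, the hedged sketch below writes a `config.pbtxt` for a hypothetical custom-backend model with one regular instance and one passive instance. The model name, backend name, and repository path are placeholders, and the field names should be verified against the ModelInstanceGroup message in model_config.proto for your Triton version.

```python
# Hedged sketch only: names and paths are placeholders, not part of this README.
from pathlib import Path

config = """
name: "my_custom_model"        # placeholder model name
backend: "my_custom_backend"   # placeholder custom backend
instance_group [
  { count: 1  kind: KIND_GPU },                 # serves inference requests
  { count: 1  kind: KIND_GPU  passive: true }   # loaded, but never scheduled by Triton
]
"""

model_dir = Path("model_repository/my_custom_model")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(config.lstrip())
```
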
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference. A configuration sketch that enables dynamic batching
  appears after this list.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). A Triton
  *ensemble* represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

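The dynamic batching feature described above is enabled per model in its configuration. The hedged sketch below writes a minimal `config.pbtxt` for a hypothetical ONNX model; the model name, batch sizes, and queue delay are illustrative, and the field names should be confirmed against model_config.proto.

```python
# Hedged sketch only: model name and values are placeholders.
from pathlib import Path

config = """
name: "resnet50_onnx"          # placeholder model name
platform: "onnxruntime_onnx"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]       # batch sizes the scheduler tries to form
  max_queue_delay_microseconds: 100    # how long a request may wait for a batch
}
"""

model_dir = Path("model_repository/resnet50_onnx")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(config.lstrip())
```
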
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries and Examples](https://github.com/triton-inference-server/client)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

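The hedged sketch below builds the kind of layout the model repository documentation describes: one directory per model, a numeric version subdirectory holding the model file, and an optional `config.pbtxt`. The repository path, model name, and source model file are placeholders.

```python
# Hedged sketch only: paths and names are placeholders, not from the docs above.
from pathlib import Path
import shutil

repo = Path("my_models")                    # model repository root passed to Triton
version_dir = repo / "densenet_onnx" / "1"  # <model-name>/<version>/
version_dir.mkdir(parents=True, exist_ok=True)

# Place the trained model inside the version directory; ONNX models are
# conventionally named model.onnx.
shutil.copy("densenet-12.onnx", version_dir / "model.onnx")

# A minimal model configuration; some model formats can be auto-configured
# by Triton, in which case this file may be optional.
(repo / "densenet_onnx" / "config.pbtxt").write_text(
    'name: "densenet_onnx"\nplatform: "onnxruntime_onnx"\n'
)
```
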
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](https://github.com/triton-inference-server/client) provide
APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also send
HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).

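As a concrete illustration, the hedged sketch below sends one inference request with the Python HTTP client (`pip install tritonclient[http]`). The model name, tensor names, and shapes are placeholders; substitute the values from your model's configuration.

```python
# Hedged sketch only: "my_model", "INPUT__0", and "OUTPUT__0" are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach data from a NumPy array.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Request a specific output tensor by name.
infer_output = httpclient.InferRequestedOutput("OUTPUT__0")

response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[infer_output],
)
print(response.as_numpy("OUTPUT__0").shape)
```
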
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

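The Prometheus metrics mentioned above can be read with nothing but the standard library. The sketch below assumes the default metrics port 8002 and filters on metric name prefixes that may differ across Triton versions; see docs/metrics.md for the authoritative list.

```python
# Hedged sketch only: port 8002 and the metric name prefixes are assumptions.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    text = resp.read().decode("utf-8")

# Print a few GPU utilization and inference count samples.
for line in text.splitlines():
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)
```
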
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment, there are simple examples of how
to deploy Triton using Kubernetes and Helm, one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).

The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton after previously using version 1.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Repository Agents](docs/repository_agents.md)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it
easy to communicate with the server. There are also a large number of
[example clients](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also develop your
own clients that directly communicate with Triton using [HTTP/REST or
GRPC protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

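Communicating with Triton without the client libraries is straightforward over the KFServing v2 HTTP/REST protocol. The hedged sketch below uses only the Python standard library; the model name, tensor name, shape, and data are placeholders.

```python
# Hedged sketch only: "my_model" and "INPUT__0" are placeholders.
import json
import urllib.request

request_body = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

req = urllib.request.Request(
    "http://localhost:8000/v2/models/my_model/infer",
    data=json.dumps(request_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```
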
A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

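For the Python route, a backend model is a `model.py` placed in the model's version directory. The hedged sketch below shows the general shape of that file; the tensor names are placeholders and the authoritative interface is documented in the python_backend repository linked above.

```python
# Hedged sketch only: "INPUT0" and "OUTPUT0" are placeholder tensor names.
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is a JSON string of the model configuration.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Return exactly one response per request in the batch.
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor("OUTPUT0", in_tensor.as_numpy() * 2)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded.
        pass
```
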
A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.