# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
GRPC protocol that allows remote clients to request inferencing for
any model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an
application.

## What's New in 2.9.0

* Python backend performance has been significantly improved.

* ONNX Runtime updated to version 1.7.1.

* Triton Server is now available as a GKE Marketplace Application, see
  https://github.com/triton-inference-server/server/tree/master/deploy/gke-marketplace-app.

* The GRPC client libraries now allow compression to be enabled (see
  the sketch after this list).

* Ragged batching is now supported for TensorFlow models.

* For TensorFlow models represented with SavedModel format, it is now possible
  to choose which graph and signature_def to load. See
  https://github.com/triton-inference-server/tensorflow_backend/tree/r21.04#parameters.

* A Helm Chart example is added for AWS. See
  https://github.com/triton-inference-server/server/tree/master/deploy/aws.

* The Model Control API is enhanced to provide an option when unloading an
  ensemble model. The option allows all contained models to be unloaded as part
  of unloading the ensemble. See
  https://github.com/triton-inference-server/server/blob/master/docs/protocol/extension_model_repository.md#model-repository-extension.

* Model reloading using the Model Control API previously resulted in the model
  being unavailable for a short period of time. This is now fixed so that the
  model remains available during reloading.

* Latency statistics and metrics for TensorRT models are fixed. Previously the
  sum of the "compute input", "compute infer" and "compute output" times
  accurately indicated the entire compute time, but the total time could be
  incorrectly attributed across the three components. This incorrect attribution
  is now fixed and all values are now accurate.

* Error reporting is improved for the Azure, S3 and GCS cloud file system
  support.

* Trace support for ensembles is fixed. The models contained within an ensemble
  are now traced correctly.

* Model Analyzer improvements:

  * Summary report now includes GPU power usage.
  * Model Analyzer will find the top N model configurations across multiple models.

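To illustrate the new GRPC compression option, here is a minimal Python
sketch. The model name, tensor names, shapes, and the
`compression_algorithm` keyword are assumptions based on the `tritonclient`
GRPC client; check the client library documentation for the exact API in
your release.

```python
# Hypothetical sketch: enable on-the-wire compression for a GRPC inference
# request. Assumes "pip install tritonclient[grpc]" and a model named
# "my_model" with a single FP32 input/output; these names are placeholders.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [grpcclient.InferInput("INPUT0", [1, 16], "FP32")]
inputs[0].set_data_from_numpy(np.zeros((1, 16), dtype=np.float32))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]

# The compression_algorithm argument is assumed to accept "gzip" or "deflate".
response = client.infer(
    model_name="my_model",
    inputs=inputs,
    outputs=outputs,
    compression_algorithm="gzip",
)
print(response.as_numpy("OUTPUT0").shape)
```
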
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton
  (see the sketch after this list).

* [Model pipelines](docs/architecture.md#ensemble-models). Triton
  *ensembles* represent a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

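To make the Python route of the backend API concrete, the following is a
minimal sketch of a Python backend model. It assumes the
`triton_python_backend_utils` module provided by the python_backend
repository; the tensor names and the "multiply by two" logic are made up for
illustration and must match your model configuration.

```python
# Hypothetical model.py for the Triton Python backend. The module
# triton_python_backend_utils is provided by Triton at runtime; INPUT0/OUTPUT0
# and the doubling logic are illustrative placeholders only.
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model configuration as a JSON string, among other
        # per-instance information.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy() * 2)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        # One response must be returned for each request, in the same order.
        return responses

    def finalize(self):
        pass
```

The resulting model.py is placed in a version directory of the model
repository like any other model artifact.
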
## Documentation

**The master branch documentation tracks the upcoming,
under-development release and so may not be accurate for the current
release of Triton. See the [r21.03
documentation](https://github.com/triton-inference-server/server/tree/r21.03#documentation)
for the current release.**

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries](docs/client_libraries.md)
- [Client Examples](docs/client_examples.md)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

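Once a server is running, you can confirm from Python that it is live and
ready before sending it work. This is a minimal sketch assuming the
`tritonclient` package and Triton's default HTTP port (8000); the model name
is a placeholder.

```python
# Minimal health-check sketch. Assumes "pip install tritonclient[http]" and a
# Triton server listening on its default HTTP port 8000.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
# A specific model can also be checked; "densenet_onnx" is just an example name.
print("model ready: ", client.is_model_ready("densenet_onnx"))
```
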
The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

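As a rough illustration of what a model repository looks like, the sketch
below builds the directory layout for a single ONNX model. The model name,
version number, and file name are placeholders; the full layout rules are in
the [model repository](docs/model_repository.md) documentation.

```python
# Hypothetical sketch of a minimal model repository layout, created with
# pathlib. The resulting tree is:
#
#   model_repository/
#   └── my_model/
#       ├── config.pbtxt      (optional for some backends)
#       └── 1/
#           └── model.onnx
#
# "my_model", version "1", and "model.onnx" are placeholder names.
from pathlib import Path

repo = Path("model_repository")
version_dir = repo / "my_model" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# The model artifact goes in the numeric version directory; the optional
# config.pbtxt sits next to it, at the model level.
(repo / "my_model" / "config.pbtxt").touch()
(version_dir / "model.onnx").touch()

# The server is then pointed at the repository with
# --model-repository=/full/path/to/model_repository.
```
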
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](docs/client_libraries.md) provide
[APIs](docs/client_libraries.md#client-library-apis) to simplify this
communication. There are also a large number of [client
examples](docs/client_examples.md) that demonstrate how to use the
libraries. You can also send HTTP/REST requests directly to Triton
using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](docs/client_libraries.md).

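For a sense of what the Python client library looks like in practice, here is
a hedged sketch of a single HTTP inference request. The model name, tensor
names, shapes, and datatype are placeholders that must match your model's
configuration.

```python
# Hypothetical inference sketch using the Python HTTP client
# ("pip install tritonclient[http]"). The model "my_model" and its
# INPUT0/OUTPUT0 tensors are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach data from a NumPy array.
input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

# Request a specific output tensor by name.
output0 = httpclient.InferRequestedOutput("OUTPUT0")

response = client.infer(model_name="my_model", inputs=[input0], outputs=[output0])
print(response.as_numpy("OUTPUT0"))
```
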
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

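To give a flavor of the metrics endpoint, the following sketch pulls the raw
Prometheus text exposition from a locally running server. It assumes Triton's
default metrics port (8002) and uses only the Python standard library; the
metric name shown is one example of the counters described in
[Metrics](docs/metrics.md).

```python
# Hypothetical sketch: fetch Prometheus-format metrics from a local Triton
# server. Assumes the default metrics port 8002; in production you would
# normally let Prometheus scrape this endpoint instead.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    text = resp.read().decode("utf-8")

# Print only the successful-inference counter lines as a quick sanity check.
for line in text.splitlines():
    if "nv_inference_request_success" in line:
        print(line)
```
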
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment there are simple examples of how
to deploy Triton using Kubernetes and Helm, one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).

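As a sketch of explicit model management from the Python client, the snippet
below loads and unloads a model by name. It assumes the server was started
with explicit model control enabled (`--model-control-mode=explicit`); the
model name is a placeholder.

```python
# Hypothetical model-management sketch. Requires a server started with
# --model-control-mode=explicit; "my_model" is a placeholder name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_model")
print("loaded:", client.is_model_ready("my_model"))

client.unload_model("my_model")
print("still loaded:", client.is_model_ready("my_model"))
```
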
The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton from previously using version 1.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Repository Agents](docs/repository_agents.md)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](docs/client_libraries.md) that make it easy to communicate with
the server. There are also a large number of [example
clients](docs/client_examples.md) that demonstrate how to use the
libraries. You can also develop your own clients that directly
communicate with Triton using [HTTP/REST or GRPC
protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

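If you prefer to talk to Triton without any client library, requests can be
sent straight to the KFServing v2 HTTP/REST endpoints. The sketch below uses
only the Python standard library; the model name, tensor names, shapes, and
data are placeholders.

```python
# Hypothetical raw HTTP/REST sketch against the KFServing v2 protocol,
# standard library only. "my_model" and its tensors are placeholders.
import json
import urllib.request

# Readiness check: GET /v2/health/ready returns 200 when the server is ready.
urllib.request.urlopen("http://localhost:8000/v2/health/ready")

# Inference: POST /v2/models/<model>/infer with a JSON request body.
body = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ],
    "outputs": [{"name": "OUTPUT0"}],
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/my_model/infer",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["outputs"])
```
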
A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow guide on minimal, complete, and verifiable examples
(https://stackoverflow.com/help/mcve). Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  whether you can strip external dependencies and still show the
  problem. The less time we spend reproducing problems, the more time
  we have to fix them.

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.