# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
GRPC protocol that allows remote clients to request inferencing for
any model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an application.

## What's New in 2.18.0

* Triton CPU-only build now supports the [TensorFlow2 backend for Linux
  x86](docs/build.md#cpu-only-container).

* [Implicit state management](docs/architecture.md#implicit-state-management)
  can be used for the ONNX Runtime and TensorRT backends.

* [State initialization](docs/architecture.md#state-initialization) from a
  constant is now supported in implicit state management.

* PyTorch and TensorFlow models now support batching on Inferentia.

* PyTorch and Python backends are now supported on Jetson.

* ARM support has been added for the Performance Analyzer and Model Analyzer.

## Features

* [Deep learning
  frameworks](https://github.com/triton-inference-server/backend).
  Triton supports TensorRT, TensorFlow GraphDef, TensorFlow
  SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT, ONNX-TensorRT
  and PyTorch-TensorRT integrated models.

* [Machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend).
  Triton supports popular machine learning frameworks such as XGBoost,
  LightGBM, Scikit-Learn and cuML using the [RAPIDS Forest Inference
  Library](https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35).

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Triton
  can simultaneously run multiple models (or multiple instances of the
  same model) using the same or different deep-learning and
  machine-learning frameworks.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from full CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* Model pipelines using
  [Ensembling](docs/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting).
  A Triton *ensemble* represents a pipeline of one or more models and
  the connection of input and output tensors between those
  models. *BLS* allows a pipeline along with extra business logic to
  be represented in Python. In both cases a single inference request
  will trigger the execution of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2).
  A short example request appears after this list.

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

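To make the HTTP/REST protocol mentioned above concrete, the following is a
minimal sketch of a KServe-style inference request sent with Python's
`requests` package. The model name, tensor names, shape and datatype are
hypothetical placeholders, and Triton's default HTTP port of 8000 is assumed.

```python
# Minimal sketch of a KServe v2 HTTP/REST inference request.
# Assumes Triton is listening on its default HTTP port (8000) and is
# serving a hypothetical model "my_model" with one FP32 input and output.
import requests

payload = {
    "inputs": [
        {
            "name": "INPUT0",          # hypothetical input tensor name
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],
        }
    ],
    "outputs": [{"name": "OUTPUT0"}],  # hypothetical output tensor name
}

resp = requests.post(
    "http://localhost:8000/v2/models/my_model/infer", json=payload
)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"])
```

The same request can also be issued over GRPC using the client libraries or
stubs generated from the KServe protocol definition.
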
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

* [QuickStart](docs/quickstart.md)
  * [Install Triton](docs/quickstart.md#install-triton-docker-image)
  * [Create Model Repository](docs/quickstart.md#create-a-model-repository)
  * [Run Triton](docs/quickstart.md#run-triton)
* [Model Repository](docs/model_repository.md)
  * [Cloud Storage](docs/model_repository.md#model-repository-locations)
  * [File Organization](docs/model_repository.md#model-files)
  * [Model Versioning](docs/model_repository.md#model-versions)
* [Model Configuration](docs/model_configuration.md)
  * [Required Model Configuration](docs/model_configuration.md#minimal-model-configuration)
    * [Maximum Batch Size - Batching and Non-Batching Models](docs/model_configuration.md#maximum-batch-size)
    * [Input and Output Tensors](docs/model_configuration.md#inputs-and-outputs)
      * [Tensor Datatypes](docs/model_configuration.md#datatypes)
      * [Tensor Reshape](docs/model_configuration.md#reshape)
      * [Shape Tensor](docs/model_configuration.md#shape-tensors)
  * [Auto-Generate Required Model Configuration](docs/model_configuration.md#auto-generated-model-configuration)
  * [Version Policy](docs/model_configuration.md#version-policy)
  * [Instance Groups](docs/model_configuration.md#instance-groups)
    * [Specifying Multiple Model Instances](docs/model_configuration.md#multiple-model-instances)
    * [CPU and GPU Instances](docs/model_configuration.md#cpu-model-instance)
    * [Configuring Rate Limiter](docs/model_configuration.md#rate-limiter-configuration)
  * [Optimization Settings](docs/model_configuration.md#optimization_policy)
    * [Framework-Specific Optimization](docs/optimization.md#framework-specific-optimization)
      * [ONNX-TensorRT](docs/optimization.md#onnx-with-tensorrt-optimization)
      * [ONNX-OpenVINO](docs/optimization.md#onnx-with-openvino-optimization)
      * [TensorFlow-TensorRT](docs/optimization.md#tensorflow-with-tensorrt-optimization)
      * [TensorFlow-Mixed-Precision](docs/optimization.md#tensorflow-automatic-fp16-optimization)
    * [NUMA Optimization](docs/optimization.md#numa-optimization)
  * [Scheduling and Batching](docs/model_configuration.md#scheduling-and-batching)
    * [Default Scheduler - Non-Batching](docs/model_configuration.md#default-scheduler)
    * [Dynamic Batcher](docs/model_configuration.md#dynamic-batcher)
      * [How to Configure Dynamic Batcher](docs/model_configuration.md#recommended-configuration-process)
        * [Delayed Batching](docs/model_configuration.md#delayed-batching)
        * [Preferred Batch Size](docs/model_configuration.md#preferred-batch-sizes)
      * [Preserving Request Ordering](docs/model_configuration.md#preserve-ordering)
      * [Priority Levels](docs/model_configuration.md#priority-levels)
      * [Queuing Policies](docs/model_configuration.md#queue-policy)
      * [Ragged Batching](docs/ragged_batching.md)
    * [Sequence Batcher](docs/model_configuration.md#sequence-batcher)
      * [Stateful Models](docs/architecture.md#stateful-models)
        * [Control Inputs](docs/architecture.md#control-inputs)
        * [Implicit State - Stateful Inference Using a Stateless Model](docs/architecture.md#implicit-state-management)
      * [Sequence Scheduling Strategies](docs/architecture.md#scheduling-strateties)
        * [Direct](docs/architecture.md#direct)
        * [Oldest](docs/architecture.md#oldest)
  * [Rate Limiter](docs/rate_limiter.md)
  * [Model Warmup](docs/model_configuration.md#model-warmup)
  * [Inference Request/Response Cache](docs/model_configuration.md#response-cache)
* Model Pipeline
  * [Model Ensemble](docs/architecture.md#ensemble-models)
  * [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
* [Model Management](docs/model_management.md)
  * [Explicit Model Loading and Unloading](docs/model_management.md#model-control-mode-explicit)
  * [Modifying the Model Repository](docs/model_management.md#modifying-the-model-repository)
* [Metrics](docs/metrics.md)
* [Framework Custom Operations](docs/custom_operations.md)
  * [TensorRT](docs/custom_operations.md#tensorrt)
  * [TensorFlow](docs/custom_operations.md#tensorflow)
  * [PyTorch](docs/custom_operations.md#pytorch)
  * [ONNX](docs/custom_operations.md#onnx)
* [Client Libraries and Examples](https://github.com/triton-inference-server/client)
  * [C++ HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis)
  * [Python HTTP/GRPC Libraries](https://github.com/triton-inference-server/client#client-library-apis)
  * [Java HTTP Library](https://github.com/triton-inference-server/client/src/java)
  * GRPC Generated Libraries
    * [go](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/go)
    * [Java/Scala](https://github.com/triton-inference-server/client/tree/main/src/grpc_generated/java)
* [Performance Analysis](docs/optimization.md)
  * [Model Analyzer](docs/model_analyzer.md)
  * [Performance Analyzer](docs/perf_analyzer.md)
  * [Inference Request Tracing](docs/trace.md)
* [Jetson and JetPack](docs/jetson.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

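Once the quickstart server is running, a quick way to confirm it is reachable
is to query its health and model-repository endpoints. The sketch below uses
the `tritonclient` Python package from the client repository linked above
(installable with `pip install tritonclient[http]`); the URL assumes Triton's
default local HTTP port.

```python
# Check server liveness/readiness and list the models Triton has loaded.
# Assumes Triton is running locally on its default HTTP port (8000).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("live:", client.is_server_live())
print("ready:", client.is_server_ready())

# List every model Triton has discovered in the model repository.
for model in client.get_model_repository_index():
    print(model["name"], model.get("state", ""))
```
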
The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

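As a sketch of what a repository looks like on disk, the following creates a
minimal layout for a single ONNX model. The model name, tensor names and
dimensions are hypothetical placeholders; see the model repository and model
configuration documentation for the full set of options.

```python
# Sketch of a minimal model repository layout:
#   model_repository/
#   └── my_model/
#       ├── config.pbtxt
#       └── 1/
#           └── model.onnx
# "my_model", "my_model.onnx" and the tensor names are hypothetical.
from pathlib import Path
import shutil

repo = Path("model_repository")
version_dir = repo / "my_model" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Copy a previously exported ONNX model into version directory "1".
shutil.copy("my_model.onnx", version_dir / "model.onnx")

# A minimal configuration; many models can instead rely on Triton's
# auto-generated model configuration.
(repo / "my_model" / "config.pbtxt").write_text(
    """name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
"""
)
```
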
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](https://github.com/triton-inference-server/client) provide
APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also send
HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).

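As a quick illustration of the client-library route, a single inference
request with the Python HTTP client might look like the following sketch. The
model name, tensor names, shape and datatype are hypothetical and would need
to match your model's configuration.

```python
# Sketch of a single inference request using the Python HTTP client
# (pip install tritonclient[http]). Model and tensor names are
# hypothetical placeholders that must match your model configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare one FP32 input tensor of shape [1, 4].
input0 = httpclient.InferInput("INPUT0", [1, 4], "FP32")
input0.set_data_from_numpy(np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32))

# Request one output tensor by name.
output0 = httpclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="my_model", inputs=[input0], outputs=[output0])
print(result.as_numpy("OUTPUT0"))
```
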
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You can also enable cross-model prioritization using
the [rate limiter](docs/rate_limiter.md), which manages the rate at
which requests are scheduled on model instances. You may also want to
consider combining multiple models and pre/post-processing into a
pipeline using [ensembling](docs/architecture.md#ensemble-models) or
[Business Logic Scripting
(BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting). A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

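Because the metrics endpoint serves plain Prometheus text, it can be inspected
with nothing more than an HTTP GET. The sketch below assumes Triton's default
metrics port of 8002 and filters a few inference-related lines; the exact
metric names are best taken from the metrics documentation.

```python
# Fetch the Prometheus-format metrics that Triton exposes and print the
# inference-related lines. Assumes the default metrics port (8002).
import requests

text = requests.get("http://localhost:8002/metrics").text

for line in text.splitlines():
    # Triton metric names are prefixed with "nv_"; comment lines start with "#".
    if line.startswith("nv_inference"):
        print(line)
```
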
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment, there are simple examples of how
to deploy Triton using Kubernetes and Helm:
[GCP](deploy/gcp/README.md), [AWS](deploy/aws/README.md), and [NVIDIA
FleetCommand](deploy/fleetcommand/README.md).

The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton from previously using version 1.

### Developer Documentation

* [Build](docs/build.md)
* [Protocols and APIs](docs/inference_protocols.md)
* [Backends](https://github.com/triton-inference-server/backend)
* [Repository Agents](docs/repository_agents.md)
* [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it
easy to communicate with the server. There are also a large number of
[example clients](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also develop your
own clients that directly communicate with Triton using [HTTP/REST or
GRPC protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

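As a rough sketch of what the Python route looks like, a Python-backend model
is a single `model.py` file implementing the `TritonPythonModel` interface.
The tensor names below are hypothetical, and the
`triton_python_backend_utils` module is supplied by Triton at runtime; see the
python_backend repository for the authoritative interface.

```python
# Minimal sketch of a Python-backend model (model.py). The
# triton_python_backend_utils module is provided by Triton when the
# model is loaded; tensor names here are hypothetical placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args contains the model name, configuration, device, and so on.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Triton may pass a batch of requests; return one response per request.
        responses = []
        for request in requests:
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Toy computation: double the input, keeping the FP32 dtype.
            result = np.asarray(in0.as_numpy() * 2.0, dtype=np.float32)
            out = pb_utils.Tensor("OUTPUT0", result)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        # Optional cleanup when the model is unloaded.
        pass
```
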
A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

* [Deploying NVIDIA Triton at Scale with MIG and Kubernetes](https://developer.nvidia.com/blog/deploying-nvidia-triton-at-scale-with-mig-and-kubernetes/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that is not modifying the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow document on [minimal, complete, verifiable
examples](https://stackoverflow.com/help/mcve). Ensure posted examples
are:

* minimal – use as little code as possible that still produces the
  same problem.

* complete – provide all parts needed to reproduce the problem. Check
  whether you can strip external dependencies and still show the
  problem. The less time we spend on reproducing problems, the more
  time we have to fix them.

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.