NVIDIA TensorRT Inference Server
================================

**NOTICE: The r19.07 branch has converted to using CMake
to build the server, clients and other artifacts. Read the new
documentation carefully to understand the new** `build process
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/build.html>`_.

.. overview-begin-marker-do-not-remove

The NVIDIA TensorRT Inference Server provides a cloud inferencing
solution optimized for NVIDIA GPUs. The server provides an inference
service via an HTTP or GRPC endpoint, allowing remote clients to
request inferencing for any model being managed by the server.

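The commands below are a hedged sketch of exercising that endpoint
with curl, assuming a locally running server on the default HTTP port
8000 and a hypothetical model named my_model; the HTTP/GRPC API
documentation defines the actual endpoint paths::

  # Overall server status, including every model being served.
  curl localhost:8000/api/status

  # Status for a single (hypothetical) model.
  curl localhost:8000/api/status/my_model
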

What's New In 1.4.0
-------------------

* Added libtorch as a new backend. PyTorch models manually decorated
  or automatically traced to produce TorchScript can now be run
  directly by the inference server (a tracing sketch follows this
  list).

* Build system converted from Bazel to CMake. The new CMake-based
  build system is more transparent, portable and modular.

* To simplify the creation of custom backends, a Custom Backend SDK
  and improved documentation are now available.

* Improved AsyncRun API in C++ and Python client libraries.

* perf_client can now use user-supplied input data (previously
  perf_client could only use random or zero input data).

* perf_client now reports latency at multiple confidence percentiles
  (p50, p90, p95, p99) as well as a user-supplied percentile that is
  also used to stabilize latency results.

* Improvements to automatic model configuration creation
  (-\\-strict-model-config=false).

* C++ and Python client libraries now allow additional HTTP headers to
  be specified when using the HTTP protocol.

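As a hedged illustration of the new libtorch backend, the sketch below
uses the standard PyTorch torch.jit.trace API to produce a TorchScript
file; the model choice, input shape and suggested repository location
are only examples and should be checked against the model repository
documentation::

  import torch
  import torchvision.models as models

  # Any traceable torch.nn.Module works; resnet50 is just an example.
  model = models.resnet50(pretrained=True).eval()

  # Trace with a representative input to produce a TorchScript module.
  example = torch.randn(1, 3, 224, 224)
  traced = torch.jit.trace(model, example)

  # Save the TorchScript file so it can be copied into a version
  # directory of the model repository (for example
  # <model-repository>/my_model/1/model.pt).
  traced.save("model.pt")
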

Features
--------

* `Multiple framework support
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#framework-model-definition>`_. The
  server can manage any number and mix of models (limited by system
  disk and memory resources). Supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch, and Caffe2 NetDef model
  formats. Also supports TensorFlow-TensorRT integrated
  models. Variable-size input and output tensors are allowed if
  supported by the framework. See `Capabilities
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/capabilities.html#capabilities>`_
  for detailed support information for each framework.

* `Concurrent model execution support
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#instance-groups>`_. Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU (see the configuration sketch after
  this feature list).

* Batching support. For models that support batching, the server can
  accept requests for a batch of inputs and respond with the
  corresponding batch of outputs. The inference server also supports
  multiple `scheduling and batching
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#scheduling-and-batching>`_
  algorithms that combine individual inference requests together to
  improve inference throughput. These scheduling and batching
  decisions are transparent to the client requesting inference (see
  the configuration sketch after this feature list).

* `Custom backend support
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#custom-backends>`_. The inference server
  allows individual models to be implemented with custom backends
  instead of by a deep-learning framework. With a custom backend a
  model can implement any logic desired, while still benefiting from
  the GPU support, concurrent execution, dynamic batching and other
  features provided by the server.

* `Ensemble support
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/models_and_schedulers.html#ensemble-models>`_. An
  ensemble represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* Multi-GPU support. The server can distribute inferencing across all
  system GPUs.

* The inference server `monitors the model repository
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#modifying-the-model-repository>`_
  for any change and dynamically reloads the model(s) when necessary,
  without requiring a server restart. Models and model versions can be
  added and removed, and model configurations can be modified while
  the server is running.

* `Model repositories
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_repository.html#>`_
  may reside on a locally accessible file system (e.g. NFS) or in
  Google Cloud Storage.

* Readiness and liveness `health endpoints
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/http_grpc_api.html#health>`_
  suitable for any orchestration or deployment framework, such as
  Kubernetes.

* `Metrics
  <https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/metrics.html>`_
  indicating GPU utilization, server throughput, and server latency.

.. overview-end-marker-do-not-remove

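The following is a hedged sketch of the model repository and per-model
configuration referenced in the feature list above; the model name,
tensor names, dimensions and values are hypothetical, and the model
configuration documentation and model_config.proto are the
authoritative references. A repository is a directory with one
sub-directory per model, each holding a config.pbtxt and one or more
numbered version directories::

  <model-repository-path>/
    my_model/
      config.pbtxt
      1/
        model.plan

A config.pbtxt that enables both concurrent execution (instance
groups) and dynamic batching might look like::

  name: "my_model"
  platform: "tensorrt_plan"
  max_batch_size: 8
  input [
    {
      name: "INPUT0"
      data_type: TYPE_FP32
      dims: [ 3, 224, 224 ]
    }
  ]
  output [
    {
      name: "OUTPUT0"
      data_type: TYPE_FP32
      dims: [ 1000 ]
    }
  ]
  # Run two instances of the model on each available GPU.
  instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
  ]
  # Let the server combine individual requests into batches, waiting
  # at most 100 microseconds to form a preferred batch size.
  dynamic_batching {
    preferred_batch_size: [ 4, 8 ]
    max_queue_delay_microseconds: 100
  }
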

The current release of the TensorRT Inference Server is 1.4.0 and
corresponds to the 19.07 release of the tensorrtserver container on
`NVIDIA GPU Cloud (NGC) <https://ngc.nvidia.com>`_. The branch for
this release is `r19.07
<https://github.com/NVIDIA/tensorrt-inference-server/tree/r19.07>`_.

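As a hedged quick-start sketch (the documentation remains the
authoritative install and run guide), the commands below pull that
container from NGC, start the server against a placeholder model
repository, and exercise the health and metrics endpoints; the port
numbers and endpoint paths are the commonly documented defaults and
should be confirmed against the HTTP/GRPC API and metrics
documentation::

  # Pull the 19.07 release of the inference server container.
  docker pull nvcr.io/nvidia/tensorrtserver:19.07-py3

  # Run the server (nvidia-docker, or the GPU runtime flags of your
  # Docker install). /path/to/model/repository is a placeholder; a
  # gs:// path can also be given for Google Cloud Storage.
  nvidia-docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 \
    -v /path/to/model/repository:/models \
    nvcr.io/nvidia/tensorrtserver:19.07-py3 \
    trtserver --model-repository=/models

  # Liveness and readiness checks, suitable for Kubernetes probes.
  curl localhost:8000/api/health/live
  curl localhost:8000/api/health/ready

  # Prometheus-format metrics: GPU utilization, throughput, latency.
  curl localhost:8002/metrics
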

Backwards Compatibility
-----------------------

Continuing in version 1.4.0, the following interfaces maintain
backwards compatibility with the 1.0.0 release. If you have model
configuration files, custom backends, or clients that use the
inference server HTTP or GRPC APIs (either directly or through the
client libraries) from releases prior to 1.0.0 (19.03), you should
edit and rebuild those as necessary to match the version 1.0.0 APIs.

These interfaces will maintain backwards compatibility for all future
1.x.y releases (see below for exceptions):

* Model configuration as defined in `model_config.proto
  <https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/model_config.proto>`_.

* The inference server HTTP and GRPC APIs as defined in `api.proto
  <https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/api.proto>`_
  and `grpc_service.proto
  <https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/grpc_service.proto>`_.

* The custom backend interface as defined in `custom.h
  <https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/backends/custom/custom.h>`_.

As new features are introduced they may temporarily have beta status
where they are subject to change in non-backwards-compatible
ways. When they exit beta they will conform to the
backwards-compatibility guarantees described above. Currently the
following features are in beta:

* In the model configuration defined in `model_config.proto
  <https://github.com/NVIDIA/tensorrt-inference-server/blob/master/src/core/model_config.proto>`_
  the sections related to model ensembling are currently in beta. In
  particular, the ModelEnsembling message will potentially undergo
  non-backwards-compatible changes (a sketch of an ensemble
  configuration follows).

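Because the ensembling configuration is the part most likely to change
while in beta, the following hedged sketch shows the general shape of
an ensemble definition; the model names, tensor names and dimensions
are hypothetical and the ensemble documentation is authoritative::

  name: "my_ensemble"
  platform: "ensemble"
  max_batch_size: 8
  input [
    { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] }
  ]
  output [
    { name: "CLASSIFICATION", data_type: TYPE_FP32, dims: [ 1000 ] }
  ]
  ensemble_scheduling {
    step [
      {
        # First run a (hypothetical) preprocessing model...
        model_name: "image_preprocess"
        model_version: -1
        input_map { key: "INPUT", value: "RAW_IMAGE" }
        output_map { key: "OUTPUT", value: "preprocessed_image" }
      },
      {
        # ...then feed its output to a (hypothetical) classifier.
        model_name: "classifier"
        model_version: -1
        input_map { key: "INPUT", value: "preprocessed_image" }
        output_map { key: "OUTPUT", value: "CLASSIFICATION" }
      }
    ]
  }
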

Documentation
-------------

The User Guide, Developer Guide, and API Reference `documentation
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/index.html>`_
provide guidance on installing, building and running the latest
release of the TensorRT Inference Server.

You can also view the documentation for the `master branch
<https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/index.html>`_
and for `earlier releases
<https://docs.nvidia.com/deeplearning/sdk/inference-server-archived/index.html>`_.

READMEs for deployment examples can be found in subdirectories of
deploy/, for example, `deploy/single_server/README.rst
<https://github.com/NVIDIA/tensorrt-inference-server/tree/master/deploy/single_server/README.rst>`_.

The `Release Notes
<https://docs.nvidia.com/deeplearning/sdk/inference-release-notes/index.html>`_
and `Support Matrix
<https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html>`_
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe which GPUs are supported by the inference server.

Other Documentation
^^^^^^^^^^^^^^^^^^^

* `Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server
  <https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server>`_.

* `NVIDIA TensorRT Inference Server Boosts Deep Learning Inference
  <https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/>`_.

* `GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and Kubeflow
  <https://www.kubeflow.org/blog/nvidia_tensorrt/>`_.

Contributing
------------

Contributions to TensorRT Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
the `Contributing <CONTRIBUTING.md>`_ document.

Reporting problems, asking questions
------------------------------------

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem.
  The less time we spend on reproducing problems, the more time we
  have to fix them.

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

.. |License| image:: https://img.shields.io/badge/License-BSD3-lightgrey.svg
   :target: https://opensource.org/licenses/BSD-3-Clause