Releases: triton-inference-server/server
Release 1.2.0, corresponding to NGC container 19.05
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.2.0
- Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline (a configuration sketch follows this list).
- Added a Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.
- The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced by the build is now compatible with both Python2 and Python3.
- The perf_client application now has a --percentile flag that can be used to report latencies instead of reporting average latency (which remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency.
- The perf_client application now has a -z option to use zero-valued input tensors instead of random values (example invocations for the new options follow this list).
- Improved error reporting of incorrect input/output tensor names for TensorRT models.
- Added --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
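As a rough sketch of what the new ensembling support looks like in practice, an ensemble is itself defined by a model configuration (config.pbtxt) whose ensemble_scheduling section wires the composing models together. The model names (image_preprocess, classifier) and tensor names below are hypothetical; consult the ensembling documentation for the authoritative schema.

    # Hypothetical ensemble: preprocess an image, then classify it.
    name: "preprocess_and_classify"
    platform: "ensemble"
    max_batch_size: 8
    input [
      { name: "RAW_INPUT" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] }
    ]
    output [
      { name: "SCORES" data_type: TYPE_FP32 dims: [ 1000 ] }
    ]
    ensemble_scheduling {
      step [
        {
          model_name: "image_preprocess"
          model_version: -1
          input_map { key: "INPUT" value: "RAW_INPUT" }
          output_map { key: "OUTPUT" value: "normalized_image" }
        },
        {
          model_name: "classifier"
          model_version: -1
          input_map { key: "INPUT" value: "normalized_image" }
          output_map { key: "OUTPUT" value: "SCORES" }
        }
      ]
    }

A single request supplying RAW_INPUT causes the server to run both steps and return SCORES.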
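The new perf_client and server options can be exercised roughly as follows. The -m flag and the /models model store path are assumptions used for illustration; the spellings of --percentile, -z, and --allow-gpu-metrics are as listed above.

    # Report the 99th percentile latency instead of the average
    perf_client -m my_model --percentile=99

    # Send zero-valued input tensors instead of random data
    perf_client -m my_model -z

    # Start the server with GPU metric reporting disabled
    trtserver --model-store=/models --allow-gpu-metrics=false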
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0_ubuntu1604.clients.tar.gz and v1.2.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.
Release 1.1.0, corresponding to NGC container 19.04
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.1.0
- Client libraries and examples now build with a separate Makefile (a Dockerfile is also included for convenience).
- Input or output tensors with variable-size dimensions (indicated by -1 in the model configuration) can now represent tensors where the variable dimension has value 0 (zero).
- Zero-sized input and output tensors are now supported for batching models. This enables the inference server to support models that require inputs and outputs that have shape [ batch-size ].
- TensorFlow custom operations (C++) can now be built into the inference server. An example and documentation are included in this release.
Client Libraries and Examples
An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.1.0.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.
Release 1.0.0, corresponding to NGC container 19.03
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 1.0.0
- 1.0.0 is the first GA, non-beta, release of TensorRT Inference Server. See the README for information on backwards-compatibility guarantees for this and future releases.
- Added support for stateful models and backends that require multiple inference requests to be routed to the same model instance/batch slot. The new sequence batcher provides scheduling and batching capabilities for this class of models (a configuration sketch follows this list).
- Added GRPC streaming protocol support for inference requests.
- The HTTP front-end is now asynchronous to enable lower-latency and higher-throughput handling of inference requests.
- Enhanced perf_client to support stateful models and backends.
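A stateful model opts into the new sequence batcher through its model configuration. The sketch below is illustrative only: the control tensor names (START, READY), the idle timeout value, and the int32 false/true encoding are assumptions drawn from the sequence batcher documentation rather than requirements of this release.

    # Route every request in a sequence to the same model instance/batch slot
    sequence_batching {
      max_sequence_idle_microseconds: 5000000
      control_input [
        {
          name: "START"
          control [
            { kind: CONTROL_SEQUENCE_START int32_false_true: [ 0, 1 ] }
          ]
        },
        {
          name: "READY"
          control [
            { kind: CONTROL_SEQUENCE_READY int32_false_true: [ 0, 1 ] }
          ]
        }
      ]
    }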
Client Libraries and Examples
An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.0.0.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.
Release 0.11.0 beta, corresponding to NGC container 19.02
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 0.11.0 Beta
- Variable-size input and output tensor support. Models that accept variable-size input tensors and produce variable-size output tensors are now supported by using a dimension size of -1 in the model configuration for those dimensions that can take on any size (a configuration sketch follows this list).
- String datatype support. For TensorFlow models and custom backends, input and output tensors can contain strings.
- Improved support for non-GPU systems. The inference server will run correctly on systems that do not contain GPUs and that do not have nvidia-docker or CUDA installed.
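The sketch below shows how the two new model-configuration features appear together; the model, platform, and tensor names are hypothetical.

    name: "text_model"
    platform: "tensorflow_savedmodel"
    max_batch_size: 16
    input [
      {
        name: "TEXT"
        data_type: TYPE_STRING   # string tensors (TensorFlow models and custom backends)
        dims: [ -1 ]             # -1 marks a dimension that can take on any size
      }
    ]
    output [
      {
        name: "EMBEDDING"
        data_type: TYPE_FP32
        dims: [ -1, 128 ]
      }
    ]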
Client Libraries and Examples
An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v0.11.0.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.
Release 0.10.0 beta, corresponding to NGC container 19.01
NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server (TRTIS) provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New In 0.10.0 Beta
- Custom backend support. TRTIS allows individual models to be implemented with custom backends instead of by a deep-learning framework. With a custom backend a model can implement any logic desired, while still benefiting from the GPU support, concurrent execution, dynamic batching and other features provided by TRTIS. A model selects a custom backend through its model configuration, as sketched below.
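A minimal sketch of a custom backend's model configuration, assuming a hypothetical shared library named libmycustom.so that implements the custom backend interface:

    name: "my_custom_model"
    platform: "custom"
    default_model_filename: "libmycustom.so"   # shared library implementing the custom backend
    max_batch_size: 8
    input [
      { name: "INPUT" data_type: TYPE_FP32 dims: [ 16 ] }
    ]
    output [
      { name: "OUTPUT" data_type: TYPE_FP32 dims: [ 16 ] }
    ]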
Release 0.9.0 beta, corresponding to NGC container 18.12
NVIDIA TensorRT Inference Server 0.9.0 Beta
The NVIDIA TensorRT Inference Server (TRTIS) provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
What's New in 0.9.0 Beta
- TRTIS now monitors the model repository for any change and dynamically reloads the model when necessary, without requiring a server restart. It is now possible to add and remove model versions, add/remove entire models, modify the model configuration, and modify the model labels while the server is running.
- Added a model priority parameter to the model configuration. Currently the model priority controls the CPU thread priority when executing the model and, for TensorRT models, also controls the CUDA stream priority.
- Fixed a bug in the GRPC API: changed the model version parameter from string to int. This is a non-backwards-compatible change.
- Added --strict-model-config=false option to allow some model configuration properties to be derived automatically. For some model types, this removes the need to provide a config.pbtxt file (an example invocation follows this list).
- Improved performance via an asynchronous GRPC frontend.
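For frameworks whose model files carry enough metadata (for example TensorRT plans and TensorFlow SavedModels), the server can now fill in parts of the configuration itself. The invocation below is a sketch: the trtserver binary name, the --model-store flag, and the /models path are assumptions used for illustration, while --strict-model-config=false is the new option described above.

    # Allow the server to derive model configuration where possible
    trtserver --model-store=/models --strict-model-config=false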
Release 0.8.0 beta, corresponding to NGC container 18.11
v0.8.0: Documented the dependencies needed for the client libraries and examples.