Update README and versions for 21.03 release (#2664)

deadeyegoodwin · web-flow · commit f5b001cd6d1a · 2021-03-26T17:15:08.000-07:00
diff --git a/README.md b/README.md
@@ -30,5 +30,296 @@
 
 # Triton Inference Server
 
-**NOTE: You are currently on the r21.03 branch which tracks stabilization
-towards the next release. This branch is not usable during stabilization.**
+Triton Inference Server provides a cloud and edge inferencing solution
+optimized for both CPUs and GPUs. Triton supports an HTTP/REST and
+GRPC protocol that allows remote clients to request inferencing for
+any model being managed by the server. For edge deployments, Triton is
+available as a shared library with a C API that allows the full
+functionality of Triton to be included directly in an
+application.
+
+## What's New in 2.8.0
+
+* The Windows build now supports ONNX models.
+
+* Repository agent is a new extensibility C API added to Triton that
+  allows implementation of custom authentication, decryption,
+  conversion, or similar operations when a model is loaded. See
+  https://github.com/triton-inference-server/server/blob/master/docs/repository_agents.md
+
+* An OpenVINO backend is added to Triton to enable the execution of
+  OpenVINO models on CPUs.
+  https://github.com/triton-inference-server/openvino_backend
+
+* The PyTorch backend is now maintained in its own repository:
+  https://github.com/triton-inference-server/pytorch_backend
+
+* The ONNX Runtime backend is now maintained in its own repository:
+  https://github.com/triton-inference-server/onnxruntime_backend
+
+* The Jetson release of Triton now supports the system shared-memory
+  protocol between clients and the Triton server.
+
+* SSL/TLS Mutual Authentication support is added to the GRPC client
+  library.
+
+* A new Model Configuration option, "gather_kernel_buffer_threshold",
+  can be specified to instruct Triton to use a CUDA kernel to gather
+  inputs buffers onto the GPU. Using this option can improve inference
+  performance for some models.
+
+* The Python client libraries have been improved to more efficiently
+  create numpy arrays for input and output tensors.
+
+* The client libraries examples have been improved to more clearly
+  describe how string and byte-blob tensors are supported by the
+  Python Client API.
+  https://github.com/triton-inference-server/server/blob/master/docs/client_examples.md
+
+## Features
+
+* [Multiple deep-learning
+  frameworks](https://github.com/triton-inference-server/backend). Triton
+  can manage any number and mix of models (limited by system disk and
+  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
+  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
+  formats. Both TensorFlow 1.x and TensorFlow 2.x are
+  supported. Triton also supports TensorFlow-TensorRT and
+  ONNX-TensorRT integrated models.
+
+* [Concurrent model
+  execution](docs/architecture.md#concurrent-model-execution). Multiple
+  models (or multiple instances of the same model) can run
+  simultaneously on the same GPU or on multiple GPUs.
+
+* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
+  models that support batching, Triton implements multiple scheduling
+  and batching algorithms that combine individual inference requests
+  together to improve inference throughput. These scheduling and
+  batching decisions are transparent to the client requesting
+  inference.
+
+* [Extensible
+  backends](https://github.com/triton-inference-server/backend). In
+  addition to deep-learning frameworks, Triton provides a *backend
+  API* that allows Triton to be extended with any model execution
+  logic implemented in
+  [Python](https://github.com/triton-inference-server/python_backend)
+  or
+  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
+  while still benefiting from the CPU and GPU support, concurrent
+  execution, dynamic batching and other features provided by Triton.
+
+* [Model pipelines](docs/architecture.md#ensemble-models). Triton
+  *ensembles* represents a pipeline of one or more models and the
+  connection of input and output tensors between those models. A
+  single inference request to an ensemble will trigger the execution
+  of the entire pipeline.
+
+* [HTTP/REST and GRPC inference
+  protocols](docs/inference_protocols.md) based on the community
+  developed [KFServing
+  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).
+
+* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
+  linked directly into your application for edge and other in-process
+  use cases.
+
+* [Metrics](docs/metrics.md) indicating GPU utilization, server
+  throughput, and server latency. The metrics are provided in
+  Prometheus data format.
+
+## Documentation
+
+**The master branch documentation tracks the upcoming,
+under-development release and so may not be accurate for the current
+release of Triton. See the [r21.03
+documentation](https://github.com/triton-inference-server/server/tree/r21.03#documentation)
+for the current release.**
+
+[Triton Architecture](docs/architecture.md) gives a high-level
+overview of the structure and capabilities of the inference
+server. There is also an [FAQ](docs/faq.md). Additional documentation
+is divided into [*user*](#user-documentation) and
+[*developer*](#developer-documentation) sections. The *user*
+documentation describes how to use Triton as an inference solution,
+including information on how to configure Triton, how to organize and
+configure your models, how to use the C++ and Python clients, etc. The
+*developer* documentation describes how to build and test Triton and
+also how Triton can be extended with new functionality.
+
+The Triton [Release
+Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
+and [Support
+Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
+indicate the required versions of the NVIDIA Driver and CUDA, and also
+describe supported GPUs.
+
+### User Documentation
+
+- [QuickStart](docs/quickstart.md)
+  - [Install](docs/quickstart.md#install-triton-docker-image)
+  - [Run](docs/quickstart.md#run-triton)
+- [Model Repository](docs/model_repository.md)
+- [Model Configuration](docs/model_configuration.md)
+- [Model Management](docs/model_management.md)
+- [Custom Operations](docs/custom_operations.md)
+- [Client Libraries](docs/client_libraries.md)
+- [Client Examples](docs/client_examples.md)
+- [Optimization](docs/optimization.md)
+  - [Model Analyzer](docs/model_analyzer.md)
+  - [Performance Analyzer](docs/perf_analyzer.md)
+- [Metrics](docs/metrics.md)
+
+The [quickstart](docs/quickstart.md) walks you through all the steps
+required to install and run Triton with an example image
+classification model and then use an example client application to
+perform inferencing using that model. The quickstart also demonstrates
+how [Triton supports both GPU systems and CPU-only
+systems](docs/quickstart.md#run-triton).
+
+The first step in using Triton to serve your models is to place one or
+more models into a [model
+repository](docs/model_repository.md). Optionally, depending on the type
+of the model and on what Triton capabilities you want to enable for
+the model, you may need to create a [model
+configuration](docs/model_configuration.md) for the model.  If your
+model has [custom operations](docs/custom_operations.md) you will need
+to make sure they are loaded correctly by Triton.
+
+After you have your model(s) available in Triton, you will want to
+send inference and other requests to Triton from your *client*
+application. The [Python and C++ client
+libraries](docs/client_libraries.md) provide
+[APIs](docs/client_libraries.md#client-library-apis) to simplify this
+communication. There are also a large number of [client
+examples](docs/client_examples.md) that demonstrate how to use the
+libraries.  You can also send HTTP/REST requests directly to Triton
+using the [HTTP/REST JSON-based
+protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
+[generate a GRPC client for many other
+languages](docs/client_libraries.md).
+
+Understanding and [optimizing performance](docs/optimization.md) is an
+important part of deploying your models. The Triton project provides
+the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
+Analyzer](docs/model_analyzer.md) to help your optimization
+efforts. Specifically, you will want to optimize [scheduling and
+batching](docs/architecture.md#models-and-schedulers) and [model
+instances](docs/model_configuration.md#instance-groups) appropriately
+for each model. You may also want to consider [ensembling multiple
+models and pre/post-processing](docs/architecture.md#ensemble-models)
+into a pipeline. In some cases you may find [individual inference
+request trace data](docs/trace.md) useful when optimizing. A
+[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
+and monitor aggregate inference metrics.
+
+NVIDIA publishes a number of [deep learning
+examples](https://github.com/NVIDIA/DeepLearningExamples) that use
+Triton.
+
+As part of your deployment strategy you may want to [explicitly manage
+what models are available by loading and unloading
+models](docs/model_management.md) from a running Triton server. If you
+are using Kubernetes for deployment a simple example of how to [deploy
+Triton using Kubernetes and Helm](deploy/single_server/README.md) may
+be helpful.
+
+The [version 1 to version 2 migration
+information](docs/v1_to_v2.md) is helpful if you are moving to
+version 2 of Triton from previously using version 1.
+
+### Developer Documentation
+
+- [Build](docs/build.md)
+- [Protocols and APIs](docs/inference_protocols.md).
+- [Backends](https://github.com/triton-inference-server/backend)
+- [Repository Agents](docs/repository_agents.md)
+- [Test](docs/test.md)
+
+Triton can be [built using
+Docker](docs/build.md#building-triton-with-docker) or [built without
+Docker](docs/build.md#building-triton-without-docker). After building
+you should [test Triton](docs/test.md).
+
+Starting with the r20.10 release, it is also possible to [create a
+Docker image containing a customized Triton](docs/compose.md) that
+contains only a subset of the backends.
+
+The Triton project also provides [client libraries for Python and
+C++](docs/client_libraries.md) that make it easy to communicate with
+the server. There are also a large number of [example
+clients](docs/client_examples.md) that demonstrate how to use the
+libraries. You can also develop your own clients that directly
+communicate with Triton using [HTTP/REST or GRPC
+protocols](docs/inference_protocols.md). There is also a [C
+API](docs/inference_protocols.md) that allows Triton to be linked
+directly into your application.
+
+A [Triton backend](https://github.com/triton-inference-server/backend)
+is the implementation that executes a model. A backend can interface
+with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
+ONNX Runtime; or it can interface with a data processing framework
+like [DALI](https://github.com/triton-inference-server/dali_backend);
+or you can extend Triton by [writing your own
+backend](https://github.com/triton-inference-server/backend) in either
+[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
+or
+[Python](https://github.com/triton-inference-server/python_backend).
+
+A [Triton repository agent](docs/repository_agents.md) extends Triton
+with new functionality that operates when a model is loaded or
+unloaded. You can introduce your own code to perform authentication,
+decryption, conversion, or similar operations when a model is loaded.
+
+## Papers and Presentation
+
+* [Maximizing Deep Learning Inference Performance with NVIDIA Model
+  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).
+
+* [High-Performance Inferencing at Scale Using the TensorRT Inference
+  Server](https://developer.nvidia.com/gtc/2020/video/s22418).
+
+* [Accelerate and Autoscale Deep Learning Inference on GPUs with
+  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).
+
+* [Deep into Triton Inference Server: BERT Practical Deployment on
+  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).
+
+* [Maximizing Utilization for Data Center Inference with TensorRT
+  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).
+
+* [NVIDIA TensorRT Inference Server Boosts Deep Learning
+  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).
+
+* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
+  Inference Server and
+  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).
+
+## Contributing
+
+Contributions to Triton Inference Server are more than welcome. To
+contribute make a pull request and follow the guidelines outlined in
+[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
+example or similar contribution that is not modifying the core of
+Triton, then you should file a PR in the [contrib
+repo](https://github.com/triton-inference-server/contrib).
+
+## Reporting problems, asking questions
+
+We appreciate any feedback, questions or bug reporting regarding this
+project. When help with code is needed, follow the process outlined in
+the Stack Overflow (https://stackoverflow.com/help/mcve)
+document. Ensure posted examples are:
+
+* minimal – use as little code as possible that still produces the
+  same problem
+
+* complete – provide all parts needed to reproduce the problem. Check
+  if you can strip external dependency and still show the problem. The
+  less time we spend on reproducing problems the more time we have to
+  fix it
+
+* verifiable – test the code you're about to provide to make sure it
+  reproduces the problem. Remove all other problems that are not
+  related to your request/question.