# Triton Inference Server

Triton Inference Server provides a cloud and edge inferencing solution
optimized for both CPUs and GPUs. Triton supports HTTP/REST and GRPC
protocols that allow remote clients to request inferencing for any
model being managed by the server. For edge deployments, Triton is
available as a shared library with a C API that allows the full
functionality of Triton to be included directly in an application.

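As a quick illustration of the HTTP/REST protocol, the hedged sketch below checks server and model health with the Python client library (`pip install tritonclient[http]`). It assumes Triton is listening on the default HTTP port 8000 and that a model named `densenet_onnx` (the quickstart example model) is loaded; adjust both for your deployment.

```python
# Minimal health-check sketch; URL and model name are assumptions.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Health endpoints defined by the KFServing v2 protocol.
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("densenet_onnx"))
```
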
The current release of the Triton Inference Server is 2.11.0 and
corresponds to the 21.06 release of the tritonserver container on
[NVIDIA GPU Cloud (NGC)](https://ngc.nvidia.com). The branch for this
release is
[r21.06](https://github.com/triton-inference-server/server/tree/r21.06).

## What's New in 2.11.0

* The [Forest Inference Library (FIL)](https://github.com/triton-inference-server/fil_backend)
  backend is added to Triton. The FIL backend allows forest models trained by
  several popular machine learning frameworks (including XGBoost, LightGBM,
  Scikit-Learn, and cuML) to be deployed in Triton.

* The Windows version of Triton now includes the
  [OpenVINO backend](https://github.com/triton-inference-server/openvino_backend).

* The Performance Analyzer (perf_analyzer) now supports testing against the
  Triton C API.

* The Python backend now allows the use of conda to create a unique execution
  environment for your Python model. See
  https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments.

* Python models that crash or exit unexpectedly are now automatically restarted
  by Triton.

* Model repositories in S3 storage can now be accessed using the HTTPS protocol.
  See https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#s3
  for more information.

* Triton now collects GPU metrics for MIG partitions.

* Passive model instances can now be specified in the model configuration. A
  passive model instance will be loaded and initialized by Triton, but no
  inference requests will be sent to the instance. Passive instances are
  typically used by a custom backend that uses its own mechanisms to distribute
  work to the passive instances. See the ModelInstanceGroup section of
  [model_config.proto](https://github.com/triton-inference-server/common/blob/r21.06/protobuf/model_config.proto)
  for the setting, and the configuration sketch after this list.

* NVDLA support is added to the TensorRT backend.

* ONNX Runtime version updated to 1.8.0.

* Windows build documentation simplified and improved.

* Improved detailed and summary reports in Model Analyzer.

* Added an offline mode to Model Analyzer.

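To illustrate the passive-instance setting mentioned above, the hedged sketch below writes a `config.pbtxt` for a hypothetical custom-backend model with one regular instance and one passive instance. The model name, backend name, and repository path are placeholders, and the field names should be verified against the ModelInstanceGroup message in model_config.proto for your Triton version.

```python
# Hedged sketch only: names and paths are placeholders, not part of this README.
from pathlib import Path

config = """
name: "my_custom_model"        # placeholder model name
backend: "my_custom_backend"   # placeholder custom backend
instance_group [
  { count: 1  kind: KIND_GPU },                 # serves inference requests
  { count: 1  kind: KIND_GPU  passive: true }   # loaded, but never scheduled by Triton
]
"""

model_dir = Path("model_repository/my_custom_model")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(config.lstrip())
```
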
## Features

* [Multiple deep-learning
  frameworks](https://github.com/triton-inference-server/backend). Triton
  can manage any number and mix of models (limited by system disk and
  memory resources). Triton supports TensorRT, TensorFlow GraphDef,
  TensorFlow SavedModel, ONNX, PyTorch TorchScript and OpenVINO model
  formats. Both TensorFlow 1.x and TensorFlow 2.x are
  supported. Triton also supports TensorFlow-TensorRT and
  ONNX-TensorRT integrated models.

* [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution). Multiple
  models (or multiple instances of the same model) can run
  simultaneously on the same GPU or on multiple GPUs.

* [Dynamic batching](docs/architecture.md#models-and-schedulers). For
  models that support batching, Triton implements multiple scheduling
  and batching algorithms that combine individual inference requests
  together to improve inference throughput. These scheduling and
  batching decisions are transparent to the client requesting
  inference. A configuration sketch that enables dynamic batching
  appears after this list.

* [Extensible
  backends](https://github.com/triton-inference-server/backend). In
  addition to deep-learning frameworks, Triton provides a *backend
  API* that allows Triton to be extended with any model execution
  logic implemented in
  [Python](https://github.com/triton-inference-server/python_backend)
  or
  [C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api),
  while still benefiting from the CPU and GPU support, concurrent
  execution, dynamic batching and other features provided by Triton.

* [Model pipelines](docs/architecture.md#ensemble-models). A Triton
  *ensemble* represents a pipeline of one or more models and the
  connection of input and output tensors between those models. A
  single inference request to an ensemble will trigger the execution
  of the entire pipeline.

* [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community
  developed [KFServing
  protocol](https://github.com/kubeflow/kfserving/tree/master/docs/predict-api/v2).

* A [C API](docs/inference_protocols.md#c-api) allows Triton to be
  linked directly into your application for edge and other in-process
  use cases.

* [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, and server latency. The metrics are provided in
  Prometheus data format.

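The dynamic batching feature described above is enabled per model in its configuration. The hedged sketch below writes a minimal `config.pbtxt` for a hypothetical ONNX model; the model name, batch sizes, and queue delay are illustrative, and the field names should be confirmed against model_config.proto.

```python
# Hedged sketch only: model name and values are placeholders.
from pathlib import Path

config = """
name: "resnet50_onnx"          # placeholder model name
platform: "onnxruntime_onnx"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]       # batch sizes the scheduler tries to form
  max_queue_delay_microseconds: 100    # how long a request may wait for a batch
}
"""

model_dir = Path("model_repository/resnet50_onnx")
model_dir.mkdir(parents=True, exist_ok=True)
(model_dir / "config.pbtxt").write_text(config.lstrip())
```
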
## Documentation

[Triton Architecture](docs/architecture.md) gives a high-level
overview of the structure and capabilities of the inference
server. There is also an [FAQ](docs/faq.md). Additional documentation
is divided into [*user*](#user-documentation) and
[*developer*](#developer-documentation) sections. The *user*
documentation describes how to use Triton as an inference solution,
including information on how to configure Triton, how to organize and
configure your models, how to use the C++ and Python clients, etc. The
*developer* documentation describes how to build and test Triton and
also how Triton can be extended with new functionality.

The Triton [Release
Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
and [Support
Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
indicate the required versions of the NVIDIA Driver and CUDA, and also
describe supported GPUs.

### User Documentation

- [QuickStart](docs/quickstart.md)
  - [Install](docs/quickstart.md#install-triton-docker-image)
  - [Run](docs/quickstart.md#run-triton)
- [Model Repository](docs/model_repository.md)
- [Model Configuration](docs/model_configuration.md)
- [Model Management](docs/model_management.md)
- [Custom Operations](docs/custom_operations.md)
- [Client Libraries and Examples](https://github.com/triton-inference-server/client)
- [Optimization](docs/optimization.md)
  - [Model Analyzer](docs/model_analyzer.md)
  - [Performance Analyzer](docs/perf_analyzer.md)
- [Metrics](docs/metrics.md)

The [quickstart](docs/quickstart.md) walks you through all the steps
required to install and run Triton with an example image
classification model and then use an example client application to
perform inferencing using that model. The quickstart also demonstrates
how [Triton supports both GPU systems and CPU-only
systems](docs/quickstart.md#run-triton).

The first step in using Triton to serve your models is to place one or
more models into a [model
repository](docs/model_repository.md). Optionally, depending on the type
of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/model_configuration.md) for the model. If your
model has [custom operations](docs/custom_operations.md) you will need
to make sure they are loaded correctly by Triton.

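The hedged sketch below builds the kind of layout the model repository documentation describes: one directory per model, a numeric version subdirectory holding the model file, and an optional `config.pbtxt`. The repository path, model name, and source model file are placeholders.

```python
# Hedged sketch only: paths and names are placeholders, not from the docs above.
from pathlib import Path
import shutil

repo = Path("my_models")                    # model repository root passed to Triton
version_dir = repo / "densenet_onnx" / "1"  # <model-name>/<version>/
version_dir.mkdir(parents=True, exist_ok=True)

# Place the trained model inside the version directory; ONNX models are
# conventionally named model.onnx.
shutil.copy("densenet-12.onnx", version_dir / "model.onnx")

# A minimal model configuration; some model formats can be auto-configured
# by Triton, in which case this file may be optional.
(repo / "densenet_onnx" / "config.pbtxt").write_text(
    'name: "densenet_onnx"\nplatform: "onnxruntime_onnx"\n'
)
```
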
After you have your model(s) available in Triton, you will want to
send inference and other requests to Triton from your *client*
application. The [Python and C++ client
libraries](https://github.com/triton-inference-server/client) provide
APIs to simplify this communication. There are also a large number of
[client examples](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also send
HTTP/REST requests directly to Triton using the [HTTP/REST JSON-based
protocol](docs/inference_protocols.md#httprest-and-grpc-protocols) or
[generate a GRPC client for many other
languages](https://github.com/triton-inference-server/client).

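As a concrete illustration, the hedged sketch below sends one inference request with the Python HTTP client (`pip install tritonclient[http]`). The model name, tensor names, and shapes are placeholders; substitute the values from your model's configuration.

```python
# Hedged sketch only: "my_model", "INPUT__0", and "OUTPUT__0" are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe the input tensor and attach data from a NumPy array.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(input_data.shape), "FP32")
infer_input.set_data_from_numpy(input_data)

# Request a specific output tensor by name.
infer_output = httpclient.InferRequestedOutput("OUTPUT__0")

response = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[infer_output],
)
print(response.as_numpy("OUTPUT__0").shape)
```
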
Understanding and [optimizing performance](docs/optimization.md) is an
important part of deploying your models. The Triton project provides
the [Performance Analyzer](docs/perf_analyzer.md) and the [Model
Analyzer](docs/model_analyzer.md) to help your optimization
efforts. Specifically, you will want to optimize [scheduling and
batching](docs/architecture.md#models-and-schedulers) and [model
instances](docs/model_configuration.md#instance-groups) appropriately
for each model. You may also want to consider [ensembling multiple
models and pre/post-processing](docs/architecture.md#ensemble-models)
into a pipeline. In some cases you may find [individual inference
request trace data](docs/trace.md) useful when optimizing. A
[Prometheus metrics endpoint](docs/metrics.md) allows you to visualize
and monitor aggregate inference metrics.

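The Prometheus metrics mentioned above can be read with nothing but the standard library. The sketch below assumes the default metrics port 8002 and filters on metric name prefixes that may differ across Triton versions; see docs/metrics.md for the authoritative list.

```python
# Hedged sketch only: port 8002 and the metric name prefixes are assumptions.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    text = resp.read().decode("utf-8")

# Print a few GPU utilization and inference count samples.
for line in text.splitlines():
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)
```
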
NVIDIA publishes a number of [deep learning
examples](https://github.com/NVIDIA/DeepLearningExamples) that use
Triton.

As part of your deployment strategy you may want to [explicitly manage
what models are available by loading and unloading
models](docs/model_management.md) from a running Triton server. If you
are using Kubernetes for deployment, there are simple examples of how
to deploy Triton using Kubernetes and Helm, one for
[GCP](deploy/gcp/README.md) and one for [AWS](deploy/aws/README.md).

The [version 1 to version 2 migration
information](docs/v1_to_v2.md) is helpful if you are moving to
version 2 of Triton after previously using version 1.

### Developer Documentation

- [Build](docs/build.md)
- [Protocols and APIs](docs/inference_protocols.md)
- [Backends](https://github.com/triton-inference-server/backend)
- [Repository Agents](docs/repository_agents.md)
- [Test](docs/test.md)

Triton can be [built using
Docker](docs/build.md#building-triton-with-docker) or [built without
Docker](docs/build.md#building-triton-without-docker). After building
you should [test Triton](docs/test.md).

It is also possible to [create a Docker image containing a customized
Triton](docs/compose.md) that contains only a subset of the backends.

The Triton project also provides [client libraries for Python and
C++](https://github.com/triton-inference-server/client) that make it
easy to communicate with the server. There are also a large number of
[example clients](https://github.com/triton-inference-server/client)
that demonstrate how to use the libraries. You can also develop your
own clients that directly communicate with Triton using [HTTP/REST or
GRPC protocols](docs/inference_protocols.md). There is also a [C
API](docs/inference_protocols.md) that allows Triton to be linked
directly into your application.

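Communicating with Triton without the client libraries is straightforward over the KFServing v2 HTTP/REST protocol. The hedged sketch below uses only the Python standard library; the model name, tensor name, shape, and data are placeholders.

```python
# Hedged sketch only: "my_model" and "INPUT__0" are placeholders.
import json
import urllib.request

request_body = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

req = urllib.request.Request(
    "http://localhost:8000/v2/models/my_model/infer",
    data=json.dumps(request_body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```
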
A [Triton backend](https://github.com/triton-inference-server/backend)
is the implementation that executes a model. A backend can interface
with a deep learning framework, like PyTorch, TensorFlow, TensorRT or
ONNX Runtime; or it can interface with a data processing framework
like [DALI](https://github.com/triton-inference-server/dali_backend);
or you can extend Triton by [writing your own
backend](https://github.com/triton-inference-server/backend) in either
[C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
or
[Python](https://github.com/triton-inference-server/python_backend).

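For the Python route, a backend model is a `model.py` placed in the model's version directory. The hedged sketch below shows the general shape of that file; the tensor names are placeholders and the authoritative interface is documented in the python_backend repository linked above.

```python
# Hedged sketch only: "INPUT0" and "OUTPUT0" are placeholder tensor names.
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] is a JSON string of the model configuration.
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        # Return exactly one response per request in the batch.
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor("OUTPUT0", in_tensor.as_numpy() * 2)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called once when the model is unloaded.
        pass
```
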
A [Triton repository agent](docs/repository_agents.md) extends Triton
with new functionality that operates when a model is loaded or
unloaded. You can introduce your own code to perform authentication,
decryption, conversion, or similar operations when a model is loaded.

## Papers and Presentations

* [Maximizing Deep Learning Inference Performance with NVIDIA Model
  Analyzer](https://developer.nvidia.com/blog/maximizing-deep-learning-inference-performance-with-nvidia-model-analyzer/).

* [High-Performance Inferencing at Scale Using the TensorRT Inference
  Server](https://developer.nvidia.com/gtc/2020/video/s22418).

* [Accelerate and Autoscale Deep Learning Inference on GPUs with
  KFServing](https://developer.nvidia.com/gtc/2020/video/s22459).

* [Deep into Triton Inference Server: BERT Practical Deployment on
  NVIDIA GPU](https://developer.nvidia.com/gtc/2020/video/s21736).

* [Maximizing Utilization for Data Center Inference with TensorRT
  Inference Server](https://on-demand-gtc.gputechconf.com/gtcnew/sessionview.php?sessionName=s9438-maximizing+utilization+for+data+center+inference+with+tensorrt+inference+server).

* [NVIDIA TensorRT Inference Server Boosts Deep Learning
  Inference](https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/).

* [GPU-Accelerated Inference for Kubernetes with the NVIDIA TensorRT
  Inference Server and
  Kubeflow](https://www.kubeflow.org/blog/nvidia_tensorrt/).

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, make a pull request and follow the guidelines outlined in
[CONTRIBUTING.md](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reports regarding this
project. When help with code is needed, follow the process outlined in
the Stack Overflow (https://stackoverflow.com/help/mcve)
document. Ensure posted examples are:

* minimal – use as little code as possible that still produces the
  same problem

* complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix them

* verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.