
Commit 98ee6a1

Update README for 22.08 release
1 parent b3d7a33 commit 98ee6a1

File tree

2 files changed: +338 -2 lines changed

README.md

+222-2
@@ -26,5 +26,225 @@
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

- **NOTE: You are currently on the r22.08 branch which tracks stabilization
- towards the next release. This branch is not usable during stabilization.**
# Triton Inference Server

[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

**LATEST RELEASE: You are currently on the main branch which tracks
under-development progress towards the next release. The current release is
version [2.25.0](https://github.com/triton-inference-server/server/tree/r22.08)
and corresponds to the 22.08 container release on
[NVIDIA GPU Cloud (NGC)](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).**

----

Triton Inference Server is open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
supports inference across cloud, data center, edge, and embedded devices on
NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton delivers optimized
performance for many query types, including real-time, batched, ensemble, and
audio/video streaming.

Major features include:

- [Supports multiple deep learning
  frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
  execution](docs/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/model_configuration.md#dynamic-batcher) (see the
  configuration sketch after this list)
- [Sequence batching](docs/model_configuration.md#sequence-batcher) and
  [implicit state management](docs/architecture.md#implicit-state-management)
  for stateful models
- Provides a [Backend API](https://github.com/triton-inference-server/backend) that
  allows adding custom backends and pre/post-processing operations
- Model pipelines using
  [Ensembling](docs/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- [HTTP/REST and GRPC inference
  protocols](docs/inference_protocols.md) based on the community-developed
  [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C API](docs/inference_protocols.md#in-process-triton-server-api) and
  [Java API](docs/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
  allow Triton to link directly into your application for edge and other in-process use cases
- [Metrics](docs/metrics.md) indicating GPU utilization, server
  throughput, server latency, and more
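
As a hedged illustration of how dynamic batching and concurrent model execution
are enabled, the sketch below writes a `config.pbtxt` for a hypothetical ONNX
model; the field values (batch sizes, queue delay, instance count) are
illustrative assumptions rather than recommendations, and the
[model configuration documentation](docs/model_configuration.md) is the
authoritative reference.

```bash
# Sketch only: enable dynamic batching and two GPU instances for a
# hypothetical model "my_model" (all values are illustrative assumptions).
mkdir -p model_repository/my_model
cat > model_repository/my_model/config.pbtxt <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
EOF
```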

Need enterprise support? NVIDIA global support is available for Triton
Inference Server with the
[NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).

## Serve a Model in 3 Easy Steps

```bash
# Step 1: Create the example model repository
git clone -b r22.08 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch Triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models

# Step 3: In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.08-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT
```

Please read the [QuickStart](docs/quickstart.md) guide for additional information
regarding this example. The quickstart guide also contains an example of how to
launch Triton on [CPU-only systems](docs/quickstart.md#run-on-cpu-only-system).
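
As a minimal sketch (not taken from the quickstart itself), the same server can
be launched on a CPU-only machine by dropping the `--gpus` flag, assuming the
example models run on CPU-capable backends:

```bash
# Sketch: launch Triton without GPUs; the example models fall back to CPU
# execution where their backends support it.
docker run --rm --net=host \
  -v /full/path/to/docs/examples/model_repository:/models \
  nvcr.io/nvidia/tritonserver:22.08-py3 \
  tritonserver --model-repository=/models
```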

## Examples and Tutorials

Check out [NVIDIA LaunchPad](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/trial/)
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM,
are located in the
[NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
page on GitHub. The
[NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-triton-inference-server)
contains additional documentation, presentations, and examples.

## Documentation

### Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker
images.

- [Install Triton Inference Server with Docker containers](docs/build.md#building-triton-with-docker) (*Recommended*)
- [Install Triton Inference Server without Docker containers](docs/build.md#building-triton-without-docker)
- [Build a custom Triton Inference Server Docker container](docs/compose.md)
  (see the sketch after this list)
- [Build Triton Inference Server from source](docs/build.md#building-on-unsupported-platforms)
- [Build Triton Inference Server for Windows 10](docs/build.md#building-for-windows-10)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on [GCP](deploy/gcp/README.md),
  [AWS](deploy/aws/README.md), and [NVIDIA FleetCommand](deploy/fleetcommand/README.md)
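
As a hedged sketch of the custom-container flow described in
[docs/compose.md](docs/compose.md), `compose.py` can assemble an image
containing only the backends you need; the flags shown here are assumptions
about its CLI and should be checked against that document for your release.

```bash
# Sketch: build a slimmer Triton image with only selected backends.
# Flag names are assumptions; consult docs/compose.md for the exact syntax.
git clone -b r22.08 https://github.com/triton-inference-server/server.git
cd server
python3 compose.py --backend onnxruntime --backend python \
    --output-name tritonserver_custom
```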

### Using Triton

#### Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/model_repository.md). Depending on
the model type and on which Triton capabilities you want to enable, you may
also need to create a [model configuration](docs/model_configuration.md) for
the model.
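
As a small sketch (the model and file names are placeholders), a repository
holding a single ONNX model with one version could be assembled like this:

```bash
# Sketch: create a single-model repository for a hypothetical ONNX model.
mkdir -p model_repository/my_model/1
cp /path/to/model.onnx model_repository/my_model/1/model.onnx
# Optional when the backend can auto-complete the configuration:
cp /path/to/config.pbtxt model_repository/my_model/config.pbtxt
```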

- [Add custom operations to Triton if needed by your model](docs/custom_operations.md)
- Enable model pipelining with [Model Ensemble](docs/architecture.md#ensemble-models)
  and [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- Optimize your models by setting [scheduling and batching](docs/architecture.md#models-and-schedulers)
  parameters and [model instances](docs/model_configuration.md#instance-groups)
- Use the [Model Analyzer tool](https://github.com/triton-inference-server/model_analyzer)
  to help optimize your model configuration with profiling (see the sketch
  after this list)
- Learn how to [explicitly manage what models are available by loading and
  unloading models](docs/model_management.md)
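
As a hedged sketch of that Model Analyzer workflow, the `profile` subcommand
sweeps candidate configurations for a model; the package name and flags below
are assumptions based on Model Analyzer's CLI documentation and may differ
between versions.

```bash
# Sketch: profile a hypothetical model with Model Analyzer.
# Package name and flags are assumptions; see the Model Analyzer docs.
pip install triton-model-analyzer
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models my_model
```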

#### Configure and Use Triton Inference Server

- Read the [Quick Start Guide](docs/quickstart.md) to run Triton Inference
  Server on both GPU and CPU
- Triton supports multiple execution engines, called
  [backends](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton), including
  [TensorRT](https://github.com/triton-inference-server/tensorrt_backend),
  [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend),
  [PyTorch](https://github.com/triton-inference-server/pytorch_backend),
  [ONNX](https://github.com/triton-inference-server/onnxruntime_backend),
  [OpenVINO](https://github.com/triton-inference-server/openvino_backend),
  [Python](https://github.com/triton-inference-server/python_backend), and more
- Not all of the above backends are supported on every platform that Triton
  supports. See the
  [Backend-Platform Support Matrix](https://github.com/triton-inference-server/backend/blob/main/docs/backend_platform_support_matrix.md)
  to learn which backends are available on your target platform
- Learn how to [optimize performance](docs/optimization.md) using the
  [Performance Analyzer](docs/perf_analyzer.md) and
  [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
- Learn how to [manage loading and unloading models](docs/model_management.md) in
  Triton
- Send requests directly to Triton with the [HTTP/REST JSON-based
  or gRPC protocols](docs/inference_protocols.md#httprest-and-grpc-protocols),
  as sketched below
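
A minimal sketch of the HTTP/REST protocol, assuming a server started as in the
quickstart with the default ports, and a hypothetical model `my_model` that
takes a single INT32 input named `INPUT0`:

```bash
# Check server health and list the models in the repository (HTTP port 8000).
curl -v localhost:8000/v2/health/ready
curl -X POST localhost:8000/v2/repository/index

# Run inference on a hypothetical model with one INT32 input of shape [1,4].
curl -X POST localhost:8000/v2/models/my_model/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs":[{"name":"INPUT0","shape":[1,4],"datatype":"INT32","data":[1,2,3,4]}]}'

# Prometheus metrics are served on port 8002 by default.
curl localhost:8002/metrics
```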

#### Client Support and Examples

A Triton *client* application sends inference and other requests to Triton. The
[Python and C++ client libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication.
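
As a brief sketch, the Python client library is available from PyPI, and
prebuilt clients and examples ship in the NGC SDK container:

```bash
# Install the Python client library (HTTP and gRPC support) from PyPI.
pip install 'tritonclient[all]'

# Or use the prebuilt clients and examples in the SDK container.
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.08-py3-sdk
```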

- Review client examples for [C++](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/examples),
  [Python](https://github.com/triton-inference-server/client/blob/main/src/python/examples),
  and [Java](https://github.com/triton-inference-server/client/blob/main/src/java/src/main/java/triton/client/examples)
- Configure [HTTP](https://github.com/triton-inference-server/client#http-options)
  and [gRPC](https://github.com/triton-inference-server/client#grpc-options)
  client options
- Send input data (e.g. a JPEG image) directly to Triton in the [body of an HTTP
  request without any additional metadata](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md#raw-binary-request)

### Extend Triton

[Triton Inference Server's architecture](docs/architecture.md) is specifically
designed for modularity and flexibility.

- [Customize the Triton Inference Server container](docs/compose.md) for your use case
- [Create custom backends](https://github.com/triton-inference-server/backend)
  in either [C/C++](https://github.com/triton-inference-server/backend/blob/main/README.md#triton-backend-api)
  or [Python](https://github.com/triton-inference-server/python_backend)
- Create [decoupled backends and models](docs/decoupled_models.md) that can send
  multiple responses for a request, or no response at all
- Use a [Triton repository agent](docs/repository_agents.md) to add functionality
  that operates when a model is loaded and unloaded, such as authentication,
  decryption, or conversion
- Deploy Triton on [Jetson and JetPack](docs/jetson.md)
- [Use Triton on AWS
  Inferentia](https://github.com/triton-inference-server/python_backend/tree/main/inferentia)

### Additional Documentation

- [FAQ](docs/faq.md)
- [User Guide](docs#user-guide)
- [Developer Guide](docs#developer-guide)
- [Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
- [GPU, Driver, and CUDA Support
  Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions, or bug reports regarding this project.
When posting [issues in GitHub](https://github.com/triton-inference-server/server/issues),
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:

- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend reproducing problems, the more time we have to
  fix them
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

## For more information

Please refer to the [NVIDIA Developer Triton page](https://developer.nvidia.com/nvidia-triton-inference-server)
for more information.

RELEASE.md

+116
@@ -0,0 +1,116 @@
<!--
# Copyright 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Release Notes for 2.25.0

## New Features and Improvements

* New
  [support for multiple cloud credentials](https://github.com/triton-inference-server/server/blob/main/docs/model_repository.md#cloud-storage-with-credential-file-beta)
  has been enabled. This feature is in beta and is subject to change.

* Models using custom backends which implement
  [auto-complete configuration](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md#auto-generated-model-configuration)
  can be loaded without an explicit config.pbtxt file if they are named in the
  form `<model_name>.<backend_name>`.

* Users can specify a maximum memory limit when loading models onto the GPU
  with the new
  [--model-load-gpu-limit](https://github.com/triton-inference-server/server/blob/b3d7a3375e7adb1341724c0ac34661b4cde23cd2/src/main.cc#L629-L635)
  tritonserver option and the
  [TRITONSERVER_ServerOptionsSetModelLoadDeviceLimit](https://github.com/triton-inference-server/core/blob/c9cd6630ecb04bb26e2110cd65a37f23aec8153b/include/triton/core/tritonserver.h#L1861-L1872)
  C API function.
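
  A hedged sketch of the command-line option; the `<device-id>:<fraction>`
  argument format is an assumption based on the linked `main.cc` source and
  should be verified against `tritonserver --help`:

  ```bash
  # Sketch: cap model loads at roughly 80% of GPU 0's memory
  # (argument format is an assumption; check `tritonserver --help`).
  tritonserver --model-repository=/models --model-load-gpu-limit 0:0.8
  ```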

* Added new documentation,
  [Performance Tuning](https://github.com/triton-inference-server/server/blob/main/docs/performance_tuning.md),
  with a step-by-step guide to optimizing models for production.

* From this release onwards, Triton will default to
  [TensorFlow version 2.X](https://github.com/triton-inference-server/tensorflow_backend/tree/main#--backend-configtensorflowversionint).
  TensorFlow version 1.X can still be manually specified via the backend config.
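
  For example (a minimal sketch using the backend-config syntax from the linked
  TensorFlow backend documentation):

  ```bash
  # Explicitly select TensorFlow 1.X; omit the flag to get the new 2.X default.
  tritonserver --model-repository=/models --backend-config=tensorflow,version=1
  ```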

* The PyTorch backend has improved performance by using a separate CUDA stream
  for each model instance when the instance kind is GPU.

* Refer to the 22.08 column of the
  [Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
  for the container image versions on which the 22.08 inference server
  container is based.

* Model Analyzer's profile subcommand now analyzes the results after profiling
  is completed. Usage of the analyze subcommand is deprecated. See
  [Model Analyzer's documentation](https://github.com/triton-inference-server/model_analyzer/blob/main/docs/cli.md#subcommand-profile)
  for further details.

## Known Issues

* There is no JetPack release for 22.08; the latest release is 22.07.

* Auto-complete may cause an increase in server start time. To avoid a start
  time increase, users can provide the full model configuration and launch the
  server with `--disable-auto-complete-config`.

* When auto-completing some model configs, backends may generate a model config
  even though there is not enough metadata (e.g. GraphDef models for the
  TensorFlow backend). The model will appear to load successfully but will fail
  at inference. In this case the user should provide the full model
  configuration for these models, or use the `--disable-auto-complete-config`
  CLI option to show which models fail to load.

* Auto-complete does not support PyTorch models due to lack of metadata in the
  model. It can only verify that the number of inputs and the input names
  match what is specified in the model configuration. There is no model
  metadata about the number of outputs and datatypes. Related PyTorch bug:
  https://github.com/pytorch/pytorch/issues/38273

* Auto-complete is not supported in the OpenVINO backend.

* The Perf Analyzer stability criteria have been changed, which may result in
  reported instability for scenarios that were previously considered stable.
  This change has been made to improve the accuracy of Perf Analyzer results.
  If you observe an instability message, it can usually be resolved by
  increasing `--measurement-interval` in time-windows mode or
  `--measurement-request-count` in count-windows mode.
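
  A hedged sketch of both adjustments (the model name and values are
  placeholders):

  ```bash
  # Time-windows mode: lengthen each measurement window (in milliseconds).
  perf_analyzer -m my_model --measurement-interval 10000

  # Count-windows mode: require more requests per measurement window.
  perf_analyzer -m my_model --measurement-mode count_windows \
      --measurement-request-count 100
  ```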

* Triton Client PIP wheels for ARM SBSA are not available from PyPI, so pip
  will install an incorrect Jetson version of the Triton Client library for
  Arm SBSA.

  The correct client wheel file can be pulled directly from the Arm SBSA SDK
  image and manually installed.
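
  A hedged sketch of that manual install; the wheel location inside the SDK
  image is an assumption and should be confirmed by inspecting the container:

  ```bash
  # Copy the client wheels out of the Arm SBSA SDK image and install locally
  # (the in-container path is an assumption; verify it in the image).
  docker create --name triton-sdk nvcr.io/nvidia/tritonserver:22.08-py3-sdk
  docker cp triton-sdk:/workspace/install/python ./triton_client_wheels
  docker rm triton-sdk
  pip install ./triton_client_wheels/tritonclient-*_aarch64.whl
  ```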

* Traced models in PyTorch seem to create overflows when int8 tensor values are
  transformed to int32 on the GPU.

  Refer to https://github.com/pytorch/pytorch/issues/66930 for more information.

* Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).

* Triton metrics might not work if the host machine is running a separate DCGM
  agent on bare metal or in a container.

* Model Analyzer's reported values for GPU utilization and GPU power are known
  to be inaccurate and generally lower than reality.
