
Commit 46dbbe7

Tabrizian, mc-nv, and tanmayv25 authored

Update README and add RELEASE notes for 23.06 (#5991)

* Update README.md for 23.06
* Update documentation structure
* Update RELEASE.md

Co-authored-by: Tanmay Verma <[email protected]>

---------

Co-authored-by: Misha Chornyi <[email protected]>
Co-authored-by: Tanmay Verma <[email protected]>

1 parent 2c19b95 commit 46dbbe7

File tree: 2 files changed, +387 -3 lines changed


README.md (+281 -3)
@@ -30,6 +30,284 @@
[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)

Removed:

**Note** <br>
You are currently on the r23.06 branch which tracks stabilization towards the next release.<br>
This branch is not usable during stabilization.

Added:
Triton Inference Server is an open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
supports inference across cloud, data center, edge, and embedded devices on
NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton delivers optimized
performance for many query types, including real time, batched, ensembles, and
audio/video streaming.
Major features include:

- [Supports multiple deep learning
  frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
  execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
  [implicit state
  management](docs/user_guide/architecture.md#implicit-state-management)
  for stateful models
- Provides [Backend API](https://github.com/triton-inference-server/backend) that
  allows adding custom backends and pre/post processing operations
- Model pipelines using
  [Ensembling](docs/user_guide/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- [HTTP/REST and GRPC inference
  protocols](docs/customization_guide/inference_protocols.md) based on the
  community developed [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C
  API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api)
  and [Java
  API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
  allow Triton to link directly into your application for edge and other
  in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
  throughput, server latency, and more
**New to Triton Inference Server?** Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials)
to begin your Triton journey!

Join the [Triton and TensorRT
community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/)
and stay current on the latest product updates, bug fixes, content, best
practices, and more. Need enterprise support? NVIDIA global support is
available for Triton Inference Server with the
[NVIDIA AI Enterprise software
suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).
## Serve a Model in 3 Easy Steps

```bash
# Step 1: Create the example model repository
git clone -b r23.06 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.06-py3 tritonserver --model-repository=/models

# Step 3: Sending an Inference Request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.06-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT
```
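Once the server from Step 2 is running, you can sanity-check it before sending inference requests. A minimal sketch, assuming the default HTTP port 8000 and metrics port 8002 on localhost:

```bash
# Liveness and readiness checks via the KServe-style HTTP endpoints
curl -v localhost:8000/v2/health/live
curl -v localhost:8000/v2/health/ready

# Confirm the example model from Step 1 is loaded and ready
curl -v localhost:8000/v2/models/densenet_onnx/ready

# Scrape Prometheus-format metrics (GPU utilization, latency, throughput, ...)
curl localhost:8002/metrics
```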
Please read the [QuickStart](docs/getting_started/quickstart.md) guide for
additional information regarding this example. The quickstart guide also
contains an example of how to launch Triton on [CPU-only
systems](docs/getting_started/quickstart.md#run-on-cpu-only-system). New to
Triton and wondering where to get started? Watch the [Getting Started
video](https://youtu.be/NQDtfSi5QF4).
## Examples and Tutorials

Check out [NVIDIA
LaunchPad](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/trial/)
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM
are located in the
[NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
page on GitHub. The
[NVIDIA Developer
Zone](https://developer.nvidia.com/nvidia-triton-inference-server)
contains additional documentation, presentations, and examples.
## Documentation

### Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker
images.

- [Install Triton Inference Server with Docker
  containers](docs/customization_guide/build.md#building-with-docker)
  (*Recommended*)
- [Install Triton Inference Server without Docker
  containers](docs/customization_guide/build.md#building-without-docker)
- [Build a custom Triton Inference Server Docker
  container](docs/customization_guide/compose.md)
- [Build Triton Inference Server from
  source](docs/customization_guide/build.md#building-on-unsupported-platforms)
- [Build Triton Inference Server for Windows
  10](docs/customization_guide/build.md#building-for-windows-10)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on
  [GCP](deploy/gcp/README.md),
  [AWS](deploy/aws/README.md), and [NVIDIA
  FleetCommand](deploy/fleetcommand/README.md)
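For orientation, a minimal sketch of the Docker route recommended above; it assumes access to the NGC registry, reuses the 23.06 tag from the quickstart, and shows a CPU-only launch (add `--gpus=all` on GPU hosts):

```bash
# Pull the prebuilt Triton server image from NGC
docker pull nvcr.io/nvidia/tritonserver:23.06-py3

# CPU-only launch against a local model repository
docker run --rm --net=host \
  -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.06-py3 \
  tritonserver --model-repository=/models
```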
### Using Triton

#### Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/user_guide/model_repository.md).
Depending on the type of the model and on what Triton capabilities you want to
enable for the model, you may need to create a [model
configuration](docs/user_guide/model_configuration.md) for the model.
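As a concrete illustration of the repository layout, here is a minimal sketch; the model name, file, and configuration values are hypothetical placeholders, not files in this repository:

```bash
# Layout: <repository>/<model-name>/<version>/<model-file>
mkdir -p model_repository/my_model/1
cp my_model.onnx model_repository/my_model/1/model.onnx

# Optional config.pbtxt next to the version directories; many backends can
# auto-complete this, but it can also be written explicitly:
cat > model_repository/my_model/config.pbtxt <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
EOF

# Point the server at the repository
tritonserver --model-repository=$(pwd)/model_repository
```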
- [Add custom operations to Triton if needed by your
  model](docs/user_guide/custom_operations.md)
- Enable model pipelining with [Model
  Ensemble](docs/user_guide/architecture.md#ensemble-models)
  and [Business Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- Optimize your models by setting [scheduling and
  batching](docs/user_guide/architecture.md#models-and-schedulers)
  parameters and [model
  instances](docs/user_guide/model_configuration.md#instance-groups).
- Use the [Model Analyzer
  tool](https://github.com/triton-inference-server/model_analyzer)
  to help optimize your model configuration with profiling
- Learn how to [explicitly manage what models are available by loading and
  unloading models](docs/user_guide/model_management.md) (see the sketch after
  this list)
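As referenced in the model-management entry above, a minimal sketch of explicit loading and unloading through the HTTP repository endpoints; it assumes the server was started with `--model-control-mode=explicit`, and `my_model` is a placeholder name:

```bash
# List the models in the repository and their current state
curl -X POST localhost:8000/v2/repository/index

# Explicitly load, then unload, a model
curl -X POST localhost:8000/v2/repository/models/my_model/load
curl -X POST localhost:8000/v2/repository/models/my_model/unload
```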
#### Configure and Use Triton Inference Server

- Read the [Quick Start Guide](docs/getting_started/quickstart.md) to run Triton
  Inference Server on both GPU and CPU
- Triton supports multiple execution engines, called
  [backends](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton),
  including
  [TensorRT](https://github.com/triton-inference-server/tensorrt_backend),
  [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend),
  [PyTorch](https://github.com/triton-inference-server/pytorch_backend),
  [ONNX](https://github.com/triton-inference-server/onnxruntime_backend),
  [OpenVINO](https://github.com/triton-inference-server/openvino_backend),
  [Python](https://github.com/triton-inference-server/python_backend), and more
- Not all the above backends are supported on every platform supported by
  Triton. Look at the
  [Backend-Platform Support
  Matrix](https://github.com/triton-inference-server/backend/blob/r23.06/docs/backend_platform_support_matrix.md)
  to learn which backends are supported on your target platform.
- Learn how to [optimize performance](docs/user_guide/optimization.md) using the
  [Performance
  Analyzer](https://github.com/triton-inference-server/client/blob/r23.06/src/c++/perf_analyzer/README.md)
  and
  [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
  (see the Performance Analyzer sketch after this list)
- Learn how to [manage loading and unloading
  models](docs/user_guide/model_management.md) in Triton
- Send requests directly to Triton with the [HTTP/REST JSON-based
  or gRPC
  protocols](docs/customization_guide/inference_protocols.md#httprest-and-grpc-protocols)
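As referenced in the performance entry above, a minimal sketch of a Performance Analyzer run from the SDK container; the model name and concurrency range are illustrative assumptions and presume a server already running on localhost:

```bash
# Measure throughput and latency at increasing client concurrency
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.06-py3-sdk \
  /workspace/install/bin/perf_analyzer -m densenet_onnx --concurrency-range 1:4
```

The per-concurrency throughput and latency it reports are the kind of measurements Model Analyzer builds on when sweeping model configurations.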
#### Client Support and Examples

A Triton *client* application sends inference and other requests to Triton. The
[Python and C++ client
libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication.

- Review client examples for
  [C++](https://github.com/triton-inference-server/client/blob/r23.06/src/c%2B%2B/examples),
  [Python](https://github.com/triton-inference-server/client/blob/r23.06/src/python/examples),
  and
  [Java](https://github.com/triton-inference-server/client/blob/r23.06/src/java/src/r23.06/java/triton/client/examples)
- Configure
  [HTTP](https://github.com/triton-inference-server/client#http-options)
  and [gRPC](https://github.com/triton-inference-server/client#grpc-options)
  client options
- Send input data (e.g. a jpeg image) directly to Triton in the [body of an HTTP
  request without any additional
  metadata](https://github.com/triton-inference-server/server/blob/r23.06/docs/protocol/extension_binary_data.md#raw-binary-request)
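The client libraries are a convenience, not a requirement; any HTTP client can speak the KServe-style protocol directly. A minimal sketch with `curl`, assuming a hypothetical model named `my_model` that takes a single FP32 input of shape [1, 4]:

```bash
# JSON inference request against the HTTP/REST endpoint
curl -X POST localhost:8000/v2/models/my_model/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4]
          }
        ]
      }'
```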
### Extend Triton

[Triton Inference Server's architecture](docs/user_guide/architecture.md) is
specifically designed for modularity and flexibility.

- [Customize Triton Inference Server
  container](docs/customization_guide/compose.md) for your use case
- [Create custom backends](https://github.com/triton-inference-server/backend)
  in either
  [C/C++](https://github.com/triton-inference-server/backend/blob/r23.06/README.md#triton-backend-api)
  or [Python](https://github.com/triton-inference-server/python_backend)
- Create [decoupled backends and models](docs/user_guide/decoupled_models.md)
  that can send multiple responses for a request or not send any responses
  for a request
- Use a [Triton repository agent](docs/customization_guide/repository_agents.md)
  to add functionality that operates when a model is loaded and unloaded, such
  as authentication, decryption, or conversion
- Deploy Triton on [Jetson and JetPack](docs/user_guide/jetson.md)
- [Use Triton on AWS
  Inferentia](https://github.com/triton-inference-server/python_backend/tree/r23.06/inferentia)
### Additional Documentation

- [FAQ](docs/user_guide/faq.md)
- [User Guide](docs/README.md#user-guide)
- [Customization Guide](docs/README.md#customization-guide)
- [Release
  Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
- [GPU, Driver, and CUDA Support
  Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)
## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that is not modifying the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).
## Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project.
When posting [issues in
GitHub](https://github.com/triton-inference-server/server/issues),
follow the process outlined in the [Stack Overflow
document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:

- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix them
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.
For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community
[GitHub
Discussions](https://github.com/triton-inference-server/server/discussions).

## For more information

Please refer to the [NVIDIA Developer Triton
page](https://developer.nvidia.com/nvidia-triton-inference-server)
for more information.
RELEASE.md (+106)

@@ -0,0 +1,106 @@
<!--
# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
# Release Notes for 2.35.0

## New Features and Improvements

* Support for the
  [KIND\_MODEL instance type](https://github.com/triton-inference-server/pytorch_backend/tree/r23.06#model-instance-group-kind)
  has been extended to the PyTorch backend.

* The gRPC clients can now indicate whether they want to receive the flags
  associated with each response. This can help the clients to
  [programmatically determine](https://github.com/triton-inference-server/server/blob/r23.06/docs/user_guide/decoupled_models.md#knowing-when-a-decoupled-inference-request-is-complete)
  when all the responses for a given request have been received on the client
  side for decoupled models.

* Added beta support for using
  [Redis](https://github.com/triton-inference-server/redis_cache/tree/r23.06) as
  a cache for inference requests.

* The
  [statistics extension](https://github.com/triton-inference-server/server/blob/r23.06/docs/protocol/extension_statistics.md)
  now includes the memory usage of the loaded models. These statistics are
  currently implemented only for the TensorRT and ONNX Runtime backends (see
  the sketch after this list for how to query them).

* Added support for batch inputs in ragged batching for the PyTorch backend.

* Added
  [serial sequences](https://github.com/triton-inference-server/client/blob/main/src/c%2B%2B/perf_analyzer/docs/cli.md#--serial-sequences)
  mode for Perf Analyzer.

* Refer to the 23.06 column of the
  [Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
  for container image versions on which the 23.06 inference server container is
  based.
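As referenced in the statistics entry above, a minimal sketch of querying the statistics extension over HTTP; `my_model` is a placeholder and the default HTTP port 8000 is assumed:

```bash
# Per-model statistics, which in 23.06 include memory usage for
# TensorRT and ONNX Runtime models
curl localhost:8000/v2/models/my_model/stats

# Statistics for all loaded models
curl localhost:8000/v2/models/stats
```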
## Known Issues

* The FasterTransformer backend build only works with Triton 23.04 and older
  releases.

* OpenVINO 2022.1 is used in the OpenVINO backend and the OpenVINO execution
  provider for the ONNX Runtime backend. OpenVINO 2022.1 is not officially
  supported on Ubuntu 22.04 and should be treated as beta.

* Some systems which implement `malloc()` may not release memory back to the
  operating system right away, causing a false memory leak. This can be
  mitigated by using a different malloc implementation. `tcmalloc` and
  `jemalloc` are installed in the Triton container and can be used by
  specifying the library in `LD_PRELOAD` (see the sketch after this list).

  We recommend experimenting with both `tcmalloc` and `jemalloc` to determine
  which one works better for your use case.

* Auto-complete may cause an increase in server start time. To avoid a start
  time increase, users can provide the full model configuration and launch the
  server with `--disable-auto-complete-config`.

* Auto-complete does not support PyTorch models due to lack of metadata in the
  model. It can only verify that the number of inputs and the input names
  match what is specified in the model configuration. There is no model
  metadata about the number of outputs and datatypes. Related PyTorch bug:
  https://github.com/pytorch/pytorch/issues/38273

* Triton Client PIP wheels for ARM SBSA are not available from PyPI and pip
  will install an incorrect Jetson version of the Triton Client library for
  Arm SBSA. The correct client wheel file can be pulled directly from the Arm
  SBSA SDK image and manually installed.

* Traced models in PyTorch seem to create overflows when int8 tensor values are
  transformed to int32 on the GPU. Refer to
  https://github.com/pytorch/pytorch/issues/66930 for more information.

* Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and
  A30).

* Triton metrics might not work if the host machine is running a separate DCGM
  agent on bare-metal or in a container.
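As referenced in the `malloc()` item above, a minimal sketch of preloading an alternative allocator when launching the server inside the Triton container; the library paths vary by distribution and container version, so treat them as assumptions to verify first:

```bash
# Use tcmalloc for the tritonserver process (verify the path inside the
# container first, e.g. `find / -name "libtcmalloc*" 2>/dev/null`)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 \
  tritonserver --model-repository=/models

# Or try jemalloc and compare memory behavior under load
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  tritonserver --model-repository=/models
```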
