
Commit cd37327

Update README and add RELEASE notes for 23.05 (#5876)
1 parent 02700fa commit cd37327

File tree

2 files changed: +332 -3 lines changed

README.md

+221 -3
@@ -27,7 +27,225 @@
-->

# Triton Inference Server

----

Triton Inference Server is an open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
supports inference across cloud, data center, edge, and embedded devices on
NVIDIA GPUs, x86 and ARM CPU, or AWS Inferentia. Triton delivers optimized
performance for many query types, including real-time, batched, ensemble, and
audio/video streaming.

Major features include:

- [Supports multiple deep learning
  frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
  execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
  [implicit state management](docs/user_guide/architecture.md#implicit-state-management)
  for stateful models
- Provides a [Backend API](https://github.com/triton-inference-server/backend) that
  allows adding custom backends and pre/post processing operations
- Model pipelines using
  [Ensembling](docs/user_guide/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- [HTTP/REST and GRPC inference
  protocols](docs/customization_guide/inference_protocols.md) based on the community
  developed [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
  (a curl sketch of the protocol follows this list)
- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
  [Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
  allow Triton to link directly into your application for edge and other in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
  throughput, server latency, and more

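The HTTP/REST protocol can be exercised with nothing more than `curl`. The
following is a minimal sketch that assumes a server is already running on the
default HTTP port 8000 and serving the `densenet_onnx` example model.

```bash
# Check that the server is alive and ready to serve inference requests
curl -v localhost:8000/v2/health/live
curl -v localhost:8000/v2/health/ready

# Retrieve server metadata and the metadata of a specific model
curl localhost:8000/v2
curl localhost:8000/v2/models/densenet_onnx
```
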
**New to Triton Inference Server?** Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials)
to begin your Triton journey!

Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and
stay current on the latest product updates, bug fixes, content, best practices,
and more. Need enterprise support? NVIDIA global support is available for Triton
Inference Server with the
[NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).

## Serve a Model in 3 Easy Steps

```bash
# Step 1: Create the example model repository
git clone -b r23.05 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch Triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.05-py3 tritonserver --model-repository=/models

# Step 3: Send an inference request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.05-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
15.346230 (504) = COFFEE MUG
13.224326 (968) = CUP
10.422965 (505) = COFFEEPOT
```

Please read the [QuickStart](docs/getting_started/quickstart.md) guide for additional information
regarding this example. The quickstart guide also contains an example of how to launch Triton on
[CPU-only systems](docs/getting_started/quickstart.md#run-on-cpu-only-system). New to Triton and
wondering where to get started? Watch the [Getting Started video](https://youtu.be/NQDtfSi5QF4).

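For reference, launching Triton on a machine without GPUs is the same `docker run`
command with the `--gpus` flag removed; this is only a sketch, and the quickstart's
CPU-only section remains the authoritative reference.

```bash
# Launch Triton without GPUs; GPU-only models will fail to load, but CPU-capable
# models (for example the densenet_onnx example above) can still be served.
docker run --rm --net=host -v ${PWD}/model_repository:/models \
  nvcr.io/nvidia/tritonserver:23.05-py3 \
  tritonserver --model-repository=/models
```
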
## Examples and Tutorials

Check out [NVIDIA LaunchPad](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/trial/)
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM
are located in the
[NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
page on GitHub. The
[NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-triton-inference-server)
contains additional documentation, presentations, and examples.

## Documentation

### Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker
images.

- [Install Triton Inference Server with Docker containers](docs/customization_guide/build.md#building-with-docker) (*Recommended*)
- [Install Triton Inference Server without Docker containers](docs/customization_guide/build.md#building-without-docker)
- [Build a custom Triton Inference Server Docker container](docs/customization_guide/compose.md)
- [Build Triton Inference Server from source](docs/customization_guide/build.md#building-on-unsupported-platforms)
- [Build Triton Inference Server for Windows 10](docs/customization_guide/build.md#building-for-windows-10)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on [GCP](deploy/gcp/README.md),
  [AWS](deploy/aws/README.md), and [NVIDIA FleetCommand](deploy/fleetcommand/README.md)

### Using Triton

#### Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/user_guide/model_repository.md). Depending on
the type of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/user_guide/model_configuration.md) for the model.

- [Add custom operations to Triton if needed by your model](docs/user_guide/custom_operations.md)
- Enable model pipelining with [Model Ensemble](docs/user_guide/architecture.md#ensemble-models)
  and [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- Optimize your models by setting [scheduling and batching](docs/user_guide/architecture.md#models-and-schedulers)
  parameters and [model instances](docs/user_guide/model_configuration.md#instance-groups)
  (a configuration sketch follows this list)
- Use the [Model Analyzer tool](https://github.com/triton-inference-server/model_analyzer)
  to help optimize your model configuration with profiling
- Learn how to [explicitly manage what models are available by loading and
  unloading models](docs/user_guide/model_management.md)

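The sketch below ties these pieces together for a hypothetical ONNX model named
`my_model`: it lays out the repository directories and writes a `config.pbtxt`
that enables dynamic batching and two GPU instances. The tensor names, shapes,
and datatypes are illustrative assumptions and must match your actual model.

```bash
# Hypothetical repository layout:
#   model_repository/
#   └── my_model/
#       ├── config.pbtxt
#       └── 1/
#           └── model.onnx
mkdir -p model_repository/my_model/1
cp my_model.onnx model_repository/my_model/1/model.onnx

cat > model_repository/my_model/config.pbtxt <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
input [
  { name: "INPUT0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1000 ] }
]
# Let Triton combine individual requests into server-side batches
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
# Run two copies of the model on GPU 0
instance_group [
  { count: 2, kind: KIND_GPU, gpus: [ 0 ] }
]
EOF
```
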
#### Configure and Use Triton Inference Server

- Read the [Quick Start Guide](docs/getting_started/quickstart.md) to run Triton Inference
  Server on both GPU and CPU
- Triton supports multiple execution engines, called
  [backends](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton), including
  [TensorRT](https://github.com/triton-inference-server/tensorrt_backend),
  [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend),
  [PyTorch](https://github.com/triton-inference-server/pytorch_backend),
  [ONNX](https://github.com/triton-inference-server/onnxruntime_backend),
  [OpenVINO](https://github.com/triton-inference-server/openvino_backend),
  [Python](https://github.com/triton-inference-server/python_backend), and more
- Not all of the above backends are supported on every platform supported by Triton.
  Look at the
  [Backend-Platform Support Matrix](https://github.com/triton-inference-server/backend/blob/r23.05/docs/backend_platform_support_matrix.md)
  to learn which backends are supported on your target platform.
- Learn how to [optimize performance](docs/user_guide/optimization.md) using the
  [Performance Analyzer](https://github.com/triton-inference-server/client/blob/r23.05/src/c++/perf_analyzer/README.md)
  and
  [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
  (a Performance Analyzer sketch follows this list)
- Learn how to [manage loading and unloading models](docs/user_guide/model_management.md) in
  Triton
- Send requests directly to Triton with the [HTTP/REST JSON-based
  or gRPC protocols](docs/customization_guide/inference_protocols.md#httprest-and-grpc-protocols)

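As a starting point for the performance tooling mentioned above, Performance
Analyzer ships in the SDK container and can sweep request concurrency against a
running server. This sketch assumes the `densenet_onnx` example model from the
quickstart is being served on localhost.

```bash
# Run Performance Analyzer from the Triton SDK container against a local server,
# sweeping concurrency from 1 to 4 and reporting throughput and latency.
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:23.05-py3-sdk \
  perf_analyzer -m densenet_onnx --concurrency-range 1:4
```
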
#### Client Support and Examples

A Triton *client* application sends inference and other requests to Triton. The
[Python and C++ client libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication.

- Review client examples for [C++](https://github.com/triton-inference-server/client/tree/r23.05/src/c%2B%2B/examples),
  [Python](https://github.com/triton-inference-server/client/tree/r23.05/src/python/examples),
  and [Java](https://github.com/triton-inference-server/client/tree/r23.05/src/java/src/main/java/triton/client)
- Configure [HTTP](https://github.com/triton-inference-server/client#http-options)
  and [gRPC](https://github.com/triton-inference-server/client#grpc-options)
  client options
- Send input data (e.g. a jpeg image) directly to Triton in the [body of an HTTP
  request without any additional metadata](https://github.com/triton-inference-server/server/blob/r23.05/docs/protocol/extension_binary_data.md#raw-binary-request)
  (a plain-HTTP example follows this list)

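If you prefer not to use the client libraries at all, an inference request can be
posted directly to the KServe-style HTTP endpoint. The sketch below assumes a
hypothetical model named `simple` with a single INT32 input of shape [1, 16];
real models will have different tensor names, shapes, and datatypes.

```bash
# POST a JSON inference request to the default HTTP port (8000). The "inputs"
# entries must match the names, shapes, and datatypes in the model configuration.
curl -X POST localhost:8000/v2/models/simple/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {
            "name": "INPUT0",
            "shape": [1, 16],
            "datatype": "INT32",
            "data": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
          }
        ]
      }'
```
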
191+
### Extend Triton
192+
193+
[Triton Inference Server's architecture](docs/user_guide/architecture.md) is specifically
194+
designed for modularity and flexibility
195+
196+
- [Customize Triton Inference Server container](docs/customization_guide/compose.md) for your use case
197+
- [Create custom backends](https://github.com/triton-inference-server/backend)
198+
in either [C/C++](https://github.com/triton-inference-server/backend/blob/r23.05/README.md#triton-backend-api)
199+
or [Python](https://github.com/triton-inference-server/python_backend)
200+
- Create [decouple backends and models](docs/user_guide/decoupled_models.md) that can send
201+
multiple responses for a request or not send any responses for a request
202+
- Use a [Triton repository agent](docs/customization_guide/repository_agents.md) to add functionality
203+
that operates when a model is loaded and unloaded, such as authentication,
204+
decryption, or conversion
205+
- Deploy Triton on [Jetson and JetPack](docs/user_guide/jetson.md)
206+
- [Use Triton on AWS
207+
Inferentia](https://github.com/triton-inference-server/python_backend/tree/r23.05/inferentia)
208+
209+
### Additional Documentation

- [FAQ](docs/user_guide/faq.md)
- [User Guide](docs/README.md#user-guide)
- [Customization Guide](docs/README.md#customization-guide)
- [Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
- [GPU, Driver, and CUDA Support
  Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example or similar contribution that is not modifying the core of
Triton, then you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions or bug reporting regarding this project.
When posting [issues in GitHub](https://github.com/triton-inference-server/server/issues),
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:

- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix it
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community
[GitHub Discussions](https://github.com/triton-inference-server/server/discussions).

## For more information

Please refer to the [NVIDIA Developer Triton page](https://developer.nvidia.com/nvidia-triton-inference-server)
for more information.

RELEASE.md

+111
@@ -0,0 +1,111 @@
1+
<!--
2+
# Copyright 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
#
4+
# Redistribution and use in source and binary forms, with or without
5+
# modification, are permitted provided that the following conditions
6+
# are met:
7+
# * Redistributions of source code must retain the above copyright
8+
# notice, this list of conditions and the following disclaimer.
9+
# * Redistributions in binary form must reproduce the above copyright
10+
# notice, this list of conditions and the following disclaimer in the
11+
# documentation and/or other materials provided with the distribution.
12+
# * Neither the name of NVIDIA CORPORATION nor the names of its
13+
# contributors may be used to endorse or promote products derived
14+
# from this software without specific prior written permission.
15+
#
16+
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
17+
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
18+
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
19+
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
20+
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
21+
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
22+
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
23+
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
24+
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
25+
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
26+
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
27+
-->
28+
29+
# Release Notes for 2.34.0

## New Features and Improvements

* Python backend supports
  [Custom Metrics](https://github.com/triton-inference-server/python_backend/tree/r23.05#custom-metrics)
  allowing users to define and report counters and gauges similar to the
  [C API](https://github.com/triton-inference-server/server/blob/r23.05/docs/user_guide/metrics.md#custom-metrics).

* Python Triton Client defines the
  [Triton Client Plugin API](https://github.com/triton-inference-server/client/tree/r23.05#python-client-plugin-api-beta)
  allowing users to register custom plugins to add or modify request headers.
  This feature is in beta and is subject to change in future releases.

* Improved performance of model instance creation/removal. When the model
  instance group is the only model configuration change, Triton will update the
  model with the number of instances needed rather than reloading the model.
  This feature is limited to non-sequence models only. Read more about this
  feature
  [here](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_management.md#modifying-the-model-repository)
  in bullet point four.

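  A sketch of how this might look in practice, assuming a hypothetical model
  `my_model` and a server started with `--model-control-mode=explicit`:

  ```bash
  # Edit only the instance_group count in config.pbtxt, then ask Triton to
  # (re)load the model. With 23.05 the instances are adjusted in place rather
  # than the whole model being reloaded.
  curl -X POST localhost:8000/v2/repository/models/my_model/load
  ```
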
* Added new command line option
  [`--metrics-address=<address>`](https://github.com/triton-inference-server/server/blob/r23.05/docs/user_guide/metrics.md#metrics)
  allowing the metrics server to bind to a different address than the default
  `0.0.0.0`.

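  For example, assuming the default metrics port of 8002:

  ```bash
  # Expose the Prometheus metrics endpoint on localhost only instead of 0.0.0.0
  tritonserver --model-repository=/models --metrics-address=127.0.0.1

  # Metrics can then be scraped from the default metrics port (8002)
  curl 127.0.0.1:8002/metrics
  ```
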
* Reduced the default number of model load threads from 2*(number of CPU cores)
  to 4. This eliminates Triton hitting resource limits on systems with large CPU
  core counts. Use the `--model-load-thread-count` command line option to change
  this default.

* Added support for the
  [DLPack Python specification](https://dmlc.github.io/dlpack/latest/python_spec.html)
  in the
  [Python backend](https://github.com/triton-inference-server/python_backend#pb_utilstensorfrom_dlpack---tensor).

* Refer to the 23.05 column of the
  [Frameworks Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)
  for container image versions on which the 23.05 inference server container is
  based.

72+
73+
* Tensorflow backend no longer supports TensorFlow version 1.
74+
75+
* OpenVINO 2022.1 is used in the OpenVINO backend and the OpenVINO execution
76+
provider for the Onnxruntime Backend. OpenVINO 2022.1 is not officially
77+
supported on Ubuntu 22.04 and should be treated as beta.
78+
79+
* Some systems which implement `malloc()` may not release memory back to the
  operating system right away, causing a false memory leak. This can be mitigated
  by using a different malloc implementation. `tcmalloc` and `jemalloc` are
  installed in the Triton container and can be
  [used by specifying the library in LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r22.12/docs/user_guide/model_management.md).
  We recommend experimenting with both `tcmalloc` and `jemalloc` to determine which
  one works better for your use case.

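  A sketch of switching the allocator with `LD_PRELOAD`; the library paths below
  are typical for the Ubuntu-based Triton container but may differ on your system.

  ```bash
  # Run Triton with tcmalloc ...
  LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} \
    tritonserver --model-repository=/models

  # ... or with jemalloc
  LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libjemalloc.so:${LD_PRELOAD} \
    tritonserver --model-repository=/models
  ```
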
* Auto-complete may cause an increase in server start time. To avoid a start
  time increase, users can provide the full model configuration and launch the
  server with `--disable-auto-complete-config`.

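  For example, when every model in the repository already ships a complete
  `config.pbtxt`:

  ```bash
  # Skip auto-completion of model configurations to reduce server start time
  tritonserver --model-repository=/models --disable-auto-complete-config
  ```
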
* Auto-complete does not support PyTorch models due to lack of metadata in the
  model. It can only verify that the number of inputs and the input names
  match what is specified in the model configuration. There is no model
  metadata about the number of outputs and datatypes. Related PyTorch bug:
  https://github.com/pytorch/pytorch/issues/38273.

* Triton Client PIP wheels for ARM SBSA are not available from PyPI, and pip will
  install an incorrect Jetson version of the Triton Client library for Arm SBSA.
  The correct client wheel file can be pulled directly from the Arm SBSA SDK
  image and manually installed.

* Traced models in PyTorch seem to create overflows when int8 tensor values are
  transformed to int32 on the GPU. Refer to
  https://github.com/pytorch/pytorch/issues/66930 for more information.

* Triton cannot retrieve GPU metrics with MIG-enabled GPU devices (A100 and A30).

* Triton metrics might not work if the host machine is running a separate DCGM
  agent on bare metal or in a container.
