-->

# Triton Inference Server

[License: BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause)

Triton Inference Server is an open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
Inference Server supports inference across cloud, data center, edge, and embedded
devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference
Server delivers optimized performance for many query types, including real-time,
batched, ensemble, and audio/video streaming. Triton Inference Server is part of
[NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/),
a software platform that accelerates the data science pipeline and streamlines
the development and deployment of production AI.

Major features include:

- [Supports multiple deep learning
  frameworks](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton)
- [Supports multiple machine learning
  frameworks](https://github.com/triton-inference-server/fil_backend)
- [Concurrent model
  execution](docs/user_guide/architecture.md#concurrent-model-execution)
- [Dynamic batching](docs/user_guide/model_configuration.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/model_configuration.md#sequence-batcher) and
  [implicit state management](docs/user_guide/architecture.md#implicit-state-management)
  for stateful models
- Provides a [Backend API](https://github.com/triton-inference-server/backend) that
  allows adding custom backends and pre/post-processing operations
- Supports writing custom backends in Python, a.k.a.
  [Python-based backends](https://github.com/triton-inference-server/backend/blob/r24.04/docs/python_based_backends.md#python-based-backends)
- Model pipelines using
  [Ensembling](docs/user_guide/architecture.md#ensemble-models) or [Business
  Logic Scripting
  (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- [HTTP/REST and gRPC inference
  protocols](docs/customization_guide/inference_protocols.md) based on the
  community-developed [KServe
  protocol](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2)
- A [C API](docs/customization_guide/inference_protocols.md#in-process-triton-server-api) and
  [Java API](docs/customization_guide/inference_protocols.md#java-bindings-for-in-process-triton-server-api)
  that allow Triton to link directly into your application for edge and other
  in-process use cases
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
  throughput, server latency, and more (see the example below)
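
For example, Triton serves these metrics in Prometheus text format, by default on
port 8002 of the server container. The following is only a minimal sketch of
scraping them from Python; the port and the `nv_inference_` metric-name prefix
are Triton defaults, so adjust them if your deployment overrides them.

```python
# Scrape Triton's Prometheus-format metrics endpoint (default port 8002).
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as response:
    text = response.read().decode()

# Print the inference-related counters, e.g. nv_inference_request_success.
for line in text.splitlines():
    if line.startswith("nv_inference_"):
        print(line)
```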

**New to Triton Inference Server?** Make use of
[these tutorials](https://github.com/triton-inference-server/tutorials)
to begin your Triton journey!

Join the [Triton and TensorRT community](https://www.nvidia.com/en-us/deep-learning-ai/triton-tensorrt-newsletter/) and
stay current on the latest product updates, bug fixes, content, best practices,
and more. Need enterprise support? NVIDIA global support is available for Triton
Inference Server with the
[NVIDIA AI Enterprise software suite](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/).

## Serve a Model in 3 Easy Steps

```bash
# Step 1: Create the example model repository
git clone -b r24.04 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch Triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.04-py3 tritonserver --model-repository=/models

# Step 3: Send an inference request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:24.04-py3-sdk
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT
```
Please read the [QuickStart](docs/getting_started/quickstart.md) guide for additional information
regarding this example. The QuickStart guide also contains an example of how to launch Triton on
[CPU-only systems](docs/getting_started/quickstart.md#run-on-cpu-only-system). New to Triton and
wondering where to get started? Watch the [Getting Started video](https://youtu.be/NQDtfSi5QF4).
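
The `image_client` example above is built on the Triton client libraries. A
roughly equivalent request can be sent from Python with the `tritonclient`
package. The sketch below assumes the quickstart setup: a server on
`localhost:8000` and the example `densenet_onnx` model, whose configuration names
its input `data_0` (FP32, shape `[3, 224, 224]`) and its output `fc6_1`; adjust
these names and shapes for other models.

```python
# pip install tritonclient[http] numpy
import numpy as np
import tritonclient.http as httpclient

# Connect to the HTTP endpoint of the server started in Step 2.
client = httpclient.InferenceServerClient(url="localhost:8000")

# densenet_onnx expects a single FP32 tensor named "data_0" of shape
# [3, 224, 224]; random data stands in for a real, preprocessed image here.
image = np.random.rand(3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("data_0", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

# Ask Triton to return the top-3 classifications for the "fc6_1" output.
outputs = [httpclient.InferRequestedOutput("fc6_1", class_count=3)]

result = client.infer(model_name="densenet_onnx", inputs=inputs, outputs=outputs)
print(result.as_numpy("fc6_1"))
```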

## Examples and Tutorials

Check out [NVIDIA LaunchPad](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/trial/)
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.

Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM,
are located on the
[NVIDIA Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples)
page on GitHub. The
[NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-triton-inference-server)
contains additional documentation, presentations, and examples.

## Documentation

### Build and Deploy

The recommended way to build and use Triton Inference Server is with Docker
images.

- [Install Triton Inference Server with Docker containers](docs/customization_guide/build.md#building-with-docker) (*Recommended*)
- [Install Triton Inference Server without Docker containers](docs/customization_guide/build.md#building-without-docker)
- [Build a custom Triton Inference Server Docker container](docs/customization_guide/compose.md)
- [Build Triton Inference Server from source](docs/customization_guide/build.md#building-on-unsupported-platforms)
- [Build Triton Inference Server for Windows 10](docs/customization_guide/build.md#building-for-windows-10)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on [GCP](deploy/gcp/README.md),
  [AWS](deploy/aws/README.md), and [NVIDIA FleetCommand](deploy/fleetcommand/README.md)
- [Secure Deployment Considerations](docs/customization_guide/deploy.md)

### Using Triton

#### Preparing Models for Triton Inference Server

The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/user_guide/model_repository.md). Depending on
the type of the model and on the Triton capabilities you want to enable for it,
you may need to create a [model
configuration](docs/user_guide/model_configuration.md) for the model.
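
Concretely, a model repository is a directory tree of the form
`<model-name>/<version>/<model-file>`, with an optional `config.pbtxt` next to
the version directories. The sketch below lays out such a tree from Python; the
model name, backend, tensor names, and dims are illustrative placeholders and
must match your actual model. For several backends Triton can also fill in much
of this configuration automatically; see the model configuration guide above.

```python
# A minimal sketch of a model repository layout, written out from Python.
# All names, dims, and the backend below are placeholders for illustration.
from pathlib import Path

repo = Path("model_repository")             # passed as --model-repository=...
version_dir = repo / "my_onnx_model" / "1"  # <model-name>/<version>/
version_dir.mkdir(parents=True, exist_ok=True)

# Copy your serialized model into the version directory, for example:
# shutil.copy("model.onnx", version_dir / "model.onnx")

config = """
name: "my_onnx_model"
backend: "onnxruntime"
max_batch_size: 8
input [
  { name: "INPUT0", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "OUTPUT0", data_type: TYPE_FP32, dims: [ 1000 ] }
]
dynamic_batching { }  # opt in to server-side request batching
"""
(repo / "my_onnx_model" / "config.pbtxt").write_text(config.lstrip())
```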

- [Add custom operations to Triton if needed by your model](docs/user_guide/custom_operations.md)
- Enable model pipelining with [Model Ensemble](docs/user_guide/architecture.md#ensemble-models)
  and [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend#business-logic-scripting)
- Optimize your models by setting [scheduling and batching](docs/user_guide/architecture.md#models-and-schedulers)
  parameters and [model instances](docs/user_guide/model_configuration.md#instance-groups)
- Use the [Model Analyzer tool](https://github.com/triton-inference-server/model_analyzer)
  to help optimize your model configuration with profiling
- Learn how to [explicitly manage what models are available by loading and
  unloading models](docs/user_guide/model_management.md)

#### Configure and Use Triton Inference Server

- Read the [Quick Start Guide](docs/getting_started/quickstart.md) to run Triton Inference
  Server on both GPU and CPU
- Triton supports multiple execution engines, called
  [backends](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton), including
  [TensorRT](https://github.com/triton-inference-server/tensorrt_backend),
  [TensorFlow](https://github.com/triton-inference-server/tensorflow_backend),
  [PyTorch](https://github.com/triton-inference-server/pytorch_backend),
  [ONNX](https://github.com/triton-inference-server/onnxruntime_backend),
  [OpenVINO](https://github.com/triton-inference-server/openvino_backend),
  [Python](https://github.com/triton-inference-server/python_backend), and more
- Not all the above backends are supported on every platform supported by Triton.
  Look at the
  [Backend-Platform Support Matrix](https://github.com/triton-inference-server/backend/blob/r24.04/docs/backend_platform_support_matrix.md)
  to learn which backends are supported on your target platform.
- Learn how to [optimize performance](docs/user_guide/optimization.md) using the
  [Performance Analyzer](https://github.com/triton-inference-server/client/blob/r24.04/src/c++/perf_analyzer/README.md)
  and
  [Model Analyzer](https://github.com/triton-inference-server/model_analyzer)
- Learn how to [manage loading and unloading models](docs/user_guide/model_management.md) in
  Triton (see the sketch after this list)
- Send requests directly to Triton with the [HTTP/REST JSON-based
  or gRPC protocols](docs/customization_guide/inference_protocols.md#httprest-and-grpc-protocols)
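
As a small illustration of explicit model management, when the server is started
with `--model-control-mode=explicit` the client libraries can load and unload
models at runtime. This is only a sketch, and the model name is a placeholder:

```python
# pip install tritonclient[http]
import tritonclient.http as httpclient

# Assumes tritonserver was started with --model-control-mode=explicit.
client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_onnx_model")              # load (or reload) the model
print(client.is_model_ready("my_onnx_model"))   # True once loading succeeds

for model in client.get_model_repository_index():  # list repository contents
    print(model)

client.unload_model("my_onnx_model")            # release the model
```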

#### Client Support and Examples

A Triton *client* application sends inference and other requests to Triton. The
[Python and C++ client libraries](https://github.com/triton-inference-server/client)
provide APIs to simplify this communication.
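
For instance, the Python package ships matching HTTP and gRPC clients. The short
sketch below checks server health and fetches model metadata over gRPC; the port
is Triton's default gRPC port, and the model name assumes the quickstart example.

```python
# pip install tritonclient[grpc]
import tritonclient.grpc as grpcclient

# 8001 is Triton's default gRPC port; 8000 serves HTTP/REST.
client = grpcclient.InferenceServerClient(url="localhost:8001")

print(client.is_server_live())    # liveness check
print(client.is_server_ready())   # readiness check

# Model metadata lists input/output names, datatypes, and shapes.
print(client.get_model_metadata("densenet_onnx"))
```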

- Review client examples for [C++](https://github.com/triton-inference-server/client/blob/r24.04/src/c%2B%2B/examples),
  [Python](https://github.com/triton-inference-server/client/blob/r24.04/src/python/examples),
  and [Java](https://github.com/triton-inference-server/client/blob/r24.04/src/java/src/main/java/triton/client/examples)
- Configure [HTTP](https://github.com/triton-inference-server/client#http-options)
  and [gRPC](https://github.com/triton-inference-server/client#grpc-options)
  client options
- Send input data (e.g. a JPEG image) directly to Triton in the [body of an HTTP
  request without any additional metadata](https://github.com/triton-inference-server/server/blob/r24.04/docs/protocol/extension_binary_data.md#raw-binary-request)

### Extend Triton

[Triton Inference Server's architecture](docs/user_guide/architecture.md) is specifically
designed for modularity and flexibility.

- [Customize the Triton Inference Server container](docs/customization_guide/compose.md) for your use case
- [Create custom backends](https://github.com/triton-inference-server/backend)
  in either [C/C++](https://github.com/triton-inference-server/backend/blob/r24.04/README.md#triton-backend-api)
  or [Python](https://github.com/triton-inference-server/python_backend); a minimal Python sketch follows this list
- Create [decoupled backends and models](docs/user_guide/decoupled_models.md) that can send
  multiple responses for a request, or no response at all
- Use a [Triton repository agent](docs/customization_guide/repository_agents.md) to add functionality
  that operates when a model is loaded and unloaded, such as authentication,
  decryption, or conversion
- Deploy Triton on [Jetson and JetPack](docs/user_guide/jetson.md)
- [Use Triton on AWS
  Inferentia](https://github.com/triton-inference-server/python_backend/tree/main/inferentia)
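
As a taste of the Python route, a Python-backend model is a `TritonPythonModel`
class saved as `model.py` inside the model's version directory. The sketch below
simply echoes its input tensor back; the tensor names `INPUT0`/`OUTPUT0` are
placeholders that must match the model's `config.pbtxt`.

```python
# model.py: a minimal Python backend sketch that echoes its input.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args carries the model name, parsed config, instance kind, etc.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Triton hands the backend a batch of requests; exactly one response
        # must be returned per request, in the same order.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            output0 = pb_utils.Tensor("OUTPUT0", input0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses

    def finalize(self):
        # Called once when the model is unloaded.
        pass
```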

### Additional Documentation

- [FAQ](docs/user_guide/faq.md)
- [User Guide](docs/README.md#user-guide)
- [Customization Guide](docs/README.md#customization-guide)
- [Release Notes](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html)
- [GPU, Driver, and CUDA Support
  Matrix](https://docs.nvidia.com/deeplearning/dgx/support-matrix/index.html)

## Contributing

Contributions to Triton Inference Server are more than welcome. To
contribute, please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, you should file a PR in the [contrib
repo](https://github.com/triton-inference-server/contrib).

## Reporting problems, asking questions

We appreciate any feedback, questions, or bug reports regarding this project.
When posting [issues in GitHub](https://github.com/triton-inference-server/server/issues),
follow the process outlined in the [Stack Overflow document](https://stackoverflow.com/help/mcve).
Ensure posted examples are:

- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems, the more time we have to
  fix them
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.

For issues, please use the provided bug report and feature request templates.

For questions, we recommend posting in our community
[GitHub Discussions](https://github.com/triton-inference-server/server/discussions).

## For more information

Please refer to the [NVIDIA Developer Triton page](https://developer.nvidia.com/nvidia-triton-inference-server)
for more information.