This example presents a multi-instance deployment of the ResNet50 PyTorch model. The model is deployed multiple times, which improves throughput when the GPU is underutilized. By default, the model is deployed twice on the same GPU.
The example consists of the following scripts:

- `install.sh` - installs additional dependencies for downloading the model from HuggingFace
- `server.py` - starts the model with Triton Inference Server
- `client.sh` - executes Perf Analyzer to measure the performance
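At its core, `server.py` exposes an inference callable that PyTriton binds to Triton Inference Server; binding the same callable more than once is what produces multiple model instances. The sketch below illustrates the shape of such a callable with a numpy placeholder standing in for the real ResNet50; the function name, tensor names, and shapes are illustrative assumptions, not taken from the example, and the binding itself is shown only in comments:

```python
import numpy as np

# Hypothetical inference callable in the form PyTriton expects:
# numpy arrays in, a dict of named numpy arrays out.
def infer_fn(image: np.ndarray) -> dict:
    # Placeholder for ResNet50: emit zero logits for 1000 ImageNet classes.
    batch_size = image.shape[0]
    logits = np.zeros((batch_size, 1000), dtype=np.float32)
    return {"logits": logits}

# Multi-instance deployment (sketch, assuming PyTriton's Triton.bind API):
# passing a list with the callable repeated asks PyTriton to run two
# copies of the model, e.g.:
#
#   with Triton() as triton:
#       triton.bind(
#           model_name="ResNet50",
#           infer_func=[infer_fn, infer_fn],  # two instances on the same GPU
#           inputs=[Tensor(name="image", dtype=np.float32, shape=(3, 224, 224))],
#           outputs=[Tensor(name="logits", dtype=np.float32, shape=(1000,))],
#       )
#       triton.serve()
```

With two instances, Triton can execute two inference requests concurrently, which is what recovers throughput when a single instance leaves the GPU underutilized.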
The example requires the `torch` package. It can be installed in your current environment using pip:

```shell
pip install torch
```
Or you can use the NVIDIA PyTorch container:

```shell
docker run -it --gpus 1 --shm-size 8gb -v {repository_path}:{repository_path} -w {repository_path} nvcr.io/nvidia/pytorch:24.10-py3 bash
```
If you choose to use the container, we recommend installing the NVIDIA Container Toolkit.
The step-by-step guide:

- Install PyTriton following the installation instructions.
- Install the additional packages using `install.sh`:

  ```shell
  ./install.sh
  ```

- In the current terminal, start the model on Triton using `server.py`:

  ```shell
  ./server.py
  ```

- Open a new terminal tab (e.g. `Ctrl + T` on Ubuntu) or window.
- Go to the example directory.
- Run `client.sh` to run the performance measurement on the model:

  ```shell
  ./client.sh
  ```