XSched Integration for NVIDIA Triton Inference Server

This example demonstrates how XSched can be integrated into the NVIDIA Triton Inference Server to enable priority-based scheduling of multi-model inference tasks.

Basic Idea

Triton Server lets users set a priority for each served model. We modify the TensorRT backend of Triton Server to create an XQueue for each CUDA stream; each XQueue inherits the priority of the model that uses the stream. We then schedule these XQueues with the local scheduler under the highest-priority-first (HPF) policy. With XSched, inference tasks of models on higher-priority XQueues can preempt those on lower-priority XQueues (Triton does not currently support task preemption), which can significantly reduce the latency of the high-priority tasks.

Usage

Apply Integration Patch

# commit id: 7f94a8ee1daab23046ef4d689bd56411101f207c
git clone https://github.com/triton-inference-server/tensorrt_backend -b r22.06
cd tensorrt_backend
git apply <xsched_dir>/integration/triton/trt-xsched-2206.patch
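
Before applying the patch, it can help to confirm that it applies cleanly to the checkout. The sketch below relies on `git apply --check`, which validates a patch without touching the working tree. `check_patch` is a hypothetical helper name used only for illustration; it is not part of XSched or Triton.

```shell
# Hypothetical helper: report whether a patch applies cleanly to a repository.
# Usage: check_patch <repo_dir> <patch_file>
# Note: git -C changes directory first, so pass an absolute patch path
# (or one relative to <repo_dir>).
check_patch() {
  repo_dir="$1"
  patch_file="$2"
  # --check validates the patch without modifying any files
  if git -C "$repo_dir" apply --check "$patch_file" 2>/dev/null; then
    echo "patch applies cleanly"
  else
    echo "patch does not apply" >&2
    return 1
  fi
}
```

Running `git rev-parse HEAD` inside tensorrt_backend shows the checked-out commit, which can be compared with the commit id noted above before calling `check_patch`.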

Build XSched

# Set up the Triton Server container and mount the xsched and tensorrt_backend directories (replace `<xsched_dir>` and `<tensorrt_backend_dir>` with your own locations)
docker run --privileged -itd --name xsched-triton-server --gpus all --net=host -v <xsched_dir>:/xsched -v <tensorrt_backend_dir>:/root/tensorrt_backend nvcr.io/nvidia/tritonserver:22.06-py3 bash
docker exec -it xsched-triton-server bash

# Install Miniconda in the container
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh

source ~/miniconda3/bin/activate
conda init --all

# Install dependencies: CMake 3.17+ and rapidjson are required
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main
conda tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r
conda install -y cmake rapidjson

# Build XSched
cd /xsched
make cuda

Build TensorRT-Backend

docker exec -it xsched-triton-server bash
source ~/miniconda3/bin/activate

# Build TensorRT-Backend
cd /root/tensorrt_backend

mkdir -p build
cd build
cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_COMMON_REPO_TAG=r22.06 -DTRITON_CORE_REPO_TAG=r22.06 -DTRITON_BACKEND_REPO_TAG=r22.06 -DXSched_DIR=/xsched/output/lib/cmake/XSched ..
make -j$(nproc)
make install

Run Triton Server

In config.pbtxt, you can set the priority of each model.

optimization {
  priority: PRIORITY_MAX
}

or,

optimization {
  priority: PRIORITY_MIN
}
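
With several models in one repository, it is easy to lose track of which priority each one has. The following sketch scans a model repository and prints the priority found in each config.pbtxt. It assumes the usual layout of one subdirectory per model, and it reports `PRIORITY_DEFAULT` (Triton's default value) when no priority is set. `list_priorities` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<model> <priority>" for every model in a repository.
# Usage: list_priorities <model_repo_dir>
list_priorities() {
  repo="$1"
  for cfg in "$repo"/*/config.pbtxt; do
    # Skip the unexpanded glob when the repository has no configs
    [ -f "$cfg" ] || continue
    model=$(basename "$(dirname "$cfg")")
    # Take the first PRIORITY_* token; empty when the config sets none
    prio=$(grep -o 'PRIORITY_[A-Z]*' "$cfg" | head -n 1 || true)
    echo "$model ${prio:-PRIORITY_DEFAULT}"
  done
}
```

Calling `list_priorities <model_repo_dir>` before starting the server gives a quick view of which models XSched is expected to favor.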

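Then start the server (the relative `--backend-directory` path below assumes you are still in the tensorrt_backend build directory):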

export LD_LIBRARY_PATH=/xsched/output/lib:$LD_LIBRARY_PATH
export XSCHED_POLICY=HPF
tritonserver --backend-directory ./install/backends --model-repository=<model_repo_dir> --strict-model-config false
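
Once the server is starting up, a readiness probe can confirm that the models have loaded before any requests are sent. The sketch below polls Triton's HTTP readiness endpoint, `/v2/health/ready`, and assumes the default HTTP port 8000 on localhost. `wait_ready` is a hypothetical helper name; adjust the URL if the server runs with a different port or host.

```shell
# Hypothetical helper: poll the HTTP readiness endpoint until it succeeds.
# Usage: wait_ready [url] [max_attempts]
wait_ready() {
  url="${1:-http://localhost:8000/v2/health/ready}"
  max_attempts="${2:-30}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s silences progress output
    if curl -sf -o /dev/null "$url"; then
      echo "server ready"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep 1
  done
  echo "server not ready after $max_attempts attempts" >&2
  return 1
}
```

Run `wait_ready` from a second shell in the container (for example via `docker exec -it xsched-triton-server bash`) while `tritonserver` runs in the first.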

Example

See the inference serving example for concrete details.
