Add initial integration of iterative scheduling (#88) #89

Merged
merged 1 commit on Apr 24, 2024
135 changes: 135 additions & 0 deletions Conceptual_Guide/Part_7-iterative_scheduling/README.md
@@ -0,0 +1,135 @@
<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Deploying a GPT-2 Model using Python Backend and Iterative Scheduling

In this tutorial, we will deploy a GPT-2 model using the Python backend and
demonstrate the
[iterative scheduling](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#iterative-sequences)
feature.

## Prerequisites

Before getting started with this tutorial, make sure you're familiar
with the following concepts:

* [Triton-Server Quick Start](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
* [Python Backend](https://github.com/triton-inference-server/python_backend)

## Iterative Scheduling

Iterative scheduling is a technique that allows Triton Inference Server to
schedule the same request multiple times, feeding it back to the model with the
same input. This is useful for models that have an auto-regressive loop, such
as text generators that produce one token per step. Iterative scheduling
enables Triton Server to implement in-flight batching for your models and gives
you the ability to combine new sequences with in-flight sequences as they
arrive.
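
As a rough illustration (this is not the actual `model.py` shipped with this
tutorial), a Python backend model can take part in iterative scheduling by
enabling iterative sequences in its `config.pbtxt` (something like
`sequence_batching { iterative_sequences: true }` together with a decoupled
transaction policy) and then rescheduling each request until generation is
complete. The sketch below assumes the Python backend's request rescheduling
API; `generate_next_token` is a hypothetical helper and the tensor names are
illustrative only.

```
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()

            # Hypothetical helper: produce the next token for this sequence
            # and report whether generation has finished.
            token, finished = generate_next_token(request)

            output = pb_utils.Tensor(
                "text_output", np.array([token], dtype=np.object_)
            )

            if finished:
                # Last token: send the final response and let Triton release
                # the request normally.
                sender.send(
                    pb_utils.InferenceResponse(output_tensors=[output]),
                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
                )
            else:
                # Stream this token and ask Triton to reschedule the same
                # request so the next iteration can continue the sequence.
                sender.send(pb_utils.InferenceResponse(output_tensors=[output]))
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE
                )

        # Decoupled models return None; responses go through the sender.
        return None
```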

## Tutorial Overview

In this tutorial we deploy two models:

* simple-gpt2: This model receives a batch of requests and proceeds to the next
batch only when it has finished generating tokens for the current batch.

* iterative-gpt2: This model uses iterative scheduling to start processing new
sequences in a batch even while it is still generating tokens for previous
sequences.

### Demo

[![asciicast](https://asciinema.org/a/TUZtHwZsYrJzHuZF7XCOj1Avx.svg)](https://asciinema.org/a/TUZtHwZsYrJzHuZF7XCOj1Avx)

### Step 1: Prepare the Server Environment

* First, run the Triton Inference Server Container:

```
# Replace yy.mm with the year and month of the release. Please use the 24.04 release or later.
docker run --gpus=all --name iterative-scheduling -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
```

* Next, install all the dependencies required by the models running in the
Python backend and log in with your [Hugging Face token](https://huggingface.co/settings/tokens)
(a [Hugging Face](https://huggingface.co/) account is required).

```
pip install transformers[torch]
```
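
To log in, you can use the `huggingface-cli` tool, which is typically installed
alongside `transformers` (via `huggingface_hub`); it will prompt for the token
created above. This step is shown as an optional extra and the exact flow may
differ depending on your Hugging Face setup:

```
huggingface-cli login
```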

> [!NOTE]
> Optional: If you want to avoid installing the dependencies each time you run the
> container, you can run `docker commit iterative-scheduling iterative-scheduling-image` to save the container
> and use that for subsequent runs.

Then, start the server:

```
tritonserver --model-repository=/models
```
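
Once the server reports that the models are ready, you can optionally verify it
from another terminal using Triton's standard readiness endpoint (port 8000 is
already published by the `docker run` command above):

```
curl -v localhost:8000/v2/health/ready
```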

### Step 2: Install the client side dependencies

In another terminal install the client dependencies:

```
pip3 install tritonclient[grpc]
pip3 install tqdm
```

### Step 3: Run the client

The simple-gpt2 model doesn't use iterative scheduling and will proceed to the
next batch only when it is done generating tokens for the current batch.

Run the following command to start the client:

```
python3 client/client.py --model simple-gpt2
```

As you can see, the tokens for one request are generated to completion before
the server proceeds to the next request.

Press `Ctrl+C` to stop the client.

The iterative scheduler, in contrast, is able to incorporate new requests as
they arrive at the server.

Run the following command to start the client:

```
python3 client/client.py --model iterative-gpt2
```

As you can see, the tokens for both prompts are generated simultaneously.

## Next Steps

We plan to integrate a KV-cache with these models for better performance. Currently,
the main goal of this tutorial is to demonstrate how to use iterative scheduling with
the Python backend.
114 changes: 114 additions & 0 deletions Conceptual_Guide/Part_7-iterative_scheduling/client/client.py
@@ -0,0 +1,114 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import argparse
import threading
import time
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient
from print_utils import Display


def client1_callback(display, event, result, error):
    # Stream callback for the first prompt: advance the top progress bar on
    # every response and signal completion when the final response arrives.
    if error:
        raise error

    display.update_top()
    if result.get_response().parameters.get("triton_final_response").bool_param:
        event.set()


def client2_callback(display, event, result, error):
    # Stream callback for the second prompt: advance the bottom progress bar on
    # every response and signal completion when the final response arrives.
    if error:
        raise error

    display.update_bottom()
    if result.get_response().parameters.get("triton_final_response").bool_param:
        event.set()


def run_inferences(url, model_name, display, max_tokens):
    # Create clients
    client1 = grpcclient.InferenceServerClient(url)
    client2 = grpcclient.InferenceServerClient(url)

    inputs0 = []
    prompt1 = "Programming in C++ is like"
    inputs0.append(grpcclient.InferInput("text_input", [1, 1], "BYTES"))
    inputs0[0].set_data_from_numpy(np.array([[prompt1]], dtype=np.object_))

    prompt2 = "Programming in Assembly is like"
    inputs1 = []
    inputs1.append(grpcclient.InferInput("text_input", [1, 1], "BYTES"))
    inputs1[0].set_data_from_numpy(np.array([[prompt2]], dtype=np.object_))

    event1 = threading.Event()
    event2 = threading.Event()
    client1.start_stream(callback=partial(client1_callback, display, event1))
    client2.start_stream(callback=partial(client2_callback, display, event2))

    while True:
        # Reset the events
        event1.clear()
        event2.clear()

        # Setup the display initially with the prompts
        display.clear()
        parameters = {"ignore_eos": True, "max_tokens": max_tokens}

        client1.async_stream_infer(
            model_name=model_name,
            inputs=inputs0,
            enable_empty_final_response=True,
            parameters=parameters,
        )

        # Add a small delay so that the two requests are not sent at the same
        # time
        time.sleep(0.05)
        client2.async_stream_infer(
            model_name=model_name,
            inputs=inputs1,
            enable_empty_final_response=True,
            parameters=parameters,
        )

        event1.wait()
        event2.wait()
        time.sleep(2)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", type=str, default="localhost:8001")
    parser.add_argument("--model", type=str, default="simple-gpt2")
    parser.add_argument("--max-tokens", type=int, default=128)
    args = parser.parse_args()
    display = Display(args.max_tokens)

    run_inferences(args.url, args.model, display, args.max_tokens)
46 changes: 46 additions & 0 deletions Conceptual_Guide/Part_7-iterative_scheduling/client/print_utils.py
@@ -0,0 +1,46 @@
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of NVIDIA CORPORATION nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

from tqdm import tqdm


class Display:
    """Two tqdm progress bars (one per concurrent prompt) that visualize how
    many tokens have been generated for each request."""

    def __init__(self, max_tokens) -> None:
        self._top = tqdm(position=0, total=max_tokens, miniters=1)
        self._bottom = tqdm(position=1, total=max_tokens, miniters=1)
        self._max_tokens = max_tokens

    def update_top(self):
        self._top.update(1)
        self._top.refresh()

    def update_bottom(self):
        self._bottom.update(1)
        self._bottom.refresh()

    def clear(self):
        self._top.reset()
        self._bottom.reset()
8 changes: 8 additions & 0 deletions Conceptual_Guide/Part_7-iterative_scheduling/input_data.json
@@ -0,0 +1,8 @@
{
  "data":
  [
    {
      "input": ["machine learning is"]
    }
  ]
}