triton-inference-server · Apr 24, 2024
diff --git a/Conceptual_Guide/Part_7-iterative_scheduling/README.md b/Conceptual_Guide/Part_7-iterative_scheduling/README.md
Lines changed: 135 additions & 0 deletions
diff --git a/Conceptual_Guide/Part_7-iterative_scheduling/client/client.py b/Conceptual_Guide/Part_7-iterative_scheduling/client/client.py
Lines changed: 114 additions & 0 deletions
diff --git a/Conceptual_Guide/Part_7-iterative_scheduling/client/print_utils.py b/Conceptual_Guide/Part_7-iterative_scheduling/client/print_utils.py
Lines changed: 46 additions & 0 deletions
diff --git a/Conceptual_Guide/Part_7-iterative_scheduling/input_data.json b/Conceptual_Guide/Part_7-iterative_scheduling/input_data.json
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,135 @@
+<!--
+# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+-->
+
+# Deploying a GPT-2 Model using Python Backend and Iterative Scheduling
+
+In this tutorial, we will deploy a GPT-2 model using the Python backend and
+demonstrate the
+[iterative scheduling](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html#iterative-sequences)
+feature.
+
+## Prerequisites
+
+Before getting started with this tutorial, make sure you're familiar
+with the following concepts:
+
+* [Triton-Server Quick Start](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html)
+* [Python Backend](https://github.com/triton-inference-server/python_backend)
+
+## Iterative Scheduling
+
+Iterative scheduling is a technique that allows the Triton Inference Server to
+schedule the same request multiple times with the same input, so that a single
+request can be processed over several iterations. This is useful for models
+that have an auto-regressive loop, such as token-by-token text generation.
+Iterative scheduling enables Triton Server to implement in-flight batching for
+your models: new sequences can be combined with in-flight sequences as they
+arrive.
+
+## Tutorial Overview
+
+In this tutorial we deploy two models:
+
+* simple-gpt2: This model receives a batch of requests and proceeds to the next
+batch only when it is done generating tokens for the current batch.
+
+* iterative-gpt2: This model uses iterative scheduling to start processing
+new sequences in a batch while it is still generating tokens for the
+previous sequences.
+
+### Demo
+
+[![asciicast](https://asciinema.org/a/TUZtHwZsYrJzHuZF7XCOj1Avx.svg)](https://asciinema.org/a/TUZtHwZsYrJzHuZF7XCOj1Avx)
+
+### Step 1: Prepare the Server Environment
+
+* First, run the Triton Inference Server Container:
+
+```
+# Replace yy.mm with the year and month of the release. Use release 24.04 or later.
+docker run --gpus=all --name iterative-scheduling -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}:/workspace/ -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:yy.mm-py3 bash
+```
+
+* Next, install all the dependencies required by the models running in the
+Python backend and log in with your [Hugging Face token](https://huggingface.co/settings/tokens)
+(an account on [Hugging Face](https://huggingface.co/) is required).
+
+```
+pip install transformers[torch]
+```
+
+> [!NOTE]
+> Optional: If you want to avoid installing the dependencies each time you run the
+> container, you can run `docker commit iterative-scheduling iterative-scheduling-image` to save the container
+> and use that for subsequent runs.
+
+Then, start the server:
+
+```
+tritonserver --model-repository=/models
+```
+
+### Step 2: Install the client side dependencies
+
+In another terminal, install the client dependencies:
+
+```
+pip3 install tritonclient[grpc]
+pip3 install tqdm
+```
+
+### Step 3: Run the client
+
+The simple-gpt2 model doesn't use iterative scheduling and will proceed to the
+next batch only when it is done generating tokens for the current batch.
+
+Run the following command to start the client:
+
+```
+python3 client/client.py --model simple-gpt2
+```
+
+As you can see, all the tokens for one request are generated before the server
+proceeds to the next request.
+
+Press `Ctrl+C` to stop the client.
+
+The iterative scheduler is able to incorporate new requests as they arrive at
+the server.
+
+Run the following command to start the client:
+
+```
+python3 client/client.py --model iterative-gpt2
+```
+
+As you can see, the tokens for both prompts are generated concurrently.
+
+## Next Steps
+
+We plan to integrate KV-Cache with these models for better performance. Currently,
+the main goal of this tutorial is to demonstrate how to use iterative scheduling with
+the Python backend.
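The contrast between `simple-gpt2` and `iterative-gpt2` described in the README can be illustrated with a toy simulation. This is only a sketch of the scheduling idea, not Triton's scheduler: the function `simulate`, its parameters, and the assumption that every request needs the same number of decoding steps are all hypothetical and chosen for illustration.

```python
from collections import deque


def simulate(arrivals, tokens_per_request, iterative):
    """Toy model of the two scheduling styles (not Triton's implementation).

    arrivals maps a request id to the decoding step at which it arrives.
    Returns one list per decoding step with the ids served in that step.
    """
    if tokens_per_request < 1:
        raise ValueError("tokens_per_request must be at least 1")

    pending = deque(sorted(arrivals, key=arrivals.get))
    remaining = {}  # request id -> tokens still to generate
    timeline = []
    step = 0
    while pending or remaining:
        # Static batching admits new requests only when the current batch is
        # finished; iterative scheduling admits them before every step.
        if iterative or not remaining:
            while pending and arrivals[pending[0]] <= step:
                remaining[pending.popleft()] = tokens_per_request

        active = list(remaining)
        timeline.append(active)
        for request_id in active:
            remaining[request_id] -= 1  # one token generated this step
            if remaining[request_id] == 0:
                del remaining[request_id]
        step += 1
    return timeline


# Request "B" arrives one step after request "A", as in the client's delay.
arrivals = {"A": 0, "B": 1}
static_timeline = simulate(arrivals, tokens_per_request=3, iterative=False)
iterative_timeline = simulate(arrivals, tokens_per_request=3, iterative=True)

print("static:   ", static_timeline)
print("iterative:", iterative_timeline)
```

In the static case request `B` waits until `A` has produced all of its tokens, so the run takes six steps. In the iterative case `B` joins at the second step and the run takes four.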
@@ -0,0 +1,114 @@
+# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+import argparse
+import threading
+import time
+from functools import partial
+
+import numpy as np
+import tritonclient.grpc as grpcclient
+from print_utils import Display
+
+
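+def client1_callback(display, event, result, error):
+    """Advance the top progress bar once per streamed response."""
+    if error:
+        raise error
+
+    display.update_top()
+    # The last response of a request carries triton_final_response=True.
+    if result.get_response().parameters.get("triton_final_response").bool_param:
+        event.set()
+
+
+def client2_callback(display, event, result, error):
+    """Advance the bottom progress bar once per streamed response."""
+    if error:
+        raise error
+
+    display.update_bottom()
+    # The last response of a request carries triton_final_response=True.
+    if result.get_response().parameters.get("triton_final_response").bool_param:
+        event.set()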
+
+
+def run_inferences(url, model_name, display, max_tokens):
+    # Create clients
+    client1 = grpcclient.InferenceServerClient(url)
+    client2 = grpcclient.InferenceServerClient(url)
+
+    inputs0 = []
+    prompt1 = "Programming in C++ is like"
+    inputs0.append(grpcclient.InferInput("text_input", [1, 1], "BYTES"))
+    inputs0[0].set_data_from_numpy(np.array([[prompt1]], dtype=np.object_))
+
+    prompt2 = "Programming in Assembly is like"
+    inputs1 = []
+    inputs1.append(grpcclient.InferInput("text_input", [1, 1], "BYTES"))
+    inputs1[0].set_data_from_numpy(np.array([[prompt2]], dtype=np.object_))
+
+    event1 = threading.Event()
+    event2 = threading.Event()
+    client1.start_stream(callback=partial(client1_callback, display, event1))
+    client2.start_stream(callback=partial(client2_callback, display, event2))
+
+    while True:
+        # Reset the events
+        event1.clear()
+        event2.clear()
+
+        # Reset both progress bars before sending the next pair of requests
+        display.clear()
+        parameters = {"ignore_eos": True, "max_tokens": max_tokens}
+
+        client1.async_stream_infer(
+            model_name=model_name,
+            inputs=inputs0,
+            enable_empty_final_response=True,
+            parameters=parameters,
+        )
+
+        # Add a small delay so that the two requests are not sent at the same
+        # time
+        time.sleep(0.05)
+        client2.async_stream_infer(
+            model_name=model_name,
+            inputs=inputs1,
+            enable_empty_final_response=True,
+            parameters=parameters,
+        )
+
+        event1.wait()
+        event2.wait()
+        time.sleep(2)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--url", type=str, default="localhost:8001")
+    parser.add_argument("--model", type=str, default="simple-gpt2")
+    parser.add_argument("--max-tokens", type=int, default=128)
+    args = parser.parse_args()
+    display = Display(args.max_tokens)
+
+    run_inferences(args.url, args.model, display, args.max_tokens)
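One possible refinement of the two callbacks in `client.py`, offered as a sketch rather than a required change: a single factory can build both callbacks, and it can record a stream error and set the event instead of raising. The motivation is an assumption about threading behavior: the callback most likely runs on the stream's worker thread, so an exception raised there may never reach the main thread, and `event1.wait()` could then block indefinitely. The names `make_stream_callback` and `fake_result` are hypothetical, and `fake_result` is only a local stand-in for the result object so the sketch runs without a server.

```python
import threading
from types import SimpleNamespace


def make_stream_callback(on_response, done_event, errors):
    """Build a callback with the (result, error) signature used by the client."""

    def callback(result, error):
        if error is not None:
            # Record the error and unblock the waiter instead of raising here.
            errors.append(error)
            done_event.set()
            return

        # Same order as the original callbacks: update first, then check
        # whether this was the final response of the request.
        on_response()
        final = result.get_response().parameters.get("triton_final_response")
        if final is not None and final.bool_param:
            done_event.set()

    return callback


def fake_result(is_final):
    """Hypothetical stand-in for a streamed result, for local testing only."""
    params = {"triton_final_response": SimpleNamespace(bool_param=is_final)}
    return SimpleNamespace(get_response=lambda: SimpleNamespace(parameters=params))


# Normal path: two responses, the second one marked as final.
tokens = []
errors = []
done = threading.Event()
callback = make_stream_callback(lambda: tokens.append(1), done, errors)
callback(result=fake_result(False), error=None)
callback(result=fake_result(True), error=None)

# Error path: the event is still set, so a waiting thread can inspect errors.
failed = threading.Event()
error_callback = make_stream_callback(lambda: None, failed, errors)
error_callback(result=None, error=RuntimeError("stream closed"))
```

With this shape, the client could pass `display.update_top` or `display.update_bottom` as `on_response` and check `errors` after each `wait()` returns.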
@@ -0,0 +1,46 @@
+# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+#  * Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+#  * Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#  * Neither the name of NVIDIA CORPORATION nor the names of its
+#    contributors may be used to endorse or promote products derived
+#    from this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+from tqdm import tqdm
+
+
+class Display:
+    """Two stacked tqdm progress bars, one per in-flight request."""
+
+    def __init__(self, max_tokens) -> None:
+        self._top = tqdm(position=0, total=max_tokens, miniters=1)
+        self._bottom = tqdm(position=1, total=max_tokens, miniters=1)
+        self._max_tokens = max_tokens
+
+    def update_top(self):
+        self._top.update(1)
+        self._top.refresh()
+
+    def update_bottom(self):
+        self._bottom.update(1)
+        self._bottom.refresh()
+
+    def clear(self):
+        self._top.reset()
+        self._bottom.reset()
@@ -0,0 +1,8 @@
+{
+    "data":
+      [
+        {
+          "input": ["machine learning is"]
+        }
+      ]
+  }