<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Semantic Caching

When deploying large language models (LLMs) or LLM-based workflows,
there are two key factors to consider: the performance and the
cost-efficiency of your application. Generating language model outputs
requires significant computational resources, for example, GPU time,
memory usage, and other infrastructure costs. These resource-intensive
requirements create a pressing need for optimization strategies that can
maintain high-quality outputs while minimizing operational expenses.

Semantic caching is a powerful way to reduce the computational cost of
LLM-based applications.

## Definition and Benefits

**_Semantic caching_** is a caching mechanism that takes into account
the semantics of the incoming request, rather than just the raw data itself.
It goes beyond simple key-value pairs and considers the content or
context of the data.
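
In contrast to an exact-match cache, whose key is the literal request string,
a semantic cache keys its lookups on meaning. Below is a minimal sketch of the
lookup step, assuming prompts have already been converted to embedding vectors
and using cosine similarity with an arbitrary threshold; the function names and
threshold value are illustrative, not part of the reference implementation
described later in this tutorial.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_lookup(prompt_embedding, cache, threshold=0.85):
    """Return a cached response whose prompt embedding is close enough, else None.

    `cache` is a list of (embedding, response) pairs; `threshold` is the minimum
    cosine similarity that counts as a "hit". Both are illustrative choices.
    """
    for cached_embedding, cached_response in cache:
        if cosine_similarity(prompt_embedding, cached_embedding) >= threshold:
            return cached_response  # a semantically similar prompt was answered before
    return None  # cache miss: fall through to full LLM inference
```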

This approach offers several benefits, including but not limited to:

+ **Cost Optimization**

  - Semantic caching can substantially reduce operational expenses associated
  with LLM deployments. By storing and reusing responses for semantically
  similar queries, it minimizes the number of actual LLM calls required.

+ **Reduced Latency**

  - One of the primary benefits of semantic caching is its ability to
  significantly improve response times. By retrieving cached responses for
  similar queries, the system can bypass the need for full model inference,
  resulting in reduced latency.

+ **Increased Throughput**

  - Semantic caching allows for more efficient utilization of computational
  resources. By serving cached responses for similar queries, it reduces the
  load on infrastructure components. This efficiency enables the system
  to handle a higher volume of requests with the same hardware, effectively
  increasing throughput.

+ **Scalability**

  - As the user base and the volume of queries grow, the probability of cache
  hits increases, provided that there is adequate storage and resources
  available to support this scaling. The improved resource efficiency and
  reduced computational demands allow applications to serve more users
  without a proportional increase in infrastructure costs.

+ **Consistency in Responses**

  - For certain applications, maintaining consistency in responses to
  similar queries can be beneficial. Semantic caching ensures that analogous
  questions receive uniform answers, which can be particularly useful
  in scenarios like customer service or educational applications.

## Sample Reference Implementation

In this tutorial we provide a reference implementation of a semantic cache in
[semantic_caching.py](./artifacts/semantic_caching.py). There are three key
dependencies:
* [SentenceTransformer](https://sbert.net/): a Python framework for computing
dense vector representations (embeddings) of sentences, paragraphs, and images.
  - We use this library, and the `all-MiniLM-L6-v2` model in particular, to
  convert incoming prompts into embeddings, enabling semantic comparison.
  - Alternatives include other [semantic search models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-search-models),
  OpenAI Embeddings, etc.
* [Faiss](https://github.com/facebookresearch/faiss/wiki): an open-source library
developed by Facebook AI Research for efficient similarity search and
clustering of dense vectors.
  - This library serves as the embedding store and is used to retrieve the most
  similar embedded prompt from the cached requests (i.e., from the index store).
  - It is a powerful library with a wide variety of CPU- and GPU-accelerated
  algorithms.
  - Alternatives include [annoy](https://github.com/spotify/annoy) and
  [cuVS](https://github.com/rapidsai/cuvs). Note, however, that cuVS is already
  integrated into Faiss; more on this can be found [here](https://docs.rapids.ai/api/cuvs/nightly/integrations/faiss/).
* [Theine](https://github.com/Yiling-J/theine): a high-performance in-memory
cache.
  - We use it as our exact-match cache backend. After the most similar
  prompt is identified, the corresponding cached response is retrieved from
  this cache. The library supports multiple eviction policies; in this
  tutorial we use "LRU".
  - One may also look into [Memcached](https://memcached.org/about) as a
  potential alternative.

The provided [script](./artifacts/semantic_caching.py) is heavily annotated, and we
encourage users to read through the code to gain a clear understanding of all
the necessary stages.
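
To give a feel for how these three pieces fit together, here is a simplified
sketch. It is not the actual `SemanticCPUCache` from
[semantic_caching.py](./artifacts/semantic_caching.py); the class and parameter
names (`ToySemanticCache`, `threshold`, `capacity`) are ours, and the Theine
constructor signature may differ between library versions.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from theine import Cache


class ToySemanticCache:
    """Simplified sketch of a semantic cache built on the three dependencies above."""

    def __init__(self, threshold: float = 0.8, capacity: int = 1000):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        dim = self.encoder.get_sentence_embedding_dimension()
        # Inner product over L2-normalized embeddings is equivalent to cosine similarity.
        self.index = faiss.IndexFlatIP(dim)
        # Exact-match response store (Theine); constructor arguments may vary by version.
        self.responses = Cache("lru", capacity)
        self.threshold = threshold
        self.keys = []  # row id in the Faiss index -> exact-match cache key

    def _embed(self, prompt: str) -> np.ndarray:
        embedding = self.encoder.encode([prompt], normalize_embeddings=True)
        return np.asarray(embedding, dtype="float32")

    def get(self, prompt: str):
        if self.index.ntotal == 0:
            return None  # nothing cached yet
        scores, ids = self.index.search(self._embed(prompt), 1)
        if scores[0][0] >= self.threshold:
            # A semantically similar prompt was cached; fetch its response.
            return self.responses.get(self.keys[ids[0][0]])
        return None

    def set(self, prompt: str, response) -> None:
        self.index.add(self._embed(prompt))
        self.keys.append(prompt)
        self.responses.set(prompt, response)
```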

## Incorporating a Semantic Cache into Your Workflow

For this tutorial, we'll use the [vLLM backend](https://github.com/triton-inference-server/vllm_backend)
as our example, focusing on demonstrating how to cache responses for the
non-streaming case. The principles covered here can be extended to handle
streaming scenarios as well.

### Customizing the vLLM Backend

First, let's start by cloning Triton's vLLM backend repository. This will
provide the necessary codebase to implement our semantic caching example.

```bash
git clone https://github.com/triton-inference-server/vllm_backend.git
cd vllm_backend
```

With the repository successfully cloned, the next step is to apply all
necessary modifications. To simplify this process, we've prepared a
[semantic_cache.patch](./artifacts/semantic_cache.patch)
that consolidates all changes into a single step:

```bash
curl https://raw.githubusercontent.com/triton-inference-server/tutorials/refs/heads/main/Conceptual_Guide/Part_8-semantic_caching/artifacts/semantic_cache.patch | git apply -v
```

If you're eager to start using Triton with the optimized vLLM backend,
you can skip ahead to the
[Launching Triton with Optimized vLLM Backend](#launching-triton-with-optimized-vllm-backend)
section. However, for those interested in understanding the specifics,
let's explore what this patch includes.

The patch introduces a new script,
[semantic_caching.py](./artifacts/semantic_caching.py), which is added to the
appropriate directory. This script implements the core logic for our
semantic caching functionality.

Next, the patch integrates semantic caching into the model. Let's walk through
these changes step by step.

First, it imports the necessary classes from
[semantic_caching.py](./artifacts/semantic_caching.py) into the codebase:

```diff
...

from utils.metrics import VllmStatLogger
+from utils.semantic_caching import SemanticCPUCacheConfig, SemanticCPUCache
```

Next, it sets up the semantic cache during the initialization step.
This setup will prepare your model to utilize semantic caching during
its operations.

```diff
    def initialize(self, args):
        self.args = args
        self.logger = pb_utils.Logger
        self.model_config = json.loads(args["model_config"])
        ...

        # Starting asyncio event loop to process the received requests asynchronously.
        self._loop = asyncio.get_event_loop()
        self._event_thread = threading.Thread(
            target=self.engine_loop, args=(self._loop,)
        )
        self._shutdown_event = asyncio.Event()
        self._event_thread.start()
+       config = SemanticCPUCacheConfig()
+       self.semantic_cache = SemanticCPUCache(config=config)

```

Finally, the patch incorporates logic to query and update the semantic cache
during request processing. This ensures that cached responses are efficiently
utilized whenever possible.

```diff
    async def generate(self, request):
        ...
        try:
            request_id = random_uuid()
            prompt = pb_utils.get_input_tensor_by_name(
                request, "text_input"
            ).as_numpy()[0]
            ...

            if prepend_input and stream:
                raise ValueError(
                    "When streaming, `exclude_input_in_output` = False is not allowed."
                )
+           cache_hit = self.semantic_cache.get(prompt)
+           if cache_hit:
+               try:
+                   response_sender.send(
+                       self.create_response(cache_hit, prepend_input),
+                       flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
+                   )
+                   if decrement_ongoing_request_count:
+                       self.ongoing_request_count -= 1
+               except Exception as err:
+                   print(f"Unexpected {err=} for prompt {prompt}")
+               return None
            ...

            async for output in response_iterator:
                ...

                last_output = output

            if not stream:
                response_sender.send(
                    self.create_response(last_output, prepend_input),
                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
                )
+               self.semantic_cache.set(prompt, last_output)

```
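
Stripped of the Triton-specific details, the control flow added by the patch is
the classic cache-aside pattern. The sketch below summarizes it in plain Python;
names such as `semantic_cache` and `run_inference` are illustrative placeholders,
not part of the backend API.

```python
def handle_request(prompt, semantic_cache, run_inference):
    # 1. Look for a semantically similar prompt that was answered before.
    cached_response = semantic_cache.get(prompt)
    if cached_response is not None:
        return cached_response  # cache hit: skip model inference entirely

    # 2. Cache miss: run full model inference.
    response = run_inference(prompt)

    # 3. Store the new (prompt, response) pair for future similar prompts.
    semantic_cache.set(prompt, response)
    return response
```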

### Launching Triton with Optimized vLLM Backend

To evaluate our optimized vLLM backend, let's start the vLLM docker container
and mount our implementation to `/opt/tritonserver/backends/vllm`. We'll
also mount the sample model repository provided in
`vllm_backend/samples/model_repository`; feel free to set up your own.
Use the following docker command to start Triton's vLLM docker container,
making sure to specify the proper paths to the cloned `vllm_backend`
repository and to replace `<xx.yy>` with the latest release of Triton.

```bash
docker run --gpus all -it --net=host --rm \
    --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/vllm_backend/src/:/opt/tritonserver/backends/vllm \
    -v /path/to/vllm_backend/samples/model_repository:/workspace/model_repository \
    -w /workspace \
    nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
```

Once inside the container, make sure to install the required dependencies:
```bash
pip install sentence_transformers faiss_gpu theine
```

Finally, let's launch Triton:
```bash
tritonserver --model-repository=model_repository/
```

After you start Triton, you will see output on the console showing
the server starting up and loading the model. When you see output
like the following, Triton is ready to accept inference requests.

```
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

### Evaluation

After you [start Triton](#launching-triton-with-optimized-vllm-backend)
with the sample model_repository, you can quickly run your first inference
request with the
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

We'll also time this query:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "Tell me, how do I create model repository for Triton Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <MODEL'S RESPONSE>}
real	0m1.128s
user	0m0.000s
sys	0m0.015s
```

Now, let's send a differently phrased prompt with the same semantics:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "How do I set up model repository for Triton Inference Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <SAME MODEL'S RESPONSE>}
real	0m0.038s
user	0m0.000s
sys	0m0.017s
```

Let's try one more:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "How model repository should be set up for Triton Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <SAME MODEL'S RESPONSE>}
real	0m0.059s
user	0m0.016s
sys	0m0.000s
```

Clearly, the latter two requests are semantically similar to the first one.
They resulted in cache hits, which reduced the latency from roughly 1.1s for
the first request to an average of 0.048s per cached request.
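
If you want to experiment with more paraphrases, a small client-side script
like the one below makes it easy to compare latencies for a batch of
semantically similar prompts. This is a sketch that uses the `requests`
library and assumes Triton is reachable on `localhost:8000` with the model
named `vllm_model`, as in the sample model repository.

```python
import time

import requests

PROMPTS = [
    "Tell me, how do I create model repository for Triton Server?",
    "How do I set up model repository for Triton Inference Server?",
    "How model repository should be set up for Triton Server?",
]

for prompt in PROMPTS:
    payload = {
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": 0, "max_tokens": 100},
        "exclude_input_in_output": True,
    }
    start = time.perf_counter()
    response = requests.post(
        "http://localhost:8000/v2/models/vllm_model/generate", json=payload
    )
    elapsed = time.perf_counter() - start
    # Print the latency and the first few characters of the generated text.
    print(f"{elapsed:.3f}s  {prompt!r}  ->  {response.json()['text_output'][:60]}...")
```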

## Current Limitations

* The current implementation of the semantic cache only considers the prompt
itself for cache hits, without accounting for additional request parameters
such as `max_tokens` and `temperature`. As a result, these parameters are not
included in the cache hit evaluation, which may affect the accuracy of cached
responses when different configurations are used (see the sketch after this
list for one way this could be addressed).

* Semantic cache effectiveness is heavily reliant on the choice of embedding
model and application context. For instance, queries like "How to set up model
repository for Triton Inference Server?" and "How not to set up model
repository for Triton Inference Server?" may have high cosine similarity
despite differing semantically. This makes it challenging to set an optimal
threshold for cache hits, as a narrow similarity range might exclude useful
cache entries.
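
One possible mitigation for the first limitation is to keep the semantic lookup
keyed on the prompt alone, but store the sampling parameters alongside each
cached response and accept a hit only when they match. The sketch below is an
illustrative idea, not part of the reference implementation; it assumes a cache
object with `get`/`set` methods like the ones used earlier, and the helper
names are ours.

```python
import json


def canonical_params(parameters: dict) -> str:
    """Normalize sampling parameters so logically identical dicts compare equal."""
    return json.dumps(parameters, sort_keys=True)


def cached_response(semantic_cache, prompt: str, parameters: dict):
    """Accept a semantic hit only if it was produced with the same parameters.

    Assumes the cache stores (params, response) tuples as values.
    """
    hit = semantic_cache.get(prompt)
    if hit is not None:
        stored_params, response = hit
        if stored_params == canonical_params(parameters):
            return response
    return None


def store_response(semantic_cache, prompt: str, parameters: dict, response) -> None:
    semantic_cache.set(prompt, (canonical_params(parameters), response))
```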

## Interested in This Feature?

While this reference implementation provides a glimpse into the potential
of semantic caching, it's important to note that it's not an officially
supported feature in Triton Inference Server.

We value your input! If you're interested in seeing semantic caching as a
supported feature in future releases, we invite you to join the ongoing
[discussion](https://github.com/triton-inference-server/server/discussions/7742).
Provide details about why you think semantic caching would
be valuable for your use case. Your feedback helps shape our product roadmap,
and we appreciate your contributions to making our software better for everyone.