<!--
# Copyright 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

# Semantic Caching

When deploying large language models (LLMs) or LLM-based workflows,
there are two key factors to consider: the performance and cost-efficiency
of your application. Generating language model outputs requires significant
computational resources, for example GPU time, memory usage, and other
infrastructure costs. These resource-intensive requirements create a
pressing need for optimization strategies that can maintain
high-quality outputs while minimizing operational expenses.

Semantic caching emerges as a powerful solution to reduce computational costs
for LLM-based applications.

## Definition and Benefits

**_Semantic caching_** is a caching mechanism that takes into account
the semantics of the incoming request, rather than just the raw data itself.
It goes beyond simple key-value pairs and considers the content or
context of the data.
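
As a quick illustration (a minimal sketch; the `sentence-transformers` model
named here is the one used later in this tutorial), two differently worded
prompts that an exact-match cache would treat as unrelated keys can still be
close to each other in embedding space:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two prompts with different wording but the same intent.
a = "How do I set up model repository for Triton Inference Server?"
b = "Tell me, how do I create model repository for Triton Server?"

emb_a, emb_b = model.encode([a, b])
print(util.cos_sim(emb_a, emb_b))  # expect a high cosine similarity score
```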

This approach offers several benefits including, but not limited to:

+ **Cost Optimization**

  - Semantic caching can substantially reduce operational expenses associated
  with LLM deployments. By storing and reusing responses for semantically
  similar queries, it minimizes the number of actual LLM calls required.

+ **Reduced Latency**

  - One of the primary benefits of semantic caching is its ability to
  significantly improve response times. By retrieving cached responses for
  similar queries, the system can bypass the need for full model inference,
  resulting in reduced latency.

+ **Increased Throughput**

  - Semantic caching allows for more efficient utilization of computational
  resources. By serving cached responses for similar queries, it reduces the
  load on infrastructure components. This efficiency enables the system
  to handle a higher volume of requests with the same hardware, effectively
  increasing throughput.

+ **Scalability**

  - As the user base and the volume of queries grow, the probability of cache
  hits increases, provided that there is adequate storage and resources
  available to support this scaling. The improved resource efficiency and
  reduced computational demands allow applications to serve more users
  without a proportional increase in infrastructure costs.

+ **Consistency in Responses**

  - For certain applications, maintaining consistency in responses to
  similar queries can be beneficial. Semantic caching ensures that analogous
  questions receive uniform answers, which can be particularly useful
  in scenarios like customer service or educational applications.
## Sample Reference Implementation

In this tutorial we provide a reference implementation for a Semantic Cache in
[semantic_caching.py](./artifacts/semantic_caching.py). There are 3 key
dependencies (a minimal sketch showing how they fit together follows this list):
* [SentenceTransformer](https://sbert.net/): a Python framework for computing
dense vector representations (embeddings) of sentences, paragraphs, and images.
  - We use this library, and `all-MiniLM-L6-v2` in particular, to convert
  an incoming prompt into an embedding, enabling semantic comparison.
  - Alternatives include [semantic search models](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#semantic-search-models),
  OpenAI Embeddings, etc.
* [Faiss](https://github.com/facebookresearch/faiss/wiki): an open-source library
developed by Facebook AI Research for efficient similarity search and
clustering of dense vectors.
  - This library provides the embedding store and is used to extract the most
  similar embedded prompt from the cached requests (or from the index store).
  - This is a powerful library with a great variety of CPU- and GPU-accelerated
  algorithms.
  - Alternatives include [annoy](https://github.com/spotify/annoy) and
  [cuVS](https://github.com/rapidsai/cuvs). However, note that cuVS is already
  integrated into Faiss; more on this can be found [here](https://docs.rapids.ai/api/cuvs/nightly/integrations/faiss/).
* [Theine](https://github.com/Yiling-J/theine): a high-performance in-memory
cache.
  - We will use it as our exact match cache backend. After the most similar
  prompt is identified, the corresponding cached response is extracted from
  the cache. This library supports multiple eviction policies; in this
  tutorial we use "LRU".
  - One may also look into [MemCached](https://memcached.org/about) as a
  potential alternative.
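
To make the moving pieces concrete, here is a minimal sketch of how these three
components can fit together. It is not the reference implementation: the class
name and threshold value are illustrative, and a plain dict stands in for
Theine's LRU cache.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer


class MinimalSemanticCache:
    """Toy semantic cache: embed prompts, search them with Faiss,
    and keep responses in a dict keyed by the Faiss row id."""

    def __init__(self, threshold: float = 0.85):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")
        dim = self.encoder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(dim)  # inner product on normalized vectors == cosine
        self.responses = {}                  # Faiss row id -> cached response
        self.threshold = threshold           # minimum cosine similarity for a hit

    def _embed(self, prompt: str) -> np.ndarray:
        emb = self.encoder.encode([prompt]).astype("float32")
        faiss.normalize_L2(emb)
        return emb

    def get(self, prompt: str):
        if self.index.ntotal == 0:
            return None
        scores, ids = self.index.search(self._embed(prompt), 1)
        if scores[0][0] >= self.threshold:
            return self.responses[int(ids[0][0])]
        return None

    def set(self, prompt: str, response: str) -> None:
        self.responses[self.index.ntotal] = response
        self.index.add(self._embed(prompt))
```

The reference implementation in
[semantic_caching.py](./artifacts/semantic_caching.py) follows the same
get/set shape, but uses Theine's "LRU" cache for the response store and
exposes its settings through a config object.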

The provided [script](./artifacts/semantic_caching.py) is heavily annotated, and we
encourage users to look through the code to gain a better understanding of all
the necessary stages.

## Incorporating Semantic Cache into your workflow

For this tutorial, we'll use the [vllm backend](https://github.com/triton-inference-server/vllm_backend)
as our example, focusing on demonstrating how to cache responses for the
non-streaming case. The principles covered here can be extended to handle
streaming scenarios as well.

### Customising vLLM Backend

First, let's start by cloning Triton's vllm backend repository. This will
provide the necessary codebase to implement our semantic caching example.

```bash
git clone https://github.com/triton-inference-server/vllm_backend.git
cd vllm_backend
```

With the repository successfully cloned, the next step is to apply all
necessary modifications. To simplify this process, we've prepared a
[semantic_cache.patch](./artifacts/semantic_cache.patch)
that consolidates all changes into a single step:

```bash
curl https://raw.githubusercontent.com/triton-inference-server/tutorials/refs/heads/main/Conceptual_Guide/Part_8-semantic_caching/artifacts/semantic_cache.patch | git apply -v
```

If you're eager to start using Triton with the optimized vLLM backend,
you can skip ahead to the
[Launching Triton with Optimized vLLM Backend](#launching-triton-with-optimized-vllm-backend)
section. However, for those interested in understanding the specifics,
let's explore what this patch includes.

The patch introduces a new script,
[semantic_caching.py](./artifacts/semantic_caching.py), which is added to the
appropriate directory. This script implements the core logic for our
semantic caching functionality.

Next, the patch integrates semantic caching into the model. Let's walk through
these changes step-by-step.

Firstly, it imports the necessary classes from
[semantic_caching.py](./artifacts/semantic_caching.py) into the codebase:

```diff
...

from utils.metrics import VllmStatLogger
+from utils.semantic_caching import SemanticCPUCacheConfig, SemanticCPUCache
```

Next, it sets up the semantic cache during the initialization step.
This setup will prepare your model to utilize semantic caching during
its operations.

```diff
def initialize(self, args):
    self.args = args
    self.logger = pb_utils.Logger
    self.model_config = json.loads(args["model_config"])
    ...

    # Starting asyncio event loop to process the received requests asynchronously.
    self._loop = asyncio.get_event_loop()
    self._event_thread = threading.Thread(
        target=self.engine_loop, args=(self._loop,)
    )
    self._shutdown_event = asyncio.Event()
    self._event_thread.start()
+   config = SemanticCPUCacheConfig()
+   self.semantic_cache = SemanticCPUCache(config=config)

```

Finally, the patch incorporates logic to query and update the semantic cache
during request processing. This ensures that cached responses are efficiently
utilized whenever possible.

```diff
async def generate(self, request):
    ...
    try:
        request_id = random_uuid()
        prompt = pb_utils.get_input_tensor_by_name(
            request, "text_input"
        ).as_numpy()[0]
        ...

        if prepend_input and stream:
            raise ValueError(
                "When streaming, `exclude_input_in_output` = False is not allowed."
            )
+       cache_hit = self.semantic_cache.get(prompt)
+       if cache_hit:
+           try:
+               response_sender.send(
+                   self.create_response(cache_hit, prepend_input),
+                   flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
+               )
+               if decrement_ongoing_request_count:
+                   self.ongoing_request_count -= 1
+           except Exception as err:
+               print(f"Unexpected {err=} for prompt {prompt}")
+           return None
        ...

        async for output in response_iterator:
            ...

        last_output = output

        if not stream:
            response_sender.send(
                self.create_response(last_output, prepend_input),
                flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
            )
+           self.semantic_cache.set(prompt, last_output)

```
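
The same get-then-set pattern can also be exercised outside of `model.py`. The
snippet below is a hypothetical standalone sketch: it assumes it is run from the
patched `src/` directory so that `utils.semantic_caching` is importable, that
`SemanticCPUCache` accepts plain string prompts (inside `model.py` the prompt
comes from `as_numpy()`), and that `get` returns a falsy value on a miss, as the
diff above relies on. `run_llm_inference` is a stand-in for the real vLLM call.

```python
from utils.semantic_caching import SemanticCPUCacheConfig, SemanticCPUCache


def run_llm_inference(prompt: str) -> str:
    # Stand-in for the real vLLM generate call.
    return f"<model response for: {prompt}>"


cache = SemanticCPUCache(config=SemanticCPUCacheConfig())

prompt = "Tell me, how do I create model repository for Triton Server?"
cached = cache.get(prompt)
if cached:
    response = cached                      # cache hit: skip inference entirely
else:
    response = run_llm_inference(prompt)   # cache miss: run the model...
    cache.set(prompt, response)            # ...and store the response for next time
print(response)
```
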
### Launching Triton with Optimized vLLM Backend

To evaluate our optimized vLLM backend, let's start the vLLM docker container and
mount our implementation to `/opt/tritonserver/backends/vllm`. We'll
also mount the sample model repository provided in
`vllm_backend/samples/model_repository`. Feel free to set up your own.
Use the following docker command to start Triton's vLLM docker container,
but make sure to specify proper paths to the cloned `vllm_backend`
repository and replace `<xx.yy>` with the latest release of Triton.

```bash
docker run --gpus all -it --net=host --rm \
    --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/vllm_backend/src/:/opt/tritonserver/backends/vllm \
    -v /path/to/vllm_backend/samples/model_repository:/workspace/model_repository \
    -w /workspace \
    nvcr.io/nvidia/tritonserver:<xx.yy>-vllm-python-py3
```

When inside the container, make sure to install the required dependencies:
```bash
pip install sentence_transformers faiss_gpu theine
```

Finally, let's launch Triton:
```bash
tritonserver --model-repository=model_repository/
```

After you start Triton you will see output on the console showing
the server starting up and loading the model. When you see output
like the following, Triton is ready to accept inference requests.

```
I1030 22:33:28.291908 1 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1030 22:33:28.292879 1 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1030 22:33:28.335154 1 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
```

### Evaluation

After you [start Triton](#launching-triton-with-optimized-vllm-backend)
with the sample model_repository, you can quickly run your first inference
request with the
[generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md).

We'll also time this query:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "Tell me, how do I create model repository for Triton Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <MODEL'S RESPONSE>}
real    0m1.128s
user    0m0.000s
sys     0m0.015s
```

Now, let's try a differently worded prompt that keeps the same semantics:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "How do I set up model repository for Triton Inference Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <SAME MODEL'S RESPONSE>}
real    0m0.038s
user    0m0.000s
sys     0m0.017s
```

Let's try one more:

```bash
time curl -X POST localhost:8000/v2/models/vllm_model/generate -d '{"text_input": "How model repository should be set up for Triton Server?", "parameters": {"stream": false, "temperature": 0, "max_tokens":100}, "exclude_input_in_output":true}'
```

Upon success, you should see a response from the server like this one:
```
{"model_name":"vllm_model","model_version":"1","text_output": <SAME MODEL'S RESPONSE>}
real    0m0.059s
user    0m0.016s
sys     0m0.000s
```

Clearly, the latter two requests are semantically similar to the first one,
which resulted in cache hits and reduced the latency of our model from
approximately 1.1s for the first request to an average of 0.048s per request
for the cached ones.

## Current Limitations

* The current implementation of the Semantic Cache only considers the prompt
itself for cache hits, without accounting for additional request parameters
such as `max_tokens` and `temperature`. As a result, these parameters are not
included in the cache hit evaluation, which may affect the accuracy of cached
responses when different configurations are used (a hedged sketch of one way
to make the key parameter-aware follows this list).

* Semantic Cache effectiveness is heavily reliant on the choice of embedding
model and application context. For instance, queries like "How to set up model
repository for Triton Inference Server?" and "How not to set up model
repository for Triton Inference Server?" may have high cosine similarity
despite differing semantically. This makes it challenging to set an optimal
threshold for cache hits, as a narrow similarity range might exclude useful
cache entries.
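
One possible direction for the first limitation is to keep embedding only the
prompt for similarity search, but fold the sampling parameters into the
exact-match key used by the response cache. The helper below is a hypothetical
sketch, not part of the reference implementation:

```python
import hashlib
import json


def make_cache_key(prompt: str, params: dict) -> str:
    """Combine the prompt with a stable digest of the sampling parameters.

    Similarity search would still operate on the prompt alone; only the
    exact-match lookup key becomes parameter-aware, so responses generated
    with different `max_tokens` or `temperature` values never collide.
    """
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    return f"{prompt}::{digest}"


# Example: the same prompt with different sampling settings maps to distinct keys.
key_a = make_cache_key("How do I set up model repository?", {"temperature": 0, "max_tokens": 100})
key_b = make_cache_key("How do I set up model repository?", {"temperature": 0.7, "max_tokens": 100})
assert key_a != key_b
```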

## Interested in This Feature?

While this reference implementation provides a glimpse into the potential
of semantic caching, it's important to note that it's not an officially
supported feature in Triton Inference Server.

We value your input! If you're interested in seeing semantic caching as a
supported feature in future releases, we invite you to join the ongoing
[discussion](https://github.com/triton-inference-server/server/discussions/7742).
Provide details about why you think semantic caching would
be valuable for your use case. Your feedback helps shape our product roadmap,
and we appreciate your contributions to making our software better for everyone.