- vLLM nodes with LMCache CPU offloading (4 replicas) serving the Llama 3.1 8B Instruct model
- Redis server
2. **Test with Example Application**: Run a Go application that:
- Connects to your deployed vLLM and Redis infrastructure
- Demonstrates KV cache indexing by processing a sample prompt

The demonstrated KV-cache indexer enables AI-aware routing, accelerating inference across the system by minimizing redundant computation.
## vLLM Deployment
@@ -95,13 +93,13 @@ Ensure you have a running deployment with vLLM and Redis as described above.
The vLLM node can be tested with the prompt found in `examples/kv-cache-index/main.go`.

First, download the tokenizer bindings required by the `kvcache.Indexer` for prompt tokenization:

```bash
make download-tokenizer
```

Then, set the required environment variables and run the example:
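For instance, a run might look like the following. The variable names here are assumptions for illustration only; use the names documented by the project for your deployment. The example path comes from this README.

```bash
# Illustrative only: the variable names are assumptions, not the project's
# documented ones -- check the project docs for the exact names.
export HF_TOKEN=<your-huggingface-token>   # model/tokenizer access, if required
export REDIS_ADDR=<redis-host>:6379        # address of the deployed Redis server

go run examples/kv-cache-index/main.go
```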