Reference architecture for using Valkey as a high-performance RAG (Retrieval-Augmented Generation) retrieval cache
This repository provides a comprehensive reference implementation for using Valkey as a caching layer in RAG (Retrieval-Augmented Generation) applications. It includes semantic caching, vector similarity search, and intelligent cache invalidation patterns that dramatically reduce latency and costs in LLM-powered applications.
| Capability | Benefit |
|---|---|
| Sub-millisecond latency | Cache hits return in <1ms vs 500ms+ for LLM calls |
| Semantic similarity search | Match queries by meaning, not just exact text |
| Vector search built-in | Native HNSW/FLAT indexing for embeddings |
| Cost reduction | 70-90% reduction in LLM API calls with proper caching |
| True open source | BSD-3 licensed, no commercial restrictions |
Valkey is a community-driven fork of Redis 7.2.4, maintained under the Linux Foundation. Here's why it's the superior choice for RAG applications:
| Feature | Valkey | Redis |
|---|---|---|
| License | BSD-3 (true open source) | SSPL/RSALv2 (restrictive) |
| Commercial Cloud Use | ✅ No restrictions | ❌ Restricted by SSPL |
| Vector Search | ✅ Native module (BSD-3) | ❌ Requires Redis Stack modules |
| Community Governance | ✅ Linux Foundation | ❌ Single company controlled |
| AWS ElastiCache Support | ✅ Native support | |
| Fork Compatibility | ✅ Redis protocol compatible | N/A |
- valkey-search Module - Native vector similarity search without licensing concerns
- RDMA Support - Ultra-low latency for high-performance deployments
- Improved Cluster Scaling - Better slot migration for large clusters
- Enhanced Memory Efficiency - Optimized memory allocator options
- Active Defragmentation - Improved memory management for long-running caches
```
┌──────────────────────────────────────────────────────────────────────────┐
│                             RAG Application                              │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌─────────────┐     ┌──────────────────┐     ┌───────────────────────┐  │
│  │    Query    │     │     Semantic     │     │     LLM Response      │  │
│  │   Handler   │────▶│   Cache Layer    │────▶│       Generator       │  │
│  │             │     │     (Valkey)     │     │   (OpenAI/Bedrock)    │  │
│  └─────────────┘     └────────┬─────────┘     └───────────────────────┘  │
│                               │                                          │
│                  ┌────────────▼───────────┐                              │
│                  │    Vector Retrieval    │                              │
│                  │     Cache (Valkey)     │                              │
│                  │  ────────────────────  │                              │
│                  │  • Document chunks     │                              │
│                  │  • Embeddings          │                              │
│                  │  • Metadata            │                              │
│                  └────────────────────────┘                              │
│                                                                          │
└────────────────────────────────────┬─────────────────────────────────────┘
                                     │
                                     ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                              Valkey Cluster                              │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Primary Node (Shard 1) │ Primary Node (Shard 2) │ Primary (Shard N)  │ │
│ │ ├─ Semantic Cache      │ ├─ Semantic Cache      │ ├─ Semantic Cache  │ │
│ │ ├─ Vector Index (HNSW) │ ├─ Vector Index (HNSW) │ ├─ Vector Index    │ │
│ │ └─ Response Cache      │ └─ Response Cache      │ └─ Response Cache  │ │
│ │           │            │           │            │          │         │ │
│ │           ▼            │           ▼            │          ▼         │ │
│ │      Replica Node      │      Replica Node      │     Replica Node   │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```
- Docker & Docker Compose
- Python 3.11+ / Node.js 18+ / Go 1.21+
- OpenAI API key (or Ollama for local development)
```bash
# Start Valkey locally
cd deployment/docker
docker-compose up -d
```

```bash
# Run the Python semantic cache example
cd examples/python/semantic-cache
pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"
python main.py
```

```bash
# First query - cache miss, calls LLM
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is machine learning?"}'

# Similar query - cache hit!
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain machine learning to me"}'
```

```
valkey-rag-cache/
├── README.md                        # This file
├── ARCHITECTURE.md                  # Detailed architecture documentation
├── LICENSE                          # Apache 2.0
│
├── reference-architecture/
│   ├── diagrams/                    # Architecture diagrams (SVG, PNG)
│   └── design-decisions.md          # ADRs and design rationale
│
├── examples/
│   ├── python/                      # Python implementations
│   │   ├── semantic-cache/          # Semantic caching demo
│   │   ├── vector-search/           # Vector similarity search
│   │   ├── rag-pipeline/            # Full RAG pipeline
│   │   └── hybrid-search/           # Vector + keyword hybrid
│   │
│   ├── typescript/                  # TypeScript implementations
│   │   ├── semantic-cache/
│   │   ├── vector-search/
│   │   └── rag-pipeline/
│   │
│   └── go/                          # Go implementations
│       ├── semantic-cache/
│       ├── vector-search/
│       └── rag-pipeline/
│
├── cookbooks/                       # Step-by-step guides
│   ├── 01-getting-started.md
│   ├── 02-semantic-caching.md
│   ├── 03-vector-search-patterns.md
│   ├── 04-cache-invalidation.md
│   ├── 05-scaling-production.md
│   └── 06-monitoring-observability.md
│
├── deployment/
│   ├── docker/                      # Local development
│   ├── kubernetes/                  # K8s deployment
│   └── aws/                         # AWS deployment (CDK, CloudFormation)
│
├── benchmarks/                      # Performance benchmarks
│   ├── latency/
│   ├── throughput/
│   └── comparison/
│
└── tests/                           # Test suites
    ├── integration/
    └── performance/
```
Cache LLM responses based on semantic similarity of queries. When a user asks a question similar to a previously asked one, return the cached response instead of calling the LLM again.
Key Features:
- Configurable similarity threshold (default: 0.92)
- TTL-based expiration
- Cache hit/miss metrics
- Multi-tenant support
Python Example | TypeScript Example | Go Example
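The core decision in the semantic cache layer is simple to sketch in plain Python. The snippet below is an illustrative, in-memory stand-in for the Valkey-backed implementation: `lookup`, the cache structure, and the toy 2-d embeddings are hypothetical names for illustration, not the repo's API, and in production the similarity search runs inside Valkey's vector index rather than in application code.

```python
import math

SIMILARITY_THRESHOLD = 0.92  # matches the default threshold noted above

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(query_embedding, cache):
    """Return the cached response whose stored query embedding is most
    similar to the new query, if it clears the threshold; else None."""
    best_score, best_response = 0.0, None
    for cached_embedding, response in cache:
        score = cosine_similarity(query_embedding, cached_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

# Toy 2-d "embeddings": nearly parallel vectors model paraphrased queries.
cache = [([1.0, 0.0], "ML is..."), ([0.0, 1.0], "Valkey is...")]
print(lookup([0.99, 0.05], cache))  # near-duplicate query -> "ML is..."
print(lookup([0.7, 0.7], cache))    # dissimilar query -> None
```

Tuning the threshold trades hit rate against answer relevance: lower values serve more cached responses but risk returning an answer to a subtly different question.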
Store and retrieve document chunks using vector similarity search. This is the core retrieval mechanism for RAG applications.
Key Features:
- HNSW index for fast approximate search
- FLAT index for exact search (smaller datasets)
- Metadata filtering
- Hybrid search (vector + keyword)
Python Example | TypeScript Example | Go Example
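Conceptually, a FLAT index is a filtered brute-force scan: score every chunk, optionally pre-filter on metadata, and keep the top-k. The `knn_flat` helper and toy documents below are illustrative only; in the examples this work is delegated to Valkey's native index, which is what makes HNSW worthwhile at scale.

```python
import math

def knn_flat(query, docs, k=2, metadata_filter=None):
    """Exact (FLAT-style) k-nearest-neighbour search over document chunks,
    with an optional metadata pre-filter, ranked by cosine distance."""
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)

    candidates = [
        d for d in docs
        if metadata_filter is None or metadata_filter(d["metadata"])
    ]
    return sorted(candidates,
                  key=lambda d: cosine_distance(query, d["embedding"]))[:k]

docs = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"source": "docs"}},
    {"id": "b", "embedding": [0.9, 0.1], "metadata": {"source": "blog"}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"source": "docs"}},
]
hits = knn_flat([1.0, 0.05], docs, k=2,
                metadata_filter=lambda m: m["source"] == "docs")
print([d["id"] for d in hits])  # → ['a', 'c']
```

This exact scan is O(n) per query, which is why FLAT suits smaller datasets while HNSW's approximate graph search is preferred for large corpora.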
End-to-end RAG implementation combining document ingestion, vector retrieval, semantic caching, and LLM response generation.
Key Features:
- Document chunking strategies
- Embedding generation (OpenAI, Bedrock, Ollama)
- Multi-level caching
- Response streaming
Python Example | TypeScript Example | Go Example
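As a minimal illustration of the simplest chunking strategy listed above (fixed-size windows with overlap, so context isn't lost at chunk boundaries), here is a sketch; `chunk_text` is a hypothetical helper, not the repo's actual implementation:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks with overlapping windows,
    a common baseline chunking strategy for RAG ingestion."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window already reaches the end of the text
    return chunks

text = "x" * 500
chunks = chunk_text(text, chunk_size=200, overlap=50)
print([len(c) for c in chunks])  # → [200, 200, 200]
```

Each chunk would then be embedded and stored in the vector retrieval cache; more sophisticated strategies (sentence- or heading-aware splitting) follow the same shape.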
| Cookbook | Description |
|---|---|
| Getting Started | Local setup, basic operations, first semantic cache |
| Semantic Caching | Threshold tuning, cache warming, best practices |
| Vector Search Patterns | Index types, hybrid search, re-ranking |
| Cache Invalidation | TTL strategies, event-driven invalidation |
| Production Scaling | Cluster mode, replication, sharding |
| Monitoring | Metrics, alerting, debugging |
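To illustrate the TTL strategy from the cache-invalidation cookbook, here is a minimal in-memory sketch that mirrors the semantics of Valkey's `SET key value EX seconds` with lazy expiration on read. The `TTLCache` class is hypothetical, for illustration only; the real implementation simply sets an expiry when writing to Valkey.

```python
import time

class TTLCache:
    """In-memory sketch of TTL-based invalidation: entries expire
    ttl_seconds after they are written, like SET ... EX in Valkey."""
    def __init__(self):
        self._store = {}

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy expiration on read
            return None
        return value

cache = TTLCache()
cache.set("response:faq", "cached answer", ttl_seconds=0.05)
print(cache.get("response:faq"))  # → cached answer
time.sleep(0.06)
print(cache.get("response:faq"))  # → None (expired)
```

Event-driven invalidation complements TTLs: when source documents change, the affected keys are deleted immediately instead of waiting for expiry.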
| Provider | Embeddings | Chat/Completion | Local |
|---|---|---|---|
| OpenAI | ✅ text-embedding-3-large | ✅ GPT-4, GPT-4o | ❌ |
| Amazon Bedrock | ✅ Titan Embeddings | ✅ Claude 3.5, Llama | ❌ |
| Ollama | ✅ nomic-embed-text | ✅ Llama, Mistral | ✅ |
| Hugging Face | ✅ sentence-transformers | ✅ Various | ✅ |
Typical performance improvements with Valkey RAG caching:
| Metric | Without Cache | With Valkey Cache | Improvement |
|---|---|---|---|
| P50 Latency | 800ms | 2ms | 400x faster |
| P99 Latency | 2500ms | 15ms | 166x faster |
| LLM API Calls | 100% | 15-30% | 70-85% reduction |
| Cost (per 1M queries) | $500 | $75-150 | 70-85% savings |
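The cost row is simple arithmetic: only cache misses reach the LLM provider, so expected spend scales with the miss rate. A quick sanity check of the table's numbers (the $0.0005-per-call figure is implied by $500 per 1M uncached queries):

```python
def expected_cost(queries, cost_per_llm_call, cache_hit_rate):
    """Expected LLM spend when only cache misses reach the provider."""
    return queries * (1 - cache_hit_rate) * cost_per_llm_call

base = 500 / 1_000_000  # $0.0005 per call, implied by the table
print(round(expected_cost(1_000_000, base, 0.70)))  # → 150
print(round(expected_cost(1_000_000, base, 0.85)))  # → 75
```

Hit rates of 70-85% thus reproduce the $75-150 range in the table.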
See benchmarks/ for detailed benchmark scripts and results.
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Valkey Official Website
- Valkey Documentation
- Valkey GitHub
- AWS ElastiCache for Valkey
- Vector Similarity Search in Valkey
Built with ❤️ for the AI agent community