
Valkey RAG Cache

Reference architecture for using Valkey as a high-performance RAG (Retrieval-Augmented Generation) retrieval cache


🎯 Overview

This repository provides a comprehensive reference implementation for using Valkey as a caching layer in RAG (Retrieval-Augmented Generation) applications. It includes semantic caching, vector similarity search, and intelligent cache invalidation patterns that dramatically reduce latency and costs in LLM-powered applications.

Why Valkey for RAG Caching?

| Capability | Benefit |
|---|---|
| Sub-millisecond latency | Cache hits return in <1ms vs 500ms+ for LLM calls |
| Semantic similarity search | Match queries by meaning, not just exact text |
| Vector search built-in | Native HNSW/FLAT indexing for embeddings |
| Cost reduction | 70-90% reduction in LLM API calls with proper caching |
| True open source | BSD-3 licensed, no commercial restrictions |

πŸ†š Why Valkey Over Redis?

Valkey is a community-driven fork of Redis 7.2.4, maintained under the Linux Foundation. Here's why it's the superior choice for RAG applications:

| Feature | Valkey | Redis |
|---|---|---|
| License | BSD-3 (True Open Source) | SSPL/RSALv2 (Restrictive) |
| Commercial Cloud Use | βœ… No restrictions | ⚠️ License restrictions |
| Vector Search | βœ… Native module (BSD-3) | ⚠️ Redis Stack (proprietary) |
| Community Governance | βœ… Linux Foundation | ❌ Single company controlled |
| AWS ElastiCache Support | βœ… Native support | ⚠️ Being phased out |
| Fork Compatibility | βœ… Redis protocol compatible | N/A |

Valkey-Specific Features Leveraged

  1. valkey-search Module - Native vector similarity search without licensing concerns
  2. RDMA Support - Ultra-low latency for high-performance deployments
  3. Improved Cluster Scaling - Better slot migration for large clusters
  4. Enhanced Memory Efficiency - Optimized memory allocator options
  5. Active Defragmentation - Improved memory management for long-running caches

πŸ“ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              RAG Application                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚   Query     β”‚     β”‚   Semantic       β”‚     β”‚   LLM Response          β”‚ β”‚
β”‚   β”‚   Handler   │────▢│   Cache Layer    │────▢│   Generator             β”‚ β”‚
β”‚   β”‚             β”‚     β”‚   (Valkey)       β”‚     β”‚   (OpenAI/Bedrock)      β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                β”‚                                            β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚
β”‚                    β”‚   Vector Retrieval    β”‚                               β”‚
β”‚                    β”‚   Cache (Valkey)      β”‚                               β”‚
β”‚                    β”‚   ─────────────────   β”‚                               β”‚
β”‚                    β”‚   β€’ Document chunks   β”‚                               β”‚
β”‚                    β”‚   β€’ Embeddings        β”‚                               β”‚
β”‚                    β”‚   β€’ Metadata          β”‚                               β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Valkey Cluster                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚  Primary Node (Shard 1)  β”‚  Primary Node (Shard 2)  β”‚  Primary (Shard N)β”‚β”‚
β”‚  β”‚  β”œβ”€ Semantic Cache       β”‚  β”œβ”€ Semantic Cache       β”‚  β”œβ”€ Semantic Cacheβ”‚β”‚
β”‚  β”‚  β”œβ”€ Vector Index (HNSW)  β”‚  β”œβ”€ Vector Index (HNSW)  β”‚  β”œβ”€ Vector Index  β”‚β”‚
β”‚  β”‚  └─ Response Cache       β”‚  └─ Response Cache       β”‚  └─ Response Cacheβ”‚β”‚
β”‚  β”‚         β”‚                β”‚         β”‚                β”‚         β”‚         β”‚β”‚
β”‚  β”‚         β–Ό                β”‚         β–Ό                β”‚         β–Ό         β”‚β”‚
β”‚  β”‚    Replica Node          β”‚    Replica Node          β”‚    Replica Node   β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • Python 3.11+ / Node.js 18+ / Go 1.21+
  • OpenAI API key (or Ollama for local development)

1. Start Valkey with Vector Search

```bash
cd deployment/docker
docker-compose up -d
```

2. Run the Python Semantic Cache Example

```bash
cd examples/python/semantic-cache
pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"
python main.py
```

3. Verify It's Working

```bash
# First query - cache miss, calls LLM
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is machine learning?"}'

# Similar query - cache hit!
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain machine learning to me"}'
```

πŸ“ Repository Structure

valkey-rag-cache/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ ARCHITECTURE.md                    # Detailed architecture documentation
β”œβ”€β”€ LICENSE                            # Apache 2.0
β”‚
β”œβ”€β”€ reference-architecture/
β”‚   β”œβ”€β”€ diagrams/                      # Architecture diagrams (SVG, PNG)
β”‚   └── design-decisions.md            # ADRs and design rationale
β”‚
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ python/                        # Python implementations
β”‚   β”‚   β”œβ”€β”€ semantic-cache/            # Semantic caching demo
β”‚   β”‚   β”œβ”€β”€ vector-search/             # Vector similarity search
β”‚   β”‚   β”œβ”€β”€ rag-pipeline/              # Full RAG pipeline
β”‚   β”‚   └── hybrid-search/             # Vector + keyword hybrid
β”‚   β”‚
β”‚   β”œβ”€β”€ typescript/                    # TypeScript implementations
β”‚   β”‚   β”œβ”€β”€ semantic-cache/
β”‚   β”‚   β”œβ”€β”€ vector-search/
β”‚   β”‚   └── rag-pipeline/
β”‚   β”‚
β”‚   └── go/                            # Go implementations
β”‚       β”œβ”€β”€ semantic-cache/
β”‚       β”œβ”€β”€ vector-search/
β”‚       └── rag-pipeline/
β”‚
β”œβ”€β”€ cookbooks/                         # Step-by-step guides
β”‚   β”œβ”€β”€ 01-getting-started.md
β”‚   β”œβ”€β”€ 02-semantic-caching.md
β”‚   β”œβ”€β”€ 03-vector-search-patterns.md
β”‚   β”œβ”€β”€ 04-cache-invalidation.md
β”‚   β”œβ”€β”€ 05-scaling-production.md
β”‚   └── 06-monitoring-observability.md
β”‚
β”œβ”€β”€ deployment/
β”‚   β”œβ”€β”€ docker/                        # Local development
β”‚   β”œβ”€β”€ kubernetes/                    # K8s deployment
β”‚   └── aws/                           # AWS deployment (CDK, CloudFormation)
β”‚
β”œβ”€β”€ benchmarks/                        # Performance benchmarks
β”‚   β”œβ”€β”€ latency/
β”‚   β”œβ”€β”€ throughput/
β”‚   └── comparison/
β”‚
└── tests/                             # Test suites
    β”œβ”€β”€ integration/
    └── performance/

πŸ› οΈ Examples Overview

Semantic Cache

Cache LLM responses based on semantic similarity of queries. When a user asks a question similar to a previously asked one, return the cached response instead of calling the LLM again.

Key Features:

  • Configurable similarity threshold (default: 0.92)
  • TTL-based expiration
  • Cache hit/miss metrics
  • Multi-tenant support

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

Vector Search for RAG Retrieval

Store and retrieve document chunks using vector similarity search. This is the core retrieval mechanism for RAG applications.

Key Features:

  • HNSW index for fast approximate search
  • FLAT index for exact search (smaller datasets)
  • Metadata filtering
  • Hybrid search (vector + keyword)

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

Full RAG Pipeline

End-to-end RAG implementation combining document ingestion, vector retrieval, semantic caching, and LLM response generation.

Key Features:

  • Document chunking strategies
  • Embedding generation (OpenAI, Bedrock, Ollama)
  • Multi-level caching
  • Response streaming

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

πŸ“š Cookbooks

| Cookbook | Description |
|---|---|
| Getting Started | Local setup, basic operations, first semantic cache |
| Semantic Caching | Threshold tuning, cache warming, best practices |
| Vector Search Patterns | Index types, hybrid search, re-ranking |
| Cache Invalidation | TTL strategies, event-driven invalidation |
| Production Scaling | Cluster mode, replication, sharding |
| Monitoring | Metrics, alerting, debugging |

πŸ”§ Supported LLM/Embedding Providers

| Provider | Embeddings | Chat/Completion | Local |
|---|---|---|---|
| OpenAI | βœ… text-embedding-3-large | βœ… GPT-4, GPT-4o | ❌ |
| Amazon Bedrock | βœ… Titan Embeddings | βœ… Claude 3.5, Llama | ❌ |
| Ollama | βœ… nomic-embed-text | βœ… Llama, Mistral | βœ… |
| Hugging Face | βœ… sentence-transformers | βœ… Various | βœ… |

πŸ“Š Performance Benchmarks

Typical performance improvements with Valkey RAG caching:

| Metric | Without Cache | With Valkey Cache | Improvement |
|---|---|---|---|
| P50 Latency | 800ms | 2ms | 400x faster |
| P99 Latency | 2500ms | 15ms | 166x faster |
| LLM API Calls | 100% | 15-30% | 70-85% reduction |
| Cost (per 1M queries) | $500 | $75-150 | 70-85% savings |

See benchmarks/ for detailed benchmark scripts and results.
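The cost figures follow directly from the cache-miss rate: only misses reach the LLM API. A quick sanity check, assuming a flat per-call LLM cost and negligible cost for serving hits (`cached_cost` is an illustrative helper, not part of the repo):

```python
def cached_cost(baseline_cost, miss_rate):
    """Expected LLM spend after caching: only cache misses hit the API.
    Ignores the comparatively tiny cost of serving cache hits."""
    return baseline_cost * miss_rate

# With 15-30% of queries missing the cache, $500 per 1M queries becomes:
low = cached_cost(500, 0.15)   # $75
high = cached_cost(500, 0.30)  # $150
```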

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

πŸ“„ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ”— Resources


Built with ❀️ for the AI agent community
