
Valkey RAG Cache

Reference architecture for using Valkey as a high-performance RAG (Retrieval-Augmented Generation) retrieval cache


🎯 Overview

This repository provides a comprehensive reference implementation for using Valkey as a caching layer in RAG (Retrieval-Augmented Generation) applications. It includes semantic caching, vector similarity search, and intelligent cache invalidation patterns that dramatically reduce latency and costs in LLM-powered applications.

Why Valkey for RAG Caching?

| Capability | Benefit |
|---|---|
| Sub-millisecond latency | Cache hits return in <1ms vs 500ms+ for LLM calls |
| Semantic similarity search | Match queries by meaning, not just exact text |
| Vector search built-in | Native HNSW/FLAT indexing for embeddings |
| Cost reduction | 70-90% reduction in LLM API calls with proper caching |
| True open source | BSD-3 licensed, no commercial restrictions |

πŸ†š Why Valkey Over Redis?

Valkey is a community-driven fork of Redis 7.2.4, maintained under the Linux Foundation. Here's why it's the superior choice for RAG applications:

| Feature | Valkey | Redis |
|---|---|---|
| License | BSD-3 (True Open Source) | SSPL/RSALv2 (Restrictive) |
| Commercial Cloud Use | βœ… No restrictions | ⚠️ License restrictions |
| Vector Search | βœ… Native module (BSD-3) | ⚠️ Redis Stack (proprietary) |
| Community Governance | βœ… Linux Foundation | ❌ Single company controlled |
| AWS ElastiCache Support | βœ… Native support | ⚠️ Being phased out |
| Fork Compatibility | βœ… Redis protocol compatible | N/A |

Valkey-Specific Features Leveraged

  1. valkey-search Module - Native vector similarity search without licensing concerns
  2. RDMA Support - Ultra-low latency for high-performance deployments
  3. Improved Cluster Scaling - Better slot migration for large clusters
  4. Enhanced Memory Efficiency - Optimized memory allocator options
  5. Active Defragmentation - Improved memory management for long-running caches

πŸ“ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              RAG Application                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚   Query     β”‚     β”‚   Semantic       β”‚     β”‚   LLM Response          β”‚ β”‚
β”‚   β”‚   Handler   │────▢│   Cache Layer    │────▢│   Generator             β”‚ β”‚
β”‚   β”‚             β”‚     β”‚   (Valkey)       β”‚     β”‚   (OpenAI/Bedrock)      β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                β”‚                                            β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚
β”‚                    β”‚   Vector Retrieval    β”‚                               β”‚
β”‚                    β”‚   Cache (Valkey)      β”‚                               β”‚
β”‚                    β”‚   ─────────────────   β”‚                               β”‚
β”‚                    β”‚   β€’ Document chunks   β”‚                               β”‚
β”‚                    β”‚   β€’ Embeddings        β”‚                               β”‚
β”‚                    β”‚   β€’ Metadata          β”‚                               β”‚
β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β”‚                                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Valkey Cluster                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚
β”‚  β”‚  Primary Node (Shard 1)  β”‚  Primary Node (Shard 2)  β”‚  Primary (Shard N)β”‚β”‚
β”‚  β”‚  β”œβ”€ Semantic Cache       β”‚  β”œβ”€ Semantic Cache       β”‚  β”œβ”€ Semantic Cacheβ”‚β”‚
β”‚  β”‚  β”œβ”€ Vector Index (HNSW)  β”‚  β”œβ”€ Vector Index (HNSW)  β”‚  β”œβ”€ Vector Index  β”‚β”‚
β”‚  β”‚  └─ Response Cache       β”‚  └─ Response Cache       β”‚  └─ Response Cacheβ”‚β”‚
β”‚  β”‚         β”‚                β”‚         β”‚                β”‚         β”‚         β”‚β”‚
β”‚  β”‚         β–Ό                β”‚         β–Ό                β”‚         β–Ό         β”‚β”‚
β”‚  β”‚    Replica Node          β”‚    Replica Node          β”‚    Replica Node   β”‚β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • Python 3.11+ / Node.js 18+ / Go 1.21+
  • OpenAI API key (or Ollama for local development)

1. Start Valkey with Vector Search

```bash
cd deployment/docker
docker-compose up -d
```

2. Run the Python Semantic Cache Example

```bash
cd examples/python/semantic-cache
pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"
python main.py
```

3. Verify It's Working

```bash
# First query - cache miss, calls LLM
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is machine learning?"}'

# Similar query - cache hit!
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain machine learning to me"}'
```

πŸ“ Repository Structure

valkey-rag-cache/
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ ARCHITECTURE.md                    # Detailed architecture documentation
β”œβ”€β”€ LICENSE                            # Apache 2.0
β”‚
β”œβ”€β”€ reference-architecture/
β”‚   β”œβ”€β”€ diagrams/                      # Architecture diagrams (SVG, PNG)
β”‚   └── design-decisions.md            # ADRs and design rationale
β”‚
β”œβ”€β”€ examples/
β”‚   β”œβ”€β”€ python/                        # Python implementations
β”‚   β”‚   β”œβ”€β”€ semantic-cache/            # Semantic caching demo
β”‚   β”‚   β”œβ”€β”€ vector-search/             # Vector similarity search
β”‚   β”‚   β”œβ”€β”€ rag-pipeline/              # Full RAG pipeline
β”‚   β”‚   └── hybrid-search/             # Vector + keyword hybrid
β”‚   β”‚
β”‚   β”œβ”€β”€ typescript/                    # TypeScript implementations
β”‚   β”‚   β”œβ”€β”€ semantic-cache/
β”‚   β”‚   β”œβ”€β”€ vector-search/
β”‚   β”‚   └── rag-pipeline/
β”‚   β”‚
β”‚   └── go/                            # Go implementations
β”‚       β”œβ”€β”€ semantic-cache/
β”‚       β”œβ”€β”€ vector-search/
β”‚       └── rag-pipeline/
β”‚
β”œβ”€β”€ cookbooks/                         # Step-by-step guides
β”‚   β”œβ”€β”€ 01-getting-started.md
β”‚   β”œβ”€β”€ 02-semantic-caching.md
β”‚   β”œβ”€β”€ 03-vector-search-patterns.md
β”‚   β”œβ”€β”€ 04-cache-invalidation.md
β”‚   β”œβ”€β”€ 05-scaling-production.md
β”‚   └── 06-monitoring-observability.md
β”‚
β”œβ”€β”€ deployment/
β”‚   β”œβ”€β”€ docker/                        # Local development
β”‚   β”œβ”€β”€ kubernetes/                    # K8s deployment
β”‚   └── aws/                           # AWS deployment (CDK, CloudFormation)
β”‚
β”œβ”€β”€ benchmarks/                        # Performance benchmarks
β”‚   β”œβ”€β”€ latency/
β”‚   β”œβ”€β”€ throughput/
β”‚   └── comparison/
β”‚
└── tests/                             # Test suites
    β”œβ”€β”€ integration/
    └── performance/

πŸ› οΈ Examples Overview

Semantic Cache

Cache LLM responses based on semantic similarity of queries. When a user asks a question similar to a previously asked one, return the cached response instead of calling the LLM again.

Key Features:

  • Configurable similarity threshold (default: 0.92)
  • TTL-based expiration
  • Cache hit/miss metrics
  • Multi-tenant support

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

Vector Search for RAG Retrieval

Store and retrieve document chunks using vector similarity search. This is the core retrieval mechanism for RAG applications.

Key Features:

  • HNSW index for fast approximate search
  • FLAT index for exact search (smaller datasets)
  • Metadata filtering
  • Hybrid search (vector + keyword)

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

Full RAG Pipeline

End-to-end RAG implementation combining document ingestion, vector retrieval, semantic caching, and LLM response generation.

Key Features:

  • Document chunking strategies
  • Embedding generation (OpenAI, Bedrock, Ollama)
  • Multi-level caching
  • Response streaming

πŸ“– Python Example | πŸ“– TypeScript Example | πŸ“– Go Example

πŸ“š Cookbooks

| Cookbook | Description |
|---|---|
| Getting Started | Local setup, basic operations, first semantic cache |
| Semantic Caching | Threshold tuning, cache warming, best practices |
| Vector Search Patterns | Index types, hybrid search, re-ranking |
| Cache Invalidation | TTL strategies, event-driven invalidation |
| Production Scaling | Cluster mode, replication, sharding |
| Monitoring | Metrics, alerting, debugging |

πŸ”§ Supported LLM/Embedding Providers

| Provider | Embeddings | Chat/Completion | Local |
|---|---|---|---|
| OpenAI | βœ… text-embedding-3-large | βœ… GPT-4, GPT-4o | ❌ |
| Amazon Bedrock | βœ… Titan Embeddings | βœ… Claude 3.5, Llama | ❌ |
| Ollama | βœ… nomic-embed-text | βœ… Llama, Mistral | βœ… |
| Hugging Face | βœ… sentence-transformers | βœ… Various | βœ… |

πŸ“Š Performance Benchmarks

Typical performance improvements with Valkey RAG caching:

| Metric | Without Cache | With Valkey Cache | Improvement |
|---|---|---|---|
| P50 Latency | 800ms | 2ms | 400x faster |
| P99 Latency | 2500ms | 15ms | 166x faster |
| LLM API Calls | 100% | 15-30% | 70-85% reduction |
| Cost (per 1M queries) | $500 | $75-150 | 70-85% savings |

See benchmarks/ for detailed benchmark scripts and results.
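The cost figures follow directly from the cache-miss rate: only misses reach the LLM API. A quick sanity check, assuming a flat per-call LLM cost and negligible cost for serving hits (`cached_cost` is an illustrative helper, not part of the repo):

```python
def cached_cost(baseline_cost, miss_rate):
    """Expected LLM spend after caching: only cache misses hit the API.
    Ignores the comparatively tiny cost of serving cache hits."""
    return baseline_cost * miss_rate

# With 15-30% of queries missing the cache, $500 per 1M queries becomes:
low = cached_cost(500, 0.15)   # $75
high = cached_cost(500, 0.30)  # $150
```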

🀝 Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

πŸ“„ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

πŸ”— Resources


Built with ❀️ for the AI agent community
