Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching
- 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
- 50-80% cost savings: Dramatically reduces API costs through intelligent caching
- Zero ongoing API costs: Uses a local multilingual embedding model (`paraphrase-multilingual-MiniLM-L12-v2`)
- Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms); see the sketch after this list
- Streaming support: Artificial streaming for cached responses preserves natural UX
- TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
- 50+ language support: Multilingual semantic matching out of the box
- LRU eviction: Memory-bounded with configurable max entries (default: 1000)
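Conceptually, a lookup hits the fast tier first and only falls back to embeddings on a miss. The sketch below is illustrative rather than onellm's internal code: the `TwoTierCache` class is hypothetical, and it uses a plain NumPy dot product where onellm lists `faiss-cpu` as a dependency for the vector search.

```python
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer

class TwoTierCache:
    """Hypothetical sketch: exact hash lookup with a semantic-similarity fallback."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.exact = {}        # sha256(prompt) -> response
        self.entries = []      # (normalized embedding, response) pairs
        self.model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        # Tier 1: exact hash match (microsecond-scale dict lookup)
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Tier 2: semantic match (millisecond-scale, requires embedding the prompt)
        if not self.entries:
            return None
        query = self.model.encode(prompt, normalize_embeddings=True)
        vectors = np.stack([vec for vec, _ in self.entries])
        scores = vectors @ query          # cosine similarity (vectors are normalized)
        best = int(np.argmax(scores))
        return self.entries[best][1] if scores[best] >= self.threshold else None

    def put(self, prompt: str, response: str):
        self.exact[self._key(prompt)] = response
        self.entries.append((self.model.encode(prompt, normalize_embeddings=True), response))
```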
Cache Configuration
```python
import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,                # Maximum cache entries
    p=0.95,                          # Similarity threshold (0-1)
    hash_only=False,                 # Enable semantic matching
    stream_chunk_strategy="words",   # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,           # Chunks per yield
    ttl=86400                        # Time-to-live in seconds (1 day)
)
```
```python
# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
```
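Streaming requests go through the same cache. Assuming onellm accepts the OpenAI-style `stream=True` flag (and since `stream` is excluded from the cache key, see Technical Details below), a repeat of the request above is replayed as artificial chunks according to the `stream_chunk_strategy` and `stream_chunk_length` configured in `init_cache`; the chunk format below is assumed to follow the OpenAI delta shape:

```python
# Same messages as above, so this request is served from the cache and
# replayed as artificial chunks (here: 8 words per yield, per init_cache).
stream = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    # Chunk shape assumed to follow the OpenAI-style delta format.
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```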
```python
# Cache management
stats = onellm.cache_stats()   # Get hit/miss/entries stats
onellm.clear_cache()           # Clear all entries
onellm.disable_cache()         # Disable caching
```

Performance Benchmarks
- Hash exact match: ~2µs (roughly 150,000-500,000x faster than a typical API call)
- Semantic match: ~18ms (roughly 17-55x faster than a typical API call)
- Typical API call: 300-1000ms (a quick way to check these timings yourself is sketched after this list)
- Streaming simulation: Instant cached response with natural chunked delivery
- Model download: One-time 118MB download (~13s on first init)
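A rough way to sanity-check these numbers in your own environment, using only `onellm.cache_stats()` from above and the standard library (exact figures will vary by machine and provider):

```python
import time

import onellm

def timed_call() -> float:
    start = time.perf_counter()
    onellm.ChatCompletion.create(
        model="openai/gpt-4",
        messages=[{"role": "user", "content": "Explain quantum computing"}],
    )
    return time.perf_counter() - start

first = timed_call()    # goes to the provider (hundreds of milliseconds)
second = timed_call()   # identical request: served from the cache (microseconds)
print(f"miss: {first:.4f}s  hit: {second:.6f}s")
print(onellm.cache_stats())   # hit/miss/entries counters should reflect one hit
```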
Technical Details
- Dependencies: Added `sentence-transformers>=2.0.0` and `faiss-cpu>=1.7.0` to core dependencies
- Memory-only: In-memory cache for long-running processes (no persistence)
- Thread-safe: OrderedDict-based LRU with atomic operations
- Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
- TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
- Hash key filtering: Excludes non-semantic parameters (`stream`, `timeout`, `metadata`) from the cache key (sketched below)
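For illustration only (onellm's actual key construction may differ), hash key filtering amounts to dropping the non-semantic parameters before hashing, so requests that differ only in `stream` or `timeout` map to the same cache entry; `cache_key` below is a hypothetical helper:

```python
import hashlib
import json

# Parameters that do not affect the semantic content of the response.
NON_SEMANTIC = {"stream", "timeout", "metadata"}

def cache_key(model: str, messages: list, **params) -> str:
    # Keep only parameters that change the meaning of the request.
    semantic = {k: v for k, v in params.items() if k not in NON_SEMANTIC}
    payload = json.dumps(
        {"model": model, "messages": messages, **semantic},
        sort_keys=True,          # stable ordering -> stable hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# stream/timeout differences map to the same key:
k1 = cache_key("openai/gpt-4", [{"role": "user", "content": "hi"}], stream=True, timeout=30)
k2 = cache_key("openai/gpt-4", [{"role": "user", "content": "hi"}], stream=False)
assert k1 == k2
```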
Documentation
- New docs: Comprehensive `docs/caching.md` with architecture, usage, and best practices
- Updated README: Highlighted semantic caching in Key Features and Advanced Features
- Updated docs: Added caching to `docs/README.md`, `docs/advanced-features.md`, and `docs/quickstart.md`
- Examples: Added `examples/cache_example.py` demonstrating all cache features
Use Cases
Ideal for:
- High-traffic web applications with repeated queries
- Interactive demos and chatbots
- Development and testing environments
- API cost optimization
- Latency-sensitive applications
Less suitable for:
- Stateless serverless functions (short-lived processes)
- Highly unique, non-repetitive queries
- Contexts requiring strict data freshness
What's Changed
- feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5
Full Changelog: 0.20251008.0...0.20251013.0