0.20251013.0

Released by @ranaroussi on 13 Oct 16:52 · commit 966d1aa

Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching

  • 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
  • 50-80% cost savings: Dramatically reduces API costs through intelligent caching
  • Zero ongoing API costs: Uses local multilingual embedding model (paraphrase-multilingual-MiniLM-L12-v2)
  • Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms); see the sketch after this list
  • Streaming support: Artificial streaming for cached responses preserves natural UX
  • TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
  • 50+ language support: Multilingual semantic matching out of the box
  • LRU eviction: Memory-bounded with configurable max entries (default: 1000)
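
The two-tier lookup works roughly as follows. This is a minimal sketch, not onellm's implementation: the TwoTierCache class and its brute-force NumPy similarity search are assumptions for illustration (the library itself uses faiss); the embedding model name is the one shipped with this release.

import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


class TwoTierCache:
    """Illustrative two-tier cache: exact hash hit, then semantic fallback."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        self.exact = {}        # sha256(prompt) -> response
        self.vectors = []      # unit-norm prompt embeddings
        self.responses = []    # responses parallel to self.vectors

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        # Tier 1: exact hash match -- a dict lookup, microseconds
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Tier 2: semantic fallback -- embed the prompt, milliseconds
        if self.vectors:
            q = self.model.encode(prompt, normalize_embeddings=True)
            sims = np.stack(self.vectors) @ q   # cosine similarity (unit vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]
        return None  # miss: caller hits the API, then put()s the result

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.vectors.append(self.model.encode(prompt, normalize_embeddings=True))
        self.responses.append(response)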

Cache Configuration

import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,           # Maximum cache entries
    p=0.95,                     # Similarity threshold (0-1)
    hash_only=False,            # Enable semantic matching
    stream_chunk_strategy="words",  # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,      # Chunks per yield
    ttl=86400                   # Time-to-live in seconds (1 day)
)

# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Cache management
stats = onellm.cache_stats()    # Get hit/miss/entries stats
onellm.clear_cache()            # Clear all entries
onellm.disable_cache()          # Disable caching
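
To see a semantic hit end to end, the snippet below asks the same question twice in different words. It is a sketch built on the API shown above; whether a given paraphrase clears the threshold, and the exact fields returned by cache_stats(), are assumptions.

import onellm

onellm.init_cache(p=0.95)

# First call misses the cache and goes to the provider
onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

# A close paraphrase should clear the 0.95 similarity threshold and be
# served from the cache, with no API call and no added cost
onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Can you explain quantum computing?"}],
)

print(onellm.cache_stats())  # expect one miss, one hit, one entry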

Performance Benchmarks

  • Hash exact match: ~2µs (roughly 150,000-500,000x faster than a 300-1000ms API call)
  • Semantic match: ~18ms (roughly 17-55x faster than a typical API call)
  • Typical API call: 300-1000ms
  • Streaming simulation: Instant cached response with natural chunked delivery (sketched after this list)
  • Model download: One-time 118MB download (~13s on first init)
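
The artificial streaming can be pictured as below: a minimal sketch of the "words" strategy, not the library's code. The function name and the optional delay are invented for illustration; cached delivery is effectively instant.

import time


def stream_cached(text, chunk_length=8, delay=0.0):
    """Replay a cached response in word chunks to mimic live streaming.

    Sketches the "words" strategy; sentences/paragraphs/characters work
    the same way with different split units.
    """
    words = text.split()
    for i in range(0, len(words), chunk_length):
        yield " ".join(words[i:i + chunk_length]) + " "
        if delay:
            time.sleep(delay)  # purely cosmetic pacing


for chunk in stream_cached("Quantum computing uses qubits ...", chunk_length=2):
    print(chunk, end="", flush=True)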

Technical Details

  • Dependencies: sentence-transformers>=2.0.0 and faiss-cpu>=1.7.0 added to the core dependencies
  • Memory-only: In-memory cache for long-running processes (no persistence)
  • Thread-safe: OrderedDict-based LRU with atomic operations (see the sketch after this list)
  • Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
  • TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
  • Hash key filtering: Excludes non-semantic parameters (stream, timeout, metadata) from cache key
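
Taken together, those bullets describe a cache core along the lines of the sketch below. The class and helper are illustrative reconstructions from this list, not onellm's source; names like LRUTTLCache and cache_key are invented.

import hashlib
import time
from collections import OrderedDict
from threading import Lock

NON_SEMANTIC = {"stream", "timeout", "metadata"}  # excluded from the cache key


def cache_key(params):
    """Hash only the parameters that affect response content."""
    semantic = {k: v for k, v in sorted(params.items()) if k not in NON_SEMANTIC}
    return hashlib.sha256(repr(semantic).encode("utf-8")).hexdigest()


class LRUTTLCache:
    """LRU cache with TTL refresh-on-access, guarded by a single lock."""

    def __init__(self, max_entries=1000, ttl=86400):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data = OrderedDict()   # key -> (value, expires_at)
        self._lock = Lock()

    def get(self, key):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            value, expires_at = item
            if time.time() > expires_at:
                del self._data[key]  # lazily drop expired entries
                return None
            # A hit refreshes both the TTL and the LRU position
            self._data[key] = (value, time.time() + self.ttl)
            self._data.move_to_end(key)
            return value

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.time() + self.ttl)
            self._data.move_to_end(key)
            while len(self._data) > self.max_entries:
                self._data.popitem(last=False)  # evict least recently used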

Documentation

  • New docs: Comprehensive docs/caching.md with architecture, usage, and best practices
  • Updated README: Highlighted semantic caching in Key Features and Advanced Features
  • Updated docs: Added caching to docs/README.md, docs/advanced-features.md, and docs/quickstart.md
  • Examples: Added examples/cache_example.py demonstrating all cache features

Use Cases

Ideal for:

  • High-traffic web applications with repeated queries
  • Interactive demos and chatbots
  • Development and testing environments
  • API cost optimization
  • Latency-sensitive applications

Less suited for:

  • Stateless serverless functions (short-lived processes)
  • Highly unique, non-repetitive queries
  • Contexts requiring strict data freshness

What's Changed

  • feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5

Full Changelog: 0.20251008.0...0.20251013.0