0.20251013.0

Released by @ranaroussi on 13 Oct 16:52 · commit 966d1aa

Semantic Caching: Blazing-fast in-memory cache with intelligent semantic matching

  • 42,000-143,000x faster responses: Cache hits return in ~7µs vs 300-1000ms for API calls
  • 50-80% cost savings: Dramatically reduces API costs through intelligent caching
  • Zero ongoing API costs: Uses local multilingual embedding model (paraphrase-multilingual-MiniLM-L12-v2)
  • Two-tier matching: Hash-based exact matching (~2µs) with semantic similarity fallback (~18ms); see the sketch after this list
  • Streaming support: Artificial streaming for cached responses preserves natural UX
  • TTL with refresh-on-access: Configurable time-to-live (default: 86400s / 1 day)
  • 50+ language support: Multilingual semantic matching out of the box
  • LRU eviction: Memory-bounded with configurable max entries (default: 1000)
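
The two-tier lookup works roughly as follows. This is a minimal sketch, not onellm's implementation: the TwoTierCache class and its brute-force NumPy similarity search are assumptions for illustration (the library itself uses faiss); the embedding model name is the one shipped with this release.

import hashlib

import numpy as np
from sentence_transformers import SentenceTransformer


class TwoTierCache:
    """Illustrative two-tier cache: exact hash hit, then semantic fallback."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
        self.exact = {}        # sha256(prompt) -> response
        self.vectors = []      # unit-norm prompt embeddings
        self.responses = []    # responses parallel to self.vectors

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt):
        # Tier 1: exact hash match -- a dict lookup, microseconds
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit
        # Tier 2: semantic fallback -- embed the prompt, milliseconds
        if self.vectors:
            q = self.model.encode(prompt, normalize_embeddings=True)
            sims = np.stack(self.vectors) @ q   # cosine similarity (unit vectors)
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]
        return None  # miss: caller hits the API, then put()s the result

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.vectors.append(self.model.encode(prompt, normalize_embeddings=True))
        self.responses.append(response)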

Cache Configuration

import onellm

# Initialize semantic cache
onellm.init_cache(
    max_entries=1000,           # Maximum cache entries
    p=0.95,                     # Similarity threshold (0-1)
    hash_only=False,            # Enable semantic matching
    stream_chunk_strategy="words",  # Streaming chunking: words/sentences/paragraphs/characters
    stream_chunk_length=8,      # Chunks per yield
    ttl=86400                   # Time-to-live in seconds (1 day)
)

# Use cache with any provider
response = onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# Cache management
stats = onellm.cache_stats()    # Get hit/miss/entries stats
onellm.clear_cache()            # Clear all entries
onellm.disable_cache()          # Disable caching
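
To see a semantic hit end to end, the snippet below asks the same question twice in different words. It is a sketch built on the API shown above; whether a given paraphrase clears the threshold, and the exact fields returned by cache_stats(), are assumptions.

import onellm

onellm.init_cache(p=0.95)

# First call misses the cache and goes to the provider
onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)

# A close paraphrase should clear the 0.95 similarity threshold and be
# served from the cache, with no API call and no added cost
onellm.ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Can you explain quantum computing?"}],
)

print(onellm.cache_stats())  # expect one miss, one hit, one entry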

Performance Benchmarks

  • Hash exact match: ~2µs (roughly 150,000-500,000x faster than a 300-1000ms API call)
  • Semantic match: ~18ms (roughly 17-55x faster than a typical API call)
  • Typical API call: 300-1000ms
  • Streaming simulation: Instant cached response with natural chunked delivery (sketched after this list)
  • Model download: One-time 118MB download (~13s on first init)
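
The artificial streaming can be pictured as below: a minimal sketch of the "words" strategy, not the library's code. The function name and the optional delay are invented for illustration; cached delivery is effectively instant.

import time


def stream_cached(text, chunk_length=8, delay=0.0):
    """Replay a cached response in word chunks to mimic live streaming.

    Sketches the "words" strategy; sentences/paragraphs/characters work
    the same way with different split units.
    """
    words = text.split()
    for i in range(0, len(words), chunk_length):
        yield " ".join(words[i:i + chunk_length]) + " "
        if delay:
            time.sleep(delay)  # purely cosmetic pacing


for chunk in stream_cached("Quantum computing uses qubits ...", chunk_length=2):
    print(chunk, end="", flush=True)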

Technical Details

  • Dependencies: sentence-transformers>=2.0.0 and faiss-cpu>=1.7.0 added to the core dependencies
  • Memory-only: In-memory cache for long-running processes (no persistence)
  • Thread-safe: OrderedDict-based LRU with atomic operations (see the sketch after this list)
  • Streaming chunking: Four strategies (words, sentences, paragraphs, characters) for natural streaming UX
  • TTL refresh: Cache hits refresh TTL, keeping frequently-used entries alive
  • Hash key filtering: Excludes non-semantic parameters (stream, timeout, metadata) from cache key
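
Taken together, those bullets describe a cache core along the lines of the sketch below. The class and helper are illustrative reconstructions from this list, not onellm's source; names like LRUTTLCache and cache_key are invented.

import hashlib
import time
from collections import OrderedDict
from threading import Lock

NON_SEMANTIC = {"stream", "timeout", "metadata"}  # excluded from the cache key


def cache_key(params):
    """Hash only the parameters that affect response content."""
    semantic = {k: v for k, v in sorted(params.items()) if k not in NON_SEMANTIC}
    return hashlib.sha256(repr(semantic).encode("utf-8")).hexdigest()


class LRUTTLCache:
    """LRU cache with TTL refresh-on-access, guarded by a single lock."""

    def __init__(self, max_entries=1000, ttl=86400):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data = OrderedDict()   # key -> (value, expires_at)
        self._lock = Lock()

    def get(self, key):
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            value, expires_at = item
            if time.time() > expires_at:
                del self._data[key]  # lazily drop expired entries
                return None
            # A hit refreshes both the TTL and the LRU position
            self._data[key] = (value, time.time() + self.ttl)
            self._data.move_to_end(key)
            return value

    def put(self, key, value):
        with self._lock:
            self._data[key] = (value, time.time() + self.ttl)
            self._data.move_to_end(key)
            while len(self._data) > self.max_entries:
                self._data.popitem(last=False)  # evict least recently used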

Documentation

  • New docs: Comprehensive docs/caching.md with architecture, usage, and best practices
  • Updated README: Highlighted semantic caching in Key Features and Advanced Features
  • Updated docs: Added caching to docs/README.md, docs/advanced-features.md, and docs/quickstart.md
  • Examples: Added examples/cache_example.py demonstrating all cache features

Use Cases

Ideal for:

  • High-traffic web applications with repeated queries
  • Interactive demos and chatbots
  • Development and testing environments
  • API cost optimization
  • Latency-sensitive applications

Less suited for:

  • Stateless serverless functions (short-lived processes)
  • Highly unique, non-repetitive queries
  • Contexts requiring strict data freshness

What's Changed

  • feat: add blazing-fast semantic caching with multilingual support by @ranaroussi in #5

Full Changelog: 0.20251008.0...0.20251013.0