feat: Add LLM call memoization for llm_query (Issue #82) #118
zamal-db wants to merge 1 commit into alexzhang13:main from
Conversation
Pull request overview
This PR implements LLM call memoization to reduce redundant API calls during recursive RLM execution, addressing Issue #82. The implementation provides a thread-safe LRU cache with optional TTL for llm_query and llm_query_batched calls.
Changes:
- Added comprehensive LLMCallCache implementation with LRU eviction and optional TTL
- Integrated cache support into RLM and LocalREPL for the "local" environment type
- Added extensive unit tests covering cache operations, thread safety, LRU behavior, and TTL expiration
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| rlm/utils/cache.py | New cache implementation with LLMCallCache, CacheEntry, CacheStats classes and factory function |
| rlm/utils/__init__.py | Exports cache classes for public API |
| rlm/core/rlm.py | Adds cache parameter to RLM constructor and passes it to local environments |
| rlm/environments/local_repl.py | Integrates cache into _llm_query and _llm_query_batched methods |
| tests/test_cache.py | Comprehensive test suite for cache functionality |
```python
        message token count reaches this fraction of the model context limit (default 0.85).
    cache: Optional LLMCallCache for memoizing llm_query calls. When provided, identical prompts
        will return cached responses instead of making redundant API calls. Useful for recursive
        workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
```
The cache parameter documentation should mention that caching is only supported for the 'local' environment type. Users trying to use cache with other environment types (modal, docker, daytona, prime, e2b) will silently have their cache ignored without any warning or error, which could be confusing.
Consider either:
- Adding a note in the docstring: "Note: caching is only supported for environment='local'"
- Raising a warning or error if cache is provided with a non-local environment
- Implementing cache support for other environment types (if feasible)
Suggested change:
```python
        workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
        Note: caching is only supported for environment='local'; for other environments, the cache is ignored.
```
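The second option (raising a warning) could be sketched as a small guard in the constructor path. The function name and signature below are illustrative, not part of this PR:

```python
import warnings


def validate_cache_environment(cache, environment: str) -> None:
    # Hypothetical guard (names are illustrative, not the PR's API):
    # surface the silent-ignore case instead of discarding a
    # user-supplied cache without notice.
    if cache is not None and environment != "local":
        warnings.warn(
            "cache is only supported for environment='local'; "
            f"it will be ignored for environment={environment!r}",
            stacklevel=2,
        )
```

Called from `RLM.__init__`, this would make the limitation visible at construction time rather than leaving users to discover a zero hit rate later.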
rlm/environments/local_repl.py
Outdated
```diff
 results_list = []
 for response in responses:
     if not response.success:
-        results.append(f"Error: {response.error}")
+        results_list.append(f"Error: {response.error}")
     else:
         self._pending_llm_calls.append(response.chat_completion)
-        results.append(response.chat_completion.response)
+        results_list.append(response.chat_completion.response)

-return results
+return results_list
```
Inconsistent variable naming: In the cached path (line 299), the results are stored in 'results', while in the non-cached path (line 336), they're stored in 'results_list'. Both represent the same thing (a list of string responses) and should use the same name for consistency.
Recommend using 'results' for both paths to improve code readability.
rlm/environments/local_repl.py
Outdated
```python
# Cache the successful response
self.cache.set(prompt, model, result)

return [r if r is not None else "Error: Unexpected cache failure" for r in results]
```
The error message "Error: Unexpected cache failure" on line 329 is a defensive check that should never occur in normal operation. However, if it does occur, the error message doesn't provide enough information for debugging.
Consider making this more informative, e.g., including the index: f"Error: Unexpected cache failure for prompt at index {i}". Alternatively, raise an AssertionError instead since this represents a bug in the cache logic if it ever occurs.
Suggested change:
```python
return [
    r if r is not None else f"Error: Unexpected cache failure for prompt at index {i}"
    for i, r in enumerate(results)
]
```
```diff
 # Define the actual API call function
 def make_api_call() -> str:
     request = LMRequest(prompt=prompt, model=model, depth=self.depth)
     response = send_lm_request(self.lm_handler_address, request)

     if not response.success:
-        return f"Error: {response.error}"
+        raise RuntimeError(response.error)

     self._pending_llm_calls.append(response.chat_completion)
     return response.chat_completion.response
```
When a cached response is returned, the RLMChatCompletion is not appended to _pending_llm_calls (line 267 only executes on cache miss). This means cached responses won't appear in the REPLResult.rlm_calls list returned by execute_code().
While this might be intentional (cached calls don't consume tokens/cost), it creates an inconsistency where some llm_query calls are tracked and others aren't. Consider either:
- Always tracking calls with a flag indicating cache hits (preserves full call history)
- Documenting this behavior clearly in the cache documentation
The current behavior could confuse users debugging their RLM execution traces.
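The first option (always tracking calls with a cache-hit flag) could look roughly like this. `TrackedCall` and `CallTrace` are illustrative names, not the PR's actual types:

```python
from dataclasses import dataclass


@dataclass
class TrackedCall:
    # Hypothetical record type: every llm_query appears in the trace,
    # with cached responses flagged rather than silently omitted.
    prompt: str
    response: str
    cache_hit: bool


class CallTrace:
    """Sketch of a trace that keeps cache hits visible for debugging."""

    def __init__(self):
        self.calls: list[TrackedCall] = []

    def record(self, prompt: str, response: str, cache_hit: bool) -> None:
        self.calls.append(TrackedCall(prompt, response, cache_hit))


trace = CallTrace()
trace.record("fib(3)", "2", cache_hit=False)
trace.record("fib(3)", "2", cache_hit=True)  # cached repeat still appears
```

With this shape, `REPLResult.rlm_calls` could filter on `cache_hit` when computing token cost while still showing the full call history.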
rlm/environments/local_repl.py
Outdated
```python
        self.cache.stats.record_hit()
    else:
        uncached_prompts.append((i, prompt))
        self.cache.stats.record_miss()
```
When cache.get() returns None due to an expired entry, it increments stats.expirations (cache.py:214), but this code then also calls record_miss() (line 310). This means expired entries are counted as both expirations AND misses, which could be confusing when analyzing cache statistics.
The semantics are unclear: should an expiration be considered a type of miss, or should they be mutually exclusive? Consider either:
- Documenting that expirations are a subset of misses
- Having get() return a sentinel value to distinguish "not found" from "expired" so they can be counted separately
- Not counting expirations as misses if they're already tracked separately
Suggested change:
```python
            self.cache.stats.record_hit()
        else:
            uncached_prompts.append((i, prompt))
```
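The sentinel approach from the second option can be sketched as a three-way lookup outcome, so hits, misses, and expirations become mutually exclusive counts. The names here are illustrative, and the `(created_at, value)` entry shape is an assumption about the cache's internals:

```python
from enum import Enum


class Lookup(Enum):
    HIT = "hit"
    MISS = "miss"
    EXPIRED = "expired"


def classify_entry(entry, ttl_seconds: float, now: float) -> Lookup:
    # entry is assumed to be (created_at, value) or None; this mirrors a
    # TTL check but reports expiration as its own outcome instead of
    # folding it into misses.
    if entry is None:
        return Lookup.MISS
    created_at, _value = entry
    if now - created_at > ttl_seconds:
        return Lookup.EXPIRED
    return Lookup.HIT
```

`get()` could then return `(value, Lookup)` and let the stats layer decide whether expirations also count as misses.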
```python
# Try cache first
cached = self.get(prompt, model)
if cached is not None:
    with self._lock:
        self._stats.hits += 1
    return cached, True

# Cache miss - make the call
with self._lock:
    self._stats.misses += 1

result = call_fn()

# Store in cache
self.set(prompt, model, result)

return result, False
```
Race condition: Multiple threads can make redundant API calls for the same prompt. The get_or_call method checks the cache (line 281), releases the lock, then makes the API call (line 291). If multiple threads call get_or_call with the same prompt simultaneously, they will all see a cache miss and all make API calls.
This "thundering herd" problem means the cache won't prevent redundant API calls in concurrent scenarios with identical prompts. A common solution is to use a "single-flight" pattern where only the first thread makes the call and others wait for its result. This can be implemented with per-key locks or condition variables.
While this may be acceptable depending on usage patterns, it should be documented as a known limitation if not fixed.
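A minimal single-flight sketch (standalone illustration, not the PR's API) shows the shape of the fix: the first caller for a key performs the call, and concurrent callers block on an event until the result is published:

```python
import threading


class SingleFlight:
    """Per-key coordination: one in-flight call per key; followers wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, str] = {}

    def do(self, key: str, call_fn):
        with self._lock:
            if key in self._results:
                return self._results[key]
            event = self._inflight.get(key)
            if event is None:
                # First caller for this key becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            result = call_fn()
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            event.set()
            return result
        # Followers wait for the leader to publish, then read the result.
        event.wait()
        with self._lock:
            return self._results[key]
```

This sketch assumes `call_fn` succeeds; a production version would also need to propagate leader failures to waiting followers and handle TTL-based invalidation.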
```python
hash1 = LLMCallCache._hash_key(unicode_prompt, "model")
hash2 = LLMCallCache._hash_key(unicode_prompt, "model")

assert hash1 == hash2
```
The test suite comprehensively tests the LLMCallCache in isolation, but there are no integration tests that verify the cache works correctly when integrated with RLM and LocalREPL.
Consider adding integration tests that:
- Create an RLM with a cache and verify identical prompts are cached
- Test that cache statistics are correctly tracked through actual RLM execution
- Verify the interaction between caching and _pending_llm_calls tracking
- Test that cached vs non-cached calls produce identical results from the user's perspective
Suggested change:
```python
assert hash1 == hash2


class FakeRLM:
    """Minimal RLM-like class to test integration with LLMCallCache.

    This simulates how an RLM might:
    - Use an LLMCallCache for prompt/model pairs.
    - Track pending LLM calls via a _pending_llm_calls set.
    """

    def __init__(self, cache: LLMCallCache):
        self.cache = cache
        self._pending_llm_calls = set()
        self._call_count = 0

    def llm_call(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate an LLM call that goes through the cache."""
        key = (prompt, model)
        self._pending_llm_calls.add(key)

        def do_call() -> str:
            # Underlying model call that should only run on cache misses.
            self._call_count += 1
            return f"response for {prompt!r} with model {model!r}"

        try:
            # get_or_call returns (result, was_cached); only the result
            # is surfaced to the caller.
            result, _was_cached = self.cache.get_or_call(prompt, model, do_call)
            return result
        finally:
            # Ensure pending calls tracking is always cleaned up.
            self._pending_llm_calls.discard(key)


class FakeLocalREPL:
    """Minimal LocalREPL-like wrapper around an RLM instance."""

    def __init__(self, rlm: FakeRLM):
        self._rlm = rlm

    def run_prompt(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate a user sending a prompt through a REPL."""
        return self._rlm.llm_call(prompt, model)


class TestCacheIntegrationWithRLMAndLocalREPL:
    """Integration-style tests for LLMCallCache with RLM and LocalREPL.

    These tests verify:
    1. Identical prompts are cached when going through an RLM and REPL.
    2. Cache statistics are updated through real execution paths.
    3. _pending_llm_calls interacts correctly with cached/non-cached calls.
    4. From the user's perspective, cached vs non-cached calls behave identically.
    """

    def test_identical_prompts_are_cached_via_rlm_and_repl(self):
        """Create an RLM with a cache and verify identical prompts are cached."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        prompt = "Explain caching."

        # First call should be a cache miss and invoke the underlying model.
        response1 = repl.run_prompt(prompt, model="gpt-4")
        # Second call with identical prompt/model should be served from cache.
        response2 = repl.run_prompt(prompt, model="gpt-4")

        # From the user's perspective, responses must be identical.
        assert response1 == response2
        # Underlying model should have been invoked only once.
        assert rlm._call_count == 1

        # Verify cache statistics reflect one miss (first call) and one hit (second).
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 1
        assert stats.hits == 1

    def test_pending_llm_calls_tracking_with_cache(self):
        """Verify interaction between caching and _pending_llm_calls tracking."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)

        prompt = "Track pending calls."
        model = "gpt-4"

        # On first call, we should see the key added then removed.
        response1 = rlm.llm_call(prompt, model=model)
        assert isinstance(response1, str)
        assert rlm._call_count == 1
        # After the call completes, there should be no pending calls.
        assert not rlm._pending_llm_calls

        # Second call should be served from cache; pending set still must be cleaned.
        response2 = rlm.llm_call(prompt, model=model)
        assert response2 == response1
        assert rlm._call_count == 1  # still only the original miss
        assert not rlm._pending_llm_calls

    def test_cached_and_non_cached_calls_identical_from_user_perspective(self):
        """Ensure cached vs non-cached calls produce identical visible results."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        # First prompt will have a cache miss on first call, hit on second.
        prompt_cached = "What is an LRU cache?"
        # Second prompt is only called once (always a miss).
        prompt_uncached = "Describe TTL-based expiration."

        # Calls that will be cached.
        cached_first = repl.run_prompt(prompt_cached)
        cached_second = repl.run_prompt(prompt_cached)
        # Call that is never repeated (always non-cached at the time of the call).
        uncached = repl.run_prompt(prompt_uncached)

        # User-visible behavior: repeated calls return the same response.
        assert cached_first == cached_second
        # Different prompts yield different responses.
        assert cached_first != uncached

        # Stats: two distinct prompts, three total calls, with one cache hit.
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 2  # first time for each distinct prompt
        assert stats.hits == 1  # second call for prompt_cached

        # Underlying model should have run exactly once per distinct prompt.
        assert rlm._call_count == 2
```
Force-pushed from 5c29380 to e43c7f9 (compare)
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
tests/test_cache.py
Outdated
```python
def test_expired_entry_returns_none(self):
    """Test that expired entries return None."""
    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL

    cache.set("prompt", "model", "response")

    # Immediate access should work
    assert cache.get("prompt", "model") == "response"

    # Wait for expiration
    time.sleep(0.15)

    # Should be expired
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self):
    """Test that expirations update statistics."""
    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL

    cache.set("prompt", "model", "response")

    time.sleep(0.1)

    cache.get("prompt", "model")  # Triggers expiration check
```
These TTL tests depend on very small sleep intervals (100–150ms), which can be flaky on slower/loaded CI runners. Prefer controlling time (e.g., monkeypatch time.time) or increasing the margin significantly so the test is deterministic.
Suggested change:
```python
def test_expired_entry_returns_none(self, monkeypatch):
    """Test that expired entries return None."""
    # Control time to make the test deterministic instead of relying on sleep.
    base_time = 1_000.0
    current_time = base_time

    def fake_time():
        return current_time

    # Patch the time used inside the cache implementation.
    monkeypatch.setattr("rlm.utils.cache.time.time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
    cache.set("prompt", "model", "response")

    # Immediate access should work at base_time.
    assert cache.get("prompt", "model") == "response"

    # Advance time beyond the TTL to trigger expiration.
    current_time = base_time + 0.2

    # Should be expired now.
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self, monkeypatch):
    """Test that expirations update statistics."""
    base_time = 2_000.0
    current_time = base_time

    def fake_time():
        return current_time

    # Patch the time used inside the cache implementation.
    monkeypatch.setattr("rlm.utils.cache.time.time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
    cache.set("prompt", "model", "response")

    # Advance time beyond the TTL so that the next access expires the entry.
    current_time = base_time + 0.1

    cache.get("prompt", "model")  # Triggers expiration check
```
tests/test_cache.py
Outdated
This TTL expiration test uses a 50ms TTL and sleeps 100ms; that tight timing can be flaky in CI. Consider using a mocked clock (monkeypatch time.time) or a much larger TTL/sleep delta to avoid intermittent failures.
Suggested change:
```python
def test_expired_entry_returns_none(self, monkeypatch):
    """Test that expired entries return None."""
    # Use a mocked clock to avoid flaky timing-based tests.
    fake_now = [1000.0]

    def fake_time() -> float:
        return fake_now[0]

    monkeypatch.setattr(time, "time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
    cache.set("prompt", "model", "response")

    # Immediate access should work
    assert cache.get("prompt", "model") == "response"

    # Advance time beyond the TTL to simulate expiration
    fake_now[0] += 0.11

    # Should be expired
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self, monkeypatch):
    """Test that expirations update statistics."""
    # Use a mocked clock to deterministically trigger expiration.
    fake_now = [2000.0]

    def fake_time() -> float:
        return fake_now[0]

    monkeypatch.setattr(time, "time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
    cache.set("prompt", "model", "response")

    # Advance time beyond the TTL so the next get triggers expiration.
    fake_now[0] += 0.06

    cache.get("prompt", "model")  # Triggers expiration check
```
```python
# Normalize model to string
model_str = model if model else "__default__"

# Combine prompt and model for the key
content = f"{model_str}:{prompt}"

# Use SHA-256 for good distribution and collision resistance
return hashlib.sha256(content.encode("utf-8")).hexdigest()
```
_hash_key builds the hash input as f"{model_str}:{prompt}", which can collide when model contains ':' or prompt begins with ':' (e.g., model='a:' + prompt='b' equals model='a' + prompt=':b'). Use an unambiguous encoding (e.g., length-prefixing, '\0' separator, or hashing (model, prompt) as structured bytes/JSON) to avoid cache key collisions.
```python
# Evict oldest if at capacity
while len(self._cache) >= self.max_size:
    self._cache.popitem(last=False)
    self._stats.evictions += 1
```
LLMCallCache.set() can raise or loop incorrectly when max_size <= 0 (e.g., while len(_cache) >= max_size will be true on an empty cache). Consider validating max_size >= 1 in post_init (or treating <=0 as 'disable caching') to prevent KeyError/infinite eviction loops.
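A sketch of the validation (illustrative class, not the PR's code): rejecting non-positive sizes up front guarantees the eviction loop can never call `popitem()` on an empty `OrderedDict`:

```python
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_size: int):
        if max_size < 1:
            raise ValueError(f"max_size must be >= 1, got {max_size}")
        self.max_size = max_size
        self._cache: OrderedDict[str, str] = OrderedDict()

    def set(self, key: str, value: str) -> None:
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = value
        # Evict oldest entries; the constructor guard ensures this
        # terminates and never pops from an empty dict.
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)
```

Alternatively, `max_size <= 0` could be defined to mean "caching disabled", but that choice should be explicit and documented either way.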
rlm/utils/cache.py
Outdated
```python
This is the primary method for cache integration. It handles the full
cache-check -> call -> cache-store flow atomically.
```
The get_or_call docstring claims the cache-check -> call -> cache-store flow is atomic, but the implementation releases the lock before call_fn() and doesn't coordinate concurrent misses. Multiple threads can call the API for the same (prompt, model) simultaneously, defeating memoization. Either adjust the docstring or implement single-flight behavior (per-key lock/event/future) so only one in-flight call populates the cache.
Suggested change:
```python
This is the primary method for cache integration. It performs the full
cache-check -> call -> cache-store flow, but this sequence is not
guaranteed to be atomic across threads. Concurrent cache misses for
the same (prompt, model) may result in multiple calls to ``call_fn``.
```
```python
def stats(self) -> CacheStats:
    """Get cache statistics (read-only view)."""
    return self._stats
```
stats is documented as a "read-only view" but returns the mutable internal CacheStats object directly, allowing callers to mutate counters without LLMCallCache's lock (breaking the thread-safety guarantee). Consider returning a copy/snapshot (e.g., to_dict()) or providing locked accessor methods for reading/updating stats.
Suggested change:
```python
def stats(self) -> dict[str, Any]:
    """Get cache statistics as a read-only snapshot."""
    return self._stats.to_dict()
```
```python
for i, prompt in enumerate(prompts):
    cached = self.cache.get(prompt, model)
    if cached is not None:
        results[i] = cached
        self.cache.stats.record_hit()
    else:
        uncached_prompts.append((i, prompt))
        self.cache.stats.record_miss()
```
In the caching path, _llm_query_batched mutates cache.stats via record_hit()/record_miss() without any synchronization. Since LLMCallCache claims thread safety, these counter updates should be performed under the cache's lock (e.g., via a cache method that records hits/misses internally), otherwise concurrent batched calls can corrupt stats.
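One way to close this gap is to make hit/miss recording a method on the cache that takes its own lock, rather than exposing mutable counters to callers. The names below are illustrative:

```python
import threading
from dataclasses import dataclass


@dataclass
class _Stats:
    hits: int = 0
    misses: int = 0


class LockedStats:
    """Sketch: all counter updates go through one method under the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = _Stats()

    def record(self, hit: bool) -> None:
        with self._lock:
            if hit:
                self._stats.hits += 1
            else:
                self._stats.misses += 1

    def snapshot(self) -> _Stats:
        # Return a copy so callers cannot mutate the live counters.
        with self._lock:
            return _Stats(self._stats.hits, self._stats.misses)
```

`_llm_query_batched` would then call `cache.stats.record(hit=True)` style methods instead of touching `hits`/`misses` fields directly.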
```python
    else:
        return make_api_call()
except Exception as e:
    return f"Error: LM query failed - {e}"
```
_llm_query returns errors as "Error: LM query failed - ..." while _llm_query_batched returns "Error: {response.error}" for per-prompt failures. This inconsistency makes error handling brittle for callers; consider standardizing the error format between single and batched query paths (including when using the cache).
Suggested change:
```python
return f"Error: {e}"
```
Implements memoization for llm_query and llm_query_batched to cache identical prompts and avoid redundant API calls during recursive RLM execution.

Problem: Recursive workloads (e.g., Fibonacci-like decomposition) recompute identical subproblems, causing exponential token cost.

Solution: Thread-safe LRU cache keyed on (prompt, model) with optional TTL.

Usage:
```python
from rlm import RLM
from rlm.utils import LLMCallCache

cache = LLMCallCache(max_size=1000)
rlm = RLM(backend='openai', cache=cache)
```

Changes:
- rlm/utils/cache.py: Core cache implementation
- rlm/environments/local_repl.py: Integration with _llm_query
- rlm/core/rlm.py: cache parameter (local env only)
- tests/test_cache.py: 47 unit + integration tests

Behavior:
- Opt-in (no breaking changes)
- Zero changes to RLM recursion logic
- Stats available via cache.stats.hit_rate

Closes alexzhang13#82
Force-pushed from e43c7f9 to 2e5be6a (compare)
The bulk of the diff is a self-contained cache module (422 lines) and its test suite (750 lines). These have zero dependencies on RLM internals and could live as a separate package. The actual integration is surgical: one new parameter on RLM.__init__, a cache check in two LocalREPL methods, and some exports. That's it.