feat: Add LLM call memoization for llm_query (Issue #82) #118
zamal-db wants to merge 1 commit into alexzhang13:main from
Conversation
Pull request overview
This PR implements LLM call memoization to reduce redundant API calls during recursive RLM execution, addressing Issue #82. The implementation provides a thread-safe LRU cache with optional TTL for llm_query and llm_query_batched calls.
Changes:
- Added comprehensive LLMCallCache implementation with LRU eviction and optional TTL
- Integrated cache support into RLM and LocalREPL for the "local" environment type
- Added extensive unit tests covering cache operations, thread safety, LRU behavior, and TTL expiration
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| rlm/utils/cache.py | New cache implementation with LLMCallCache, CacheEntry, CacheStats classes and factory function |
| rlm/utils/__init__.py | Exports cache classes for public API |
| rlm/core/rlm.py | Adds cache parameter to RLM constructor and passes it to local environments |
| rlm/environments/local_repl.py | Integrates cache into _llm_query and _llm_query_batched methods |
| tests/test_cache.py | Comprehensive test suite for cache functionality |
```python
        message token count reaches this fraction of the model context limit (default 0.85).
    cache: Optional LLMCallCache for memoizing llm_query calls. When provided, identical prompts
        will return cached responses instead of making redundant API calls. Useful for recursive
        workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
```
The cache parameter documentation should mention that caching is only supported for the 'local' environment type. Users trying to use cache with other environment types (modal, docker, daytona, prime, e2b) will silently have their cache ignored without any warning or error, which could be confusing.
Consider either:
- Adding a note in the docstring: "Note: caching is only supported for environment='local'"
- Raising a warning or error if cache is provided with a non-local environment
- Implementing cache support for other environment types (if feasible)
Suggested change:
```python
        workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
        Note: caching is only supported for environment='local'; for other environments, the cache is ignored.
```
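The second option (raising a warning) could be sketched as a small guard in the constructor path. The function name and signature below are illustrative, not part of this PR:

```python
import warnings


def validate_cache_environment(cache, environment: str) -> None:
    # Hypothetical guard (names are illustrative, not the PR's API):
    # surface the silent-ignore case instead of discarding a
    # user-supplied cache without notice.
    if cache is not None and environment != "local":
        warnings.warn(
            "cache is only supported for environment='local'; "
            f"it will be ignored for environment={environment!r}",
            stacklevel=2,
        )
```

Called from `RLM.__init__`, this would make the limitation visible at construction time rather than leaving users to discover a zero hit rate later.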
rlm/environments/local_repl.py
Outdated
```diff
 results_list = []
 for response in responses:
     if not response.success:
-        results.append(f"Error: {response.error}")
+        results_list.append(f"Error: {response.error}")
     else:
         self._pending_llm_calls.append(response.chat_completion)
-        results.append(response.chat_completion.response)
+        results_list.append(response.chat_completion.response)

-return results
+return results_list
```
Inconsistent variable naming: In the cached path (line 299), the results are stored in 'results', while in the non-cached path (line 336), they're stored in 'results_list'. Both represent the same thing (a list of string responses) and should use the same name for consistency.
Recommend using 'results' for both paths to improve code readability.
rlm/environments/local_repl.py
Outdated
```python
# Cache the successful response
self.cache.set(prompt, model, result)

return [r if r is not None else "Error: Unexpected cache failure" for r in results]
```
The error message "Error: Unexpected cache failure" on line 329 is a defensive check that should never occur in normal operation. However, if it does occur, the error message doesn't provide enough information for debugging.
Consider making this more informative, e.g., including the index: f"Error: Unexpected cache failure for prompt at index {i}". Alternatively, raise an AssertionError instead since this represents a bug in the cache logic if it ever occurs.
Suggested change:
```python
return [
    r if r is not None else f"Error: Unexpected cache failure for prompt at index {i}"
    for i, r in enumerate(results)
]
```
```diff
 # Define the actual API call function
 def make_api_call() -> str:
     request = LMRequest(prompt=prompt, model=model, depth=self.depth)
     response = send_lm_request(self.lm_handler_address, request)

     if not response.success:
-        return f"Error: {response.error}"
+        raise RuntimeError(response.error)

     self._pending_llm_calls.append(response.chat_completion)
     return response.chat_completion.response
```
When a cached response is returned, the RLMChatCompletion is not appended to _pending_llm_calls (line 267 only executes on cache miss). This means cached responses won't appear in the REPLResult.rlm_calls list returned by execute_code().
While this might be intentional (cached calls don't consume tokens/cost), it creates an inconsistency where some llm_query calls are tracked and others aren't. Consider either:
- Always tracking calls with a flag indicating cache hits (preserves full call history)
- Documenting this behavior clearly in the cache documentation
The current behavior could confuse users debugging their RLM execution traces.
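The first option (always tracking calls with a cache-hit flag) could look roughly like this. `TrackedCall` and `CallTrace` are illustrative names, not the PR's actual types:

```python
from dataclasses import dataclass


@dataclass
class TrackedCall:
    # Hypothetical record type: every llm_query appears in the trace,
    # with cached responses flagged rather than silently omitted.
    prompt: str
    response: str
    cache_hit: bool


class CallTrace:
    """Sketch of a trace that keeps cache hits visible for debugging."""

    def __init__(self):
        self.calls: list[TrackedCall] = []

    def record(self, prompt: str, response: str, cache_hit: bool) -> None:
        self.calls.append(TrackedCall(prompt, response, cache_hit))


trace = CallTrace()
trace.record("fib(3)", "2", cache_hit=False)
trace.record("fib(3)", "2", cache_hit=True)  # cached repeat still appears
```

With this shape, `REPLResult.rlm_calls` could filter on `cache_hit` when computing token cost while still showing the full call history.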
rlm/environments/local_repl.py
Outdated
```python
        self.cache.stats.record_hit()
    else:
        uncached_prompts.append((i, prompt))
        self.cache.stats.record_miss()
```
When cache.get() returns None due to an expired entry, it increments stats.expirations (cache.py:214), but this code then also calls record_miss() (line 310). This means expired entries are counted as both expirations AND misses, which could be confusing when analyzing cache statistics.
The semantics are unclear: should an expiration be considered a type of miss, or should they be mutually exclusive? Consider either:
- Documenting that expirations are a subset of misses
- Having get() return a sentinel value to distinguish "not found" from "expired" so they can be counted separately
- Not counting expirations as misses if they're already tracked separately
Suggested change:
```python
            self.cache.stats.record_hit()
        else:
            uncached_prompts.append((i, prompt))
```
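The sentinel approach from the second option can be sketched as a three-way lookup outcome, so hits, misses, and expirations become mutually exclusive counts. The names here are illustrative, and the `(created_at, value)` entry shape is an assumption about the cache's internals:

```python
from enum import Enum


class Lookup(Enum):
    HIT = "hit"
    MISS = "miss"
    EXPIRED = "expired"


def classify_entry(entry, ttl_seconds: float, now: float) -> Lookup:
    # entry is assumed to be (created_at, value) or None; this mirrors a
    # TTL check but reports expiration as its own outcome instead of
    # folding it into misses.
    if entry is None:
        return Lookup.MISS
    created_at, _value = entry
    if now - created_at > ttl_seconds:
        return Lookup.EXPIRED
    return Lookup.HIT
```

`get()` could then return `(value, Lookup)` and let the stats layer decide whether expirations also count as misses.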
```python
# Try cache first
cached = self.get(prompt, model)
if cached is not None:
    with self._lock:
        self._stats.hits += 1
    return cached, True

# Cache miss - make the call
with self._lock:
    self._stats.misses += 1

result = call_fn()

# Store in cache
self.set(prompt, model, result)

return result, False
```
Race condition: Multiple threads can make redundant API calls for the same prompt. The get_or_call method checks the cache (line 281), releases the lock, then makes the API call (line 291). If multiple threads call get_or_call with the same prompt simultaneously, they will all see a cache miss and all make API calls.
This "thundering herd" problem means the cache won't prevent redundant API calls in concurrent scenarios with identical prompts. A common solution is to use a "single-flight" pattern where only the first thread makes the call and others wait for its result. This can be implemented with per-key locks or condition variables.
While this may be acceptable depending on usage patterns, it should be documented as a known limitation if not fixed.
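A minimal single-flight sketch (standalone illustration, not the PR's API) shows the shape of the fix: the first caller for a key performs the call, and concurrent callers block on an event until the result is published:

```python
import threading


class SingleFlight:
    """Per-key coordination: one in-flight call per key; followers wait."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight: dict[str, threading.Event] = {}
        self._results: dict[str, str] = {}

    def do(self, key: str, call_fn):
        with self._lock:
            if key in self._results:
                return self._results[key]
            event = self._inflight.get(key)
            if event is None:
                # First caller for this key becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            result = call_fn()
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            event.set()
            return result
        # Followers wait for the leader to publish, then read the result.
        event.wait()
        with self._lock:
            return self._results[key]
```

This sketch assumes `call_fn` succeeds; a production version would also need to propagate leader failures to waiting followers and handle TTL-based invalidation.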
```python
hash1 = LLMCallCache._hash_key(unicode_prompt, "model")
hash2 = LLMCallCache._hash_key(unicode_prompt, "model")

assert hash1 == hash2
```
The test suite comprehensively tests the LLMCallCache in isolation, but there are no integration tests that verify the cache works correctly when integrated with RLM and LocalREPL.
Consider adding integration tests that:
- Create an RLM with a cache and verify identical prompts are cached
- Test that cache statistics are correctly tracked through actual RLM execution
- Verify the interaction between caching and _pending_llm_calls tracking
- Test that cached vs non-cached calls produce identical results from the user's perspective
Suggested change:
```python
assert hash1 == hash2


class FakeRLM:
    """Minimal RLM-like class to test integration with LLMCallCache.

    This simulates how an RLM might:
    - Use an LLMCallCache for prompt/model pairs.
    - Track pending LLM calls via a _pending_llm_calls set.
    """

    def __init__(self, cache: LLMCallCache):
        self.cache = cache
        self._pending_llm_calls = set()
        self._call_count = 0

    def llm_call(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate an LLM call that goes through the cache."""
        key = (prompt, model)
        self._pending_llm_calls.add(key)

        def do_call() -> str:
            # Underlying model call that should only run on cache misses.
            self._call_count += 1
            return f"response for {prompt!r} with model {model!r}"

        try:
            # get_or_call returns (result, was_cached); only the result
            # is surfaced to the caller.
            result, _was_cached = self.cache.get_or_call(prompt, model, do_call)
            return result
        finally:
            # Ensure pending calls tracking is always cleaned up.
            self._pending_llm_calls.discard(key)


class FakeLocalREPL:
    """Minimal LocalREPL-like wrapper around an RLM instance."""

    def __init__(self, rlm: FakeRLM):
        self._rlm = rlm

    def run_prompt(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate a user sending a prompt through a REPL."""
        return self._rlm.llm_call(prompt, model)


class TestCacheIntegrationWithRLMAndLocalREPL:
    """Integration-style tests for LLMCallCache with RLM and LocalREPL.

    These tests verify:
    1. Identical prompts are cached when going through an RLM and REPL.
    2. Cache statistics are updated through real execution paths.
    3. _pending_llm_calls interacts correctly with cached/non-cached calls.
    4. From the user's perspective, cached vs non-cached calls behave identically.
    """

    def test_identical_prompts_are_cached_via_rlm_and_repl(self):
        """Create an RLM with a cache and verify identical prompts are cached."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        prompt = "Explain caching."

        # First call should be a cache miss and invoke the underlying model.
        response1 = repl.run_prompt(prompt, model="gpt-4")
        # Second call with identical prompt/model should be served from cache.
        response2 = repl.run_prompt(prompt, model="gpt-4")

        # From the user's perspective, responses must be identical.
        assert response1 == response2
        # Underlying model should have been invoked only once.
        assert rlm._call_count == 1

        # Verify cache statistics reflect one miss (first call) and one hit (second).
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 1
        assert stats.hits == 1

    def test_pending_llm_calls_tracking_with_cache(self):
        """Verify interaction between caching and _pending_llm_calls tracking."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)

        prompt = "Track pending calls."
        model = "gpt-4"

        # On first call, we should see the key added then removed.
        response1 = rlm.llm_call(prompt, model=model)
        assert isinstance(response1, str)
        assert rlm._call_count == 1
        # After the call completes, there should be no pending calls.
        assert not rlm._pending_llm_calls

        # Second call should be served from cache; pending set still must be cleaned.
        response2 = rlm.llm_call(prompt, model=model)
        assert response2 == response1
        assert rlm._call_count == 1  # still only the original miss
        assert not rlm._pending_llm_calls

    def test_cached_and_non_cached_calls_identical_from_user_perspective(self):
        """Ensure cached vs non-cached calls produce identical visible results."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        # First prompt will have a cache miss on first call, hit on second.
        prompt_cached = "What is an LRU cache?"
        # Second prompt is only called once (always a miss).
        prompt_uncached = "Describe TTL-based expiration."

        # Calls that will be cached.
        cached_first = repl.run_prompt(prompt_cached)
        cached_second = repl.run_prompt(prompt_cached)
        # Call that is never repeated (always non-cached at the time of the call).
        uncached = repl.run_prompt(prompt_uncached)

        # User-visible behavior: repeated calls return the same response.
        assert cached_first == cached_second
        # Different prompts yield different responses.
        assert cached_first != uncached

        # Stats: two distinct prompts, three total calls, with one cache hit.
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 2  # first time for each distinct prompt
        assert stats.hits == 1  # second call for prompt_cached

        # Underlying model should have run exactly once per distinct prompt.
        assert rlm._call_count == 2
```
Force-pushed from 5c29380 to e43c7f9 (compare)
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
tests/test_cache.py
Outdated
```python
def test_expired_entry_returns_none(self):
    """Test that expired entries return None."""
    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL

    cache.set("prompt", "model", "response")

    # Immediate access should work
    assert cache.get("prompt", "model") == "response"

    # Wait for expiration
    time.sleep(0.15)

    # Should be expired
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self):
    """Test that expirations update statistics."""
    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL

    cache.set("prompt", "model", "response")

    time.sleep(0.1)

    cache.get("prompt", "model")  # Triggers expiration check
```
These TTL tests depend on very small sleep intervals (100–150ms), which can be flaky on slower/loaded CI runners. Prefer controlling time (e.g., monkeypatch time.time) or increasing the margin significantly so the test is deterministic.
Suggested change:
```python
def test_expired_entry_returns_none(self, monkeypatch):
    """Test that expired entries return None."""
    # Control time to make the test deterministic instead of relying on sleep.
    base_time = 1_000.0
    current_time = base_time

    def fake_time():
        return current_time

    # Patch the time used inside the cache implementation.
    monkeypatch.setattr("rlm.utils.cache.time.time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
    cache.set("prompt", "model", "response")

    # Immediate access should work at base_time.
    assert cache.get("prompt", "model") == "response"

    # Advance time beyond the TTL to trigger expiration.
    current_time = base_time + 0.2

    # Should be expired now.
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self, monkeypatch):
    """Test that expirations update statistics."""
    base_time = 2_000.0
    current_time = base_time

    def fake_time():
        return current_time

    # Patch the time used inside the cache implementation.
    monkeypatch.setattr("rlm.utils.cache.time.time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
    cache.set("prompt", "model", "response")

    # Advance time beyond the TTL so that the next access expires the entry.
    current_time = base_time + 0.1

    cache.get("prompt", "model")  # Triggers expiration check
```
tests/test_cache.py
Outdated
This TTL expiration test uses a 50ms TTL and sleeps 100ms; that tight timing can be flaky in CI. Consider using a mocked clock (monkeypatch time.time) or a much larger TTL/sleep delta to avoid intermittent failures.
Suggested change:
```python
def test_expired_entry_returns_none(self, monkeypatch):
    """Test that expired entries return None."""
    # Use a mocked clock to avoid flaky timing-based tests.
    fake_now = [1000.0]

    def fake_time() -> float:
        return fake_now[0]

    monkeypatch.setattr(time, "time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
    cache.set("prompt", "model", "response")

    # Immediate access should work
    assert cache.get("prompt", "model") == "response"

    # Advance time beyond the TTL to simulate expiration
    fake_now[0] += 0.11

    # Should be expired
    assert cache.get("prompt", "model") is None

def test_expiration_updates_stats(self, monkeypatch):
    """Test that expirations update statistics."""
    # Use a mocked clock to deterministically trigger expiration.
    fake_now = [2000.0]

    def fake_time() -> float:
        return fake_now[0]

    monkeypatch.setattr(time, "time", fake_time)

    cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
    cache.set("prompt", "model", "response")

    # Advance time beyond the TTL so the next get triggers expiration.
    fake_now[0] += 0.06

    cache.get("prompt", "model")  # Triggers expiration check
```
```python
# Normalize model to string
model_str = model if model else "__default__"

# Combine prompt and model for the key
content = f"{model_str}:{prompt}"

# Use SHA-256 for good distribution and collision resistance
return hashlib.sha256(content.encode("utf-8")).hexdigest()
```
_hash_key builds the hash input as f"{model_str}:{prompt}", which can collide when model contains ':' or prompt begins with ':' (e.g., model='a:' + prompt='b' equals model='a' + prompt=':b'). Use an unambiguous encoding (e.g., length-prefixing, '\0' separator, or hashing (model, prompt) as structured bytes/JSON) to avoid cache key collisions.
```python
# Evict oldest if at capacity
while len(self._cache) >= self.max_size:
    self._cache.popitem(last=False)
    self._stats.evictions += 1
```
LLMCallCache.set() can raise or loop incorrectly when max_size <= 0 (e.g., while len(_cache) >= max_size will be true on an empty cache). Consider validating max_size >= 1 in post_init (or treating <=0 as 'disable caching') to prevent KeyError/infinite eviction loops.
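A sketch of the validation (illustrative class, not the PR's code): rejecting non-positive sizes up front guarantees the eviction loop can never call `popitem()` on an empty `OrderedDict`:

```python
from collections import OrderedDict


class BoundedCache:
    def __init__(self, max_size: int):
        if max_size < 1:
            raise ValueError(f"max_size must be >= 1, got {max_size}")
        self.max_size = max_size
        self._cache: OrderedDict[str, str] = OrderedDict()

    def set(self, key: str, value: str) -> None:
        if key in self._cache:
            self._cache.move_to_end(key)
        self._cache[key] = value
        # Evict oldest entries; the constructor guard ensures this
        # terminates and never pops from an empty dict.
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)
```

Alternatively, `max_size <= 0` could be defined to mean "caching disabled", but that choice should be explicit and documented either way.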
rlm/utils/cache.py
Outdated
```python
This is the primary method for cache integration. It handles the full
cache-check -> call -> cache-store flow atomically.
```
The get_or_call docstring claims the cache-check -> call -> cache-store flow is atomic, but the implementation releases the lock before call_fn() and doesn't coordinate concurrent misses. Multiple threads can call the API for the same (prompt, model) simultaneously, defeating memoization. Either adjust the docstring or implement single-flight behavior (per-key lock/event/future) so only one in-flight call populates the cache.
Suggested change:
```python
This is the primary method for cache integration. It performs the full
cache-check -> call -> cache-store flow, but this sequence is not
guaranteed to be atomic across threads. Concurrent cache misses for
the same (prompt, model) may result in multiple calls to ``call_fn``.
```
```python
def stats(self) -> CacheStats:
    """Get cache statistics (read-only view)."""
    return self._stats
```
stats is documented as a "read-only view" but returns the mutable internal CacheStats object directly, allowing callers to mutate counters without LLMCallCache's lock (breaking the thread-safety guarantee). Consider returning a copy/snapshot (e.g., to_dict()) or providing locked accessor methods for reading/updating stats.
Suggested change:
```python
def stats(self) -> dict[str, Any]:
    """Get cache statistics as a read-only snapshot."""
    return self._stats.to_dict()
```
```python
for i, prompt in enumerate(prompts):
    cached = self.cache.get(prompt, model)
    if cached is not None:
        results[i] = cached
        self.cache.stats.record_hit()
    else:
        uncached_prompts.append((i, prompt))
        self.cache.stats.record_miss()
```
In the caching path, _llm_query_batched mutates cache.stats via record_hit()/record_miss() without any synchronization. Since LLMCallCache claims thread safety, these counter updates should be performed under the cache's lock (e.g., via a cache method that records hits/misses internally), otherwise concurrent batched calls can corrupt stats.
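One way to close this gap is to make hit/miss recording a method on the cache that takes its own lock, rather than exposing mutable counters to callers. The names below are illustrative:

```python
import threading
from dataclasses import dataclass


@dataclass
class _Stats:
    hits: int = 0
    misses: int = 0


class LockedStats:
    """Sketch: all counter updates go through one method under the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stats = _Stats()

    def record(self, hit: bool) -> None:
        with self._lock:
            if hit:
                self._stats.hits += 1
            else:
                self._stats.misses += 1

    def snapshot(self) -> _Stats:
        # Return a copy so callers cannot mutate the live counters.
        with self._lock:
            return _Stats(self._stats.hits, self._stats.misses)
```

`_llm_query_batched` would then call `cache.stats.record(hit=True)` style methods instead of touching `hits`/`misses` fields directly.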
```python
    else:
        return make_api_call()
except Exception as e:
    return f"Error: LM query failed - {e}"
```
_llm_query returns errors as "Error: LM query failed - ..." while _llm_query_batched returns "Error: {response.error}" for per-prompt failures. This inconsistency makes error handling brittle for callers; consider standardizing the error format between single and batched query paths (including when using the cache).
Suggested change:
```python
return f"Error: {e}"
```
Implements memoization for llm_query and llm_query_batched to cache identical prompts and avoid redundant API calls during recursive RLM execution.

Problem: Recursive workloads (e.g., Fibonacci-like decomposition) recompute identical subproblems, causing exponential token cost.

Solution: Thread-safe LRU cache keyed on (prompt, model) with optional TTL.

Usage:
```python
from rlm import RLM
from rlm.utils import LLMCallCache

cache = LLMCallCache(max_size=1000)
rlm = RLM(backend='openai', cache=cache)
```

Changes:
- rlm/utils/cache.py: Core cache implementation
- rlm/environments/local_repl.py: Integration with _llm_query
- rlm/core/rlm.py: cache parameter (local env only)
- tests/test_cache.py: 47 unit + integration tests

Behavior:
- Opt-in (no breaking changes)
- Zero changes to RLM recursion logic
- Stats available via cache.stats.hit_rate

Closes alexzhang13#82
Force-pushed from e43c7f9 to 2e5be6a (compare)
The bulk of the diff is a self-contained cache module (422 lines) and its test suite (750 lines). These have zero dependencies on RLM internals and could live as a separate package. The actual integration is surgical: one new parameter on RLM.__init__, a cache check in two LocalREPL methods, and some exports. That's it.