feat: Add LLM call memoization for llm_query (Issue #82) #118

Open
zamal-db wants to merge 1 commit into alexzhang13:main from zamal-db:feature/llm-call-cache

Conversation

@zamal-db

Implements memoization for llm_query and llm_query_batched to cache identical prompts and avoid redundant API calls during recursive RLM execution.

Problem: Recursive workloads (e.g., Fibonacci-like decomposition) recompute identical subproblems, causing exponential token cost.

Solution: Thread-safe LRU cache keyed on (prompt, model) with optional TTL.

Usage:

from rlm import RLM
from rlm.utils import LLMCallCache

cache = LLMCallCache(max_size=1000)
rlm = RLM(backend="openai", cache=cache)
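For intuition, the cache described above can be sketched in a few lines. This is an illustrative stand-in, not the PR's implementation: the class name `TinyLLMCache` is hypothetical, and the real `LLMCallCache` tracks statistics and more.

```python
import threading
import time
from collections import OrderedDict

class TinyLLMCache:
    """Illustrative thread-safe LRU cache keyed on (prompt, model), optional TTL."""

    def __init__(self, max_size=1000, ttl_seconds=None):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._lock = threading.Lock()
        self._entries = OrderedDict()  # (prompt, model) -> (response, stored_at)

    def get(self, prompt, model):
        with self._lock:
            key = (prompt, model)
            if key not in self._entries:
                return None
            response, stored_at = self._entries[key]
            if self.ttl_seconds is not None and time.time() - stored_at > self.ttl_seconds:
                del self._entries[key]  # entry expired
                return None
            self._entries.move_to_end(key)  # mark most recently used
            return response

    def set(self, prompt, model, response):
        with self._lock:
            self._entries[(prompt, model)] = (response, time.time())
            self._entries.move_to_end((prompt, model))
            while len(self._entries) > self.max_size:
                self._entries.popitem(last=False)  # evict least recently used
```

The OrderedDict gives both O(1) lookup and cheap LRU ordering, which is the standard trick behind this kind of memoization layer.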

Copilot AI review requested due to automatic review settings February 21, 2026 02:22

Copilot AI left a comment


Pull request overview

This PR implements LLM call memoization to reduce redundant API calls during recursive RLM execution, addressing Issue #82. The implementation provides a thread-safe LRU cache with optional TTL for llm_query and llm_query_batched calls.

Changes:

  • Added comprehensive LLMCallCache implementation with LRU eviction and optional TTL
  • Integrated cache support into RLM and LocalREPL for the "local" environment type
  • Added extensive unit tests covering cache operations, thread safety, LRU behavior, and TTL expiration

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

File — Description
rlm/utils/cache.py — New cache implementation with the LLMCallCache, CacheEntry, and CacheStats classes plus a factory function
rlm/utils/__init__.py — Exports the cache classes for the public API
rlm/core/rlm.py — Adds a cache parameter to the RLM constructor and passes it to local environments
rlm/environments/local_repl.py — Integrates the cache into the _llm_query and _llm_query_batched methods
tests/test_cache.py — Comprehensive test suite for cache functionality

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

message token count reaches this fraction of the model context limit (default 0.85).
cache: Optional LLMCallCache for memoizing llm_query calls. When provided, identical prompts
will return cached responses instead of making redundant API calls. Useful for recursive
workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).

Copilot AI Feb 21, 2026


The cache parameter documentation should mention that caching is only supported for the 'local' environment type. Users trying to use cache with other environment types (modal, docker, daytona, prime, e2b) will silently have their cache ignored without any warning or error, which could be confusing.

Consider either:

  1. Adding a note in the docstring: "Note: caching is only supported for environment='local'"
  2. Raising a warning or error if cache is provided with a non-local environment
  3. Implementing cache support for other environment types (if feasible)
Suggested change
workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
workloads with overlapping subproblems (e.g., Fibonacci-like decomposition).
Note: caching is only supported for environment='local'; for other environments, the cache is ignored.
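Option 2 could be a small guard invoked from the RLM constructor. The sketch below is hypothetical — the helper name and call site are not part of the PR:

```python
import warnings

def warn_if_cache_ignored(environment: str, cache) -> None:
    """Hypothetical guard: warn when a cache is supplied but will be ignored."""
    if cache is not None and environment != "local":
        warnings.warn(
            f"cache is only supported for environment='local'; "
            f"it will be ignored for environment={environment!r}",
            UserWarning,
            stacklevel=2,
        )
```

A warning (rather than an error) keeps the parameter forward-compatible if caching is later added to other environment types.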

Comment on lines +336 to +344
    results_list = []
    for response in responses:
        if not response.success:
-           results.append(f"Error: {response.error}")
+           results_list.append(f"Error: {response.error}")
        else:
            self._pending_llm_calls.append(response.chat_completion)
-           results.append(response.chat_completion.response)
+           results_list.append(response.chat_completion.response)

-   return results
+   return results_list

Copilot AI Feb 21, 2026


Inconsistent variable naming: In the cached path (line 299), the results are stored in 'results', while in the non-cached path (line 336), they're stored in 'results_list'. Both represent the same thing (a list of string responses) and should use the same name for consistency.

Recommend using 'results' for both paths to improve code readability.

# Cache the successful response
self.cache.set(prompt, model, result)

return [r if r is not None else "Error: Unexpected cache failure" for r in results]

Copilot AI Feb 21, 2026


The error message "Error: Unexpected cache failure" on line 329 is a defensive check that should never occur in normal operation. However, if it does occur, the error message doesn't provide enough information for debugging.

Consider making this more informative, e.g., including the index: f"Error: Unexpected cache failure for prompt at index {i}". Alternatively, raise an AssertionError instead since this represents a bug in the cache logic if it ever occurs.

Suggested change
-   return [r if r is not None else "Error: Unexpected cache failure" for r in results]
+   return [
+       r if r is not None else f"Error: Unexpected cache failure for prompt at index {i}"
+       for i, r in enumerate(results)
+   ]

Comment on lines +259 to 268
    # Define the actual API call function
    def make_api_call() -> str:
        request = LMRequest(prompt=prompt, model=model, depth=self.depth)
        response = send_lm_request(self.lm_handler_address, request)

        if not response.success:
-           return f"Error: {response.error}"
+           raise RuntimeError(response.error)

        self._pending_llm_calls.append(response.chat_completion)
        return response.chat_completion.response

Copilot AI Feb 21, 2026


When a cached response is returned, the RLMChatCompletion is not appended to _pending_llm_calls (line 267 only executes on cache miss). This means cached responses won't appear in the REPLResult.rlm_calls list returned by execute_code().

While this might be intentional (cached calls don't consume tokens/cost), it creates an inconsistency where some llm_query calls are tracked and others aren't. Consider either:

  1. Always tracking calls with a flag indicating cache hits (preserves full call history)
  2. Documenting this behavior clearly in the cache documentation

The current behavior could confuse users debugging their RLM execution traces.
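Option 1 — tracking every call with a cache-hit flag — might look like the following sketch. `TrackedCall` and `llm_query_tracked` are illustrative names, not the PR's API:

```python
from dataclasses import dataclass

@dataclass
class TrackedCall:
    """Illustrative record of one llm_query call, flagged when served from cache."""
    prompt: str
    response: str
    cached: bool

def llm_query_tracked(prompt, cache, calls, call_fn):
    """Sketch: append every call to `calls`, whether it hit the cache or not."""
    if prompt in cache:
        calls.append(TrackedCall(prompt, cache[prompt], cached=True))
        return cache[prompt]
    response = call_fn(prompt)
    cache[prompt] = response
    calls.append(TrackedCall(prompt, response, cached=False))
    return response
```

With a flag like this, execution traces stay complete while still making it obvious which calls cost tokens.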

Comment on lines +307 to +310
            self.cache.stats.record_hit()
        else:
            uncached_prompts.append((i, prompt))
            self.cache.stats.record_miss()

Copilot AI Feb 21, 2026


When cache.get() returns None due to an expired entry, it increments stats.expirations (cache.py:214), but this code then also calls record_miss() (line 310). This means expired entries are counted as both expirations AND misses, which could be confusing when analyzing cache statistics.

The semantics are unclear: should an expiration be considered a type of miss, or should they be mutually exclusive? Consider either:

  1. Documenting that expirations are a subset of misses
  2. Having get() return a sentinel value to distinguish "not found" from "expired" so they can be counted separately
  3. Not counting expirations as misses if they're already tracked separately
Suggested change
-           self.cache.stats.record_hit()
-       else:
-           uncached_prompts.append((i, prompt))
-           self.cache.stats.record_miss()
+       else:
+           uncached_prompts.append((i, prompt))

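Option 2 — a sentinel distinguishing "not found" from "expired" — could be sketched like this. All names here are hypothetical, not the PR's API:

```python
import time
from enum import Enum

class Miss(Enum):
    """Sentinels so callers can count plain misses and expirations separately."""
    NOT_FOUND = "not_found"
    EXPIRED = "expired"

def cache_lookup(entries, key, ttl_seconds, now=None):
    """Sketch: return the cached value, Miss.NOT_FOUND, or Miss.EXPIRED."""
    now = time.time() if now is None else now
    if key not in entries:
        return Miss.NOT_FOUND
    value, stored_at = entries[key]
    if now - stored_at > ttl_seconds:
        del entries[key]  # drop the stale entry
        return Miss.EXPIRED
    return value
```

Returning a sentinel instead of None lets the caller decide whether an expiration also counts as a miss, resolving the double-counting ambiguity.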
Comment on lines +280 to +296
    # Try cache first
    cached = self.get(prompt, model)
    if cached is not None:
        with self._lock:
            self._stats.hits += 1
        return cached, True

    # Cache miss - make the call
    with self._lock:
        self._stats.misses += 1

    result = call_fn()

    # Store in cache
    self.set(prompt, model, result)

    return result, False

Copilot AI Feb 21, 2026


Race condition: Multiple threads can make redundant API calls for the same prompt. The get_or_call method checks the cache (line 281), releases the lock, then makes the API call (line 291). If multiple threads call get_or_call with the same prompt simultaneously, they will all see a cache miss and all make API calls.

This "thundering herd" problem means the cache won't prevent redundant API calls in concurrent scenarios with identical prompts. A common solution is to use a "single-flight" pattern where only the first thread makes the call and others wait for its result. This can be implemented with per-key locks or condition variables.

While this may be acceptable depending on usage patterns, it should be documented as a known limitation if not fixed.
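A minimal single-flight guard along those lines could look like the sketch below. This is illustrative only: the class name is hypothetical, and error handling is deliberately omitted (if `call_fn` raises, followers would block forever — a real implementation must propagate the failure and clear the in-flight entry).

```python
import threading

class SingleFlight:
    """Sketch of a single-flight guard: concurrent callers for one key share a call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}
        self._inflight = {}  # key -> Event set once the leader stores a result

    def do(self, key, call_fn):
        with self._lock:
            if key in self._results:
                return self._results[key]  # already computed
            event = self._inflight.get(key)
            if event is None:
                # First caller for this key becomes the leader.
                event = threading.Event()
                self._inflight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            result = call_fn()  # executed outside the lock, exactly once per key
            with self._lock:
                self._results[key] = result
                del self._inflight[key]
            event.set()
            return result
        event.wait()  # follower: wait for the leader's result
        with self._lock:
            return self._results[key]
```

This is the same pattern Go popularized as `singleflight`: followers block on the leader's Event instead of issuing duplicate API calls.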

hash1 = LLMCallCache._hash_key(unicode_prompt, "model")
hash2 = LLMCallCache._hash_key(unicode_prompt, "model")

assert hash1 == hash2

Copilot AI Feb 21, 2026


The test suite comprehensively tests the LLMCallCache in isolation, but there are no integration tests that verify the cache works correctly when integrated with RLM and LocalREPL.

Consider adding integration tests that:

  1. Create an RLM with a cache and verify identical prompts are cached
  2. Test that cache statistics are correctly tracked through actual RLM execution
  3. Verify the interaction between caching and _pending_llm_calls tracking
  4. Test that cached vs non-cached calls produce identical results from the user's perspective
Suggested change
        assert hash1 == hash2

        assert hash1 == hash2


class FakeRLM:
    """Minimal RLM-like class to test integration with LLMCallCache.

    This simulates how an RLM might:
    - Use an LLMCallCache for prompt/model pairs.
    - Track pending LLM calls via a _pending_llm_calls set.
    """

    def __init__(self, cache: LLMCallCache):
        self.cache = cache
        self._pending_llm_calls = set()
        self._call_count = 0

    def llm_call(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate an LLM call that goes through the cache."""
        key = (prompt, model)
        self._pending_llm_calls.add(key)

        def do_call() -> str:
            # Underlying model call that should only run on cache misses.
            self._call_count += 1
            return f"response for {prompt!r} with model {model!r}"

        try:
            # get_or_call returns (result, was_cached); unpack so callers get a str.
            result, _was_cached = self.cache.get_or_call(prompt, model, do_call)
            return result
        finally:
            # Ensure pending calls tracking is always cleaned up.
            self._pending_llm_calls.discard(key)


class FakeLocalREPL:
    """Minimal LocalREPL-like wrapper around an RLM instance."""

    def __init__(self, rlm: FakeRLM):
        self._rlm = rlm

    def run_prompt(self, prompt: str, model: str = "gpt-4") -> str:
        """Simulate a user sending a prompt through a REPL."""
        return self._rlm.llm_call(prompt, model)


class TestCacheIntegrationWithRLMAndLocalREPL:
    """Integration-style tests for LLMCallCache with RLM and LocalREPL.

    These tests verify:
    1. Identical prompts are cached when going through an RLM and REPL.
    2. Cache statistics are updated through real execution paths.
    3. _pending_llm_calls interacts correctly with cached/non-cached calls.
    4. From the user's perspective, cached vs non-cached calls behave identically.
    """

    def test_identical_prompts_are_cached_via_rlm_and_repl(self):
        """Create an RLM with a cache and verify identical prompts are cached."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        prompt = "Explain caching."

        # First call should be a cache miss and invoke the underlying model.
        response1 = repl.run_prompt(prompt, model="gpt-4")
        # Second call with identical prompt/model should be served from cache.
        response2 = repl.run_prompt(prompt, model="gpt-4")

        # From the user's perspective, responses must be identical.
        assert response1 == response2
        # Underlying model should have been invoked only once.
        assert rlm._call_count == 1

        # Verify cache statistics reflect one miss (first call) and one hit (second).
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 1
        assert stats.hits == 1

    def test_pending_llm_calls_tracking_with_cache(self):
        """Verify interaction between caching and _pending_llm_calls tracking."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)

        prompt = "Track pending calls."
        model = "gpt-4"

        # On first call, we should see the key added then removed.
        response1 = rlm.llm_call(prompt, model=model)
        assert isinstance(response1, str)
        assert rlm._call_count == 1
        # After the call completes, there should be no pending calls.
        assert not rlm._pending_llm_calls

        # Second call should be served from cache; pending set still must be cleaned.
        response2 = rlm.llm_call(prompt, model=model)
        assert response2 == response1
        assert rlm._call_count == 1  # still only the original miss
        assert not rlm._pending_llm_calls

    def test_cached_and_non_cached_calls_identical_from_user_perspective(self):
        """Ensure cached vs non-cached calls produce identical visible results."""
        cache = create_cache(enabled=True, max_size=10)
        rlm = FakeRLM(cache)
        repl = FakeLocalREPL(rlm)

        # First prompt will have a cache miss on first call, hit on second.
        prompt_cached = "What is an LRU cache?"
        # Second prompt is only called once (always a miss).
        prompt_uncached = "Describe TTL-based expiration."

        # Calls that will be cached.
        cached_first = repl.run_prompt(prompt_cached)
        cached_second = repl.run_prompt(prompt_cached)
        # Call that is never repeated (always non-cached at the time of the call).
        uncached = repl.run_prompt(prompt_uncached)

        # User-visible behavior: repeated calls return the same response.
        assert cached_first == cached_second
        # Different prompts yield different responses.
        assert cached_first != uncached

        # Stats: two distinct prompts, three total calls, with one cache hit.
        stats: CacheStats = cache.get_stats()
        assert stats.misses == 2  # first time for each distinct prompt
        assert stats.hits == 1  # second call for prompt_cached

        # Underlying model should have run exactly once per distinct prompt.
        assert rlm._call_count == 2

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.



Comment on lines +292 to +316
    def test_expired_entry_returns_none(self):
        """Test that expired entries return None."""
        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL

        cache.set("prompt", "model", "response")

        # Immediate access should work
        assert cache.get("prompt", "model") == "response"

        # Wait for expiration
        time.sleep(0.15)

        # Should be expired
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self):
        """Test that expirations update statistics."""
        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL

        cache.set("prompt", "model", "response")

        time.sleep(0.1)

        cache.get("prompt", "model")  # Triggers expiration check


Copilot AI Feb 21, 2026


These TTL tests depend on very small sleep intervals (100–150ms), which can be flaky on slower/loaded CI runners. Prefer controlling time (e.g., monkeypatch time.time) or increasing the margin significantly so the test is deterministic.

Suggested change
    def test_expired_entry_returns_none(self):
        """Test that expired entries return None."""
        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
        cache.set("prompt", "model", "response")
        # Immediate access should work
        assert cache.get("prompt", "model") == "response"
        # Wait for expiration
        time.sleep(0.15)
        # Should be expired
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self):
        """Test that expirations update statistics."""
        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
        cache.set("prompt", "model", "response")
        time.sleep(0.1)
        cache.get("prompt", "model")  # Triggers expiration check

    def test_expired_entry_returns_none(self, monkeypatch):
        """Test that expired entries return None."""
        # Control time to make the test deterministic instead of relying on sleep.
        base_time = 1_000.0
        current_time = base_time

        def fake_time():
            return current_time

        # Patch time.time so the cache sees the fake clock.
        monkeypatch.setattr(time, "time", fake_time)

        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
        cache.set("prompt", "model", "response")

        # Immediate access should work at base_time.
        assert cache.get("prompt", "model") == "response"

        # Advance time beyond the TTL to trigger expiration.
        current_time = base_time + 0.2

        # Should be expired now.
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self, monkeypatch):
        """Test that expirations update statistics."""
        base_time = 2_000.0
        current_time = base_time

        def fake_time():
            return current_time

        # Patch time.time so the cache sees the fake clock.
        monkeypatch.setattr(time, "time", fake_time)

        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
        cache.set("prompt", "model", "response")

        # Advance time beyond the TTL so that the next access expires the entry.
        current_time = base_time + 0.1

        cache.get("prompt", "model")  # Triggers expiration check

Comment on lines +292 to +316
    def test_expired_entry_returns_none(self):
        """Test that expired entries return None."""
        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL

        cache.set("prompt", "model", "response")

        # Immediate access should work
        assert cache.get("prompt", "model") == "response"

        # Wait for expiration
        time.sleep(0.15)

        # Should be expired
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self):
        """Test that expirations update statistics."""
        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL

        cache.set("prompt", "model", "response")

        time.sleep(0.1)

        cache.get("prompt", "model")  # Triggers expiration check


Copilot AI Feb 21, 2026


This TTL expiration test uses a 50ms TTL and sleeps 100ms; that tight timing can be flaky in CI. Consider using a mocked clock (monkeypatch time.time) or a much larger TTL/sleep delta to avoid intermittent failures.

Suggested change
    def test_expired_entry_returns_none(self):
        """Test that expired entries return None."""
        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
        cache.set("prompt", "model", "response")
        # Immediate access should work
        assert cache.get("prompt", "model") == "response"
        # Wait for expiration
        time.sleep(0.15)
        # Should be expired
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self):
        """Test that expirations update statistics."""
        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
        cache.set("prompt", "model", "response")
        time.sleep(0.1)
        cache.get("prompt", "model")  # Triggers expiration check

    def test_expired_entry_returns_none(self, monkeypatch):
        """Test that expired entries return None."""
        # Use a mocked clock to avoid flaky timing-based tests.
        fake_now = [1000.0]

        def fake_time() -> float:
            return fake_now[0]

        monkeypatch.setattr(time, "time", fake_time)

        cache = LLMCallCache(ttl_seconds=0.1)  # 100ms TTL
        cache.set("prompt", "model", "response")

        # Immediate access should work
        assert cache.get("prompt", "model") == "response"

        # Advance time beyond the TTL to simulate expiration
        fake_now[0] += 0.11

        # Should be expired
        assert cache.get("prompt", "model") is None

    def test_expiration_updates_stats(self, monkeypatch):
        """Test that expirations update statistics."""
        # Use a mocked clock to deterministically trigger expiration.
        fake_now = [2000.0]

        def fake_time() -> float:
            return fake_now[0]

        monkeypatch.setattr(time, "time", fake_time)

        cache = LLMCallCache(ttl_seconds=0.05)  # 50ms TTL
        cache.set("prompt", "model", "response")

        # Advance time beyond the TTL so the next get triggers expiration.
        fake_now[0] += 0.06

        cache.get("prompt", "model")  # Triggers expiration check

Comment on lines +179 to +186
# Normalize model to string
model_str = model if model else "__default__"

# Combine prompt and model for the key
content = f"{model_str}:{prompt}"

# Use SHA-256 for good distribution and collision resistance
return hashlib.sha256(content.encode("utf-8")).hexdigest()

Copilot AI Feb 21, 2026


_hash_key builds the hash input as f"{model_str}:{prompt}", which can collide when model contains ':' or prompt begins with ':' (e.g., model='a:' + prompt='b' equals model='a' + prompt=':b'). Use an unambiguous encoding (e.g., length-prefixing, '\0' separator, or hashing (model, prompt) as structured bytes/JSON) to avoid cache key collisions.
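A length-prefixed encoding removes the ambiguity. The sketch below is illustrative (it assumes the same SHA-256 hashing and `__default__` fallback as the PR, but the function name is hypothetical):

```python
import hashlib
from typing import Optional

def hash_key(prompt: str, model: Optional[str]) -> str:
    """Length-prefix each field so ('a:', 'b') and ('a', ':b') hash differently."""
    model_str = model if model else "__default__"
    parts = []
    for field in (model_str, prompt):
        data = field.encode("utf-8")
        # "<byte-length>:<bytes>" makes the field boundary unambiguous.
        parts.append(str(len(data)).encode("ascii") + b":" + data)
    return hashlib.sha256(b"".join(parts)).hexdigest()
```

With the length prefix, no choice of prompt or model content can shift the field boundary, so the collision described above cannot occur.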

Comment on lines +241 to +244
    # Evict oldest if at capacity
    while len(self._cache) >= self.max_size:
        self._cache.popitem(last=False)
        self._stats.evictions += 1

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLMCallCache.set() can raise or loop incorrectly when max_size <= 0 (e.g., while len(_cache) >= max_size will be true on an empty cache). Consider validating max_size >= 1 in post_init (or treating <=0 as 'disable caching') to prevent KeyError/infinite eviction loops.
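Validating in `__post_init__` could look like the following sketch; the dataclass and field names are illustrative, not the PR's actual classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheConfig:
    """Illustrative config that rejects unusable sizes at construction time."""
    max_size: int = 1000
    ttl_seconds: Optional[float] = None

    def __post_init__(self) -> None:
        # Fail fast: a size below 1 would make the eviction loop misbehave.
        if self.max_size < 1:
            raise ValueError(f"max_size must be >= 1, got {self.max_size}")
```

Failing at construction time surfaces the misconfiguration immediately instead of as a confusing KeyError deep inside set().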

Comment on lines +263 to +264
This is the primary method for cache integration. It handles the full
cache-check -> call -> cache-store flow atomically.

Copilot AI Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_or_call docstring claims the cache-check -> call -> cache-store flow is atomic, but the implementation releases the lock before call_fn() and doesn't coordinate concurrent misses. Multiple threads can call the API for the same (prompt, model) simultaneously, defeating memoization. Either adjust the docstring or implement single-flight behavior (per-key lock/event/future) so only one in-flight call populates the cache.

Suggested change
This is the primary method for cache integration. It handles the full
cache-check -> call -> cache-store flow atomically.
This is the primary method for cache integration. It performs the full
cache-check -> call -> cache-store flow, but this sequence is not
guaranteed to be atomic across threads. Concurrent cache misses for
the same (prompt, model) may result in multiple calls to ``call_fn``.

Comment on lines +330 to +332
    def stats(self) -> CacheStats:
        """Get cache statistics (read-only view)."""
        return self._stats

Copilot AI Feb 21, 2026


stats is documented as a "read-only view" but returns the mutable internal CacheStats object directly, allowing callers to mutate counters without LLMCallCache's lock (breaking the thread-safety guarantee). Consider returning a copy/snapshot (e.g., to_dict()) or providing locked accessor methods for reading/updating stats.

Suggested change
    def stats(self) -> CacheStats:
        """Get cache statistics (read-only view)."""
        return self._stats
    def stats(self) -> dict[str, Any]:
        """Get cache statistics as a read-only snapshot."""
        return self._stats.to_dict()

Comment on lines +303 to +311
    for i, prompt in enumerate(prompts):
        cached = self.cache.get(prompt, model)
        if cached is not None:
            results[i] = cached
            self.cache.stats.record_hit()
        else:
            uncached_prompts.append((i, prompt))
            self.cache.stats.record_miss()


Copilot AI Feb 21, 2026


In the caching path, _llm_query_batched mutates cache.stats via record_hit()/record_miss() without any synchronization. Since LLMCallCache claims thread safety, these counter updates should be performed under the cache's lock (e.g., via a cache method that records hits/misses internally), otherwise concurrent batched calls can corrupt stats.
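Recording the counters under the cache's own lock might look like this sketch (class and method names are hypothetical, though they mirror the PR's record_hit/record_miss):

```python
import threading

class LockedStats:
    """Illustrative counters mutated only via methods that hold a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def record_hit(self):
        with self._lock:
            self.hits += 1  # read-modify-write guarded by the lock

    def record_miss(self):
        with self._lock:
            self.misses += 1
```

Routing all mutations through locked methods keeps the thread-safety guarantee intact even when callers like _llm_query_batched update stats from multiple threads.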

        else:
            return make_api_call()
    except Exception as e:
        return f"Error: LM query failed - {e}"

Copilot AI Feb 21, 2026


_llm_query returns errors as "Error: LM query failed - ..." while _llm_query_batched returns "Error: {response.error}" for per-prompt failures. This inconsistency makes error handling brittle for callers; consider standardizing the error format between single and batched query paths (including when using the cache).

Suggested change
return f"Error: LM query failed - {e}"
return f"Error: {e}"

Implements memoization for llm_query and llm_query_batched to cache identical
prompts and avoid redundant API calls during recursive RLM execution.

Problem: Recursive workloads (e.g., Fibonacci-like decomposition) recompute
identical subproblems, causing exponential token cost.

Solution: Thread-safe LRU cache keyed on (prompt, model) with optional TTL.

Usage:
  from rlm import RLM
  from rlm.utils import LLMCallCache

  cache = LLMCallCache(max_size=1000)
  rlm = RLM(backend='openai', cache=cache)

Changes:
- rlm/utils/cache.py: Core cache implementation
- rlm/environments/local_repl.py: Integration with _llm_query
- rlm/core/rlm.py: cache parameter (local env only)
- tests/test_cache.py: 47 unit + integration tests

Behavior:
- Opt-in (no breaking changes)
- Zero changes to RLM recursion logic
- Stats available via cache.stats.hit_rate

Closes alexzhang13#82
@zamal-db zamal-db force-pushed the feature/llm-call-cache branch from e43c7f9 to 2e5be6a on February 21, 2026 02:41
@zamal-db
Author

zamal-db commented Feb 21, 2026

The bulk is a self-contained cache module (422 lines) and its test suite (750 lines). These have zero dependencies on RLM internals and could live as a separate package.

The actual integration is surgical: one new parameter on RLM.__init__, a cache check in two LocalREPL methods, and some exports. That's it.
Behavior is 100% opt-in with cache=None default. Zero breaking changes. All 305 existing tests pass.
