
Improve LLM reranker performance with configurable concurrency #4303

@dishafaujdar

Description

🚀 The feature

This PR introduces optional concurrency inside LLMReranker.rerank() to reduce end-to-end search latency when scoring large candidate sets. The change is fully backwards compatible — sequential behavior is preserved by default.

Problem:

LLMReranker.rerank() currently scores documents sequentially — one synchronous LLM call per document, run one after another in a for loop. Since each (query, document) scoring call is independent, this creates unnecessary latency that grows linearly with candidate set size.

For LLM providers with non-trivial per-call latency (200ms–1s+), scoring 20–100 candidates becomes a significant bottleneck in the search path.

Affected paths:

  • Sync: Memory.search() calls self.reranker.rerank() directly
  • Async: AsyncMemory.search() wraps it in await asyncio.to_thread(self.reranker.rerank, ...)
    In both cases, the internal sequential loop blocks the entire reranking step.

Proposed Change:

Introduce an optional max_concurrency field in LLMRerankerConfig that controls a ThreadPoolExecutor inside rerank():

  1. New config field in LLMRerankerConfig, e.g.:
  • max_concurrency: Optional[int] = None
  • None or <= 1 → preserve current behavior (fully sequential).
  • > 1 → use a ThreadPoolExecutor(max_workers=max_concurrency).
  2. Implementation sketch:
  • Keep rerank as a synchronous function.
  • Inside rerank:
    • Build (index, doc) pairs.
    • In a ThreadPoolExecutor, submit one job per doc:
      • job = build prompt, call llm.generate_response, parse score via _extract_score.
    • Collect results via as_completed, map back to indices.
    • On per-doc failure, log and assign a neutral score (current behavior is 0.5).
    • Attach rerank_score to each doc, sort descending, apply top_k as today.
  3. Async compatibility:
  • The AsyncMemory.search flow can continue to call await asyncio.to_thread(self.reranker.rerank, ...).
  • From the async caller’s perspective it’s still a single blocking call, but internally the reranker can leverage a small thread pool for LLM calls.
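The steps above can be sketched roughly as follows. This is a hypothetical sketch, not the actual mem0 implementation: the names `generate_response`, `_extract_score`, `rerank_score`, and `NEUTRAL_SCORE` follow this issue's description of the existing code, and `_extract_score` is a simplified stand-in for the real parser.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Optional

logger = logging.getLogger(__name__)

NEUTRAL_SCORE = 0.5  # fallback used when a per-document LLM call fails


def _extract_score(response: str) -> float:
    """Simplified stand-in for the reranker's score parser."""
    return max(0.0, min(1.0, float(response.strip())))


def _score_one(llm: Any, query: str, doc: dict) -> float:
    """Build the prompt, call the LLM, and parse the score for one document."""
    prompt = f"Query: {query}\nDocument: {doc['text']}\nScore 0-1:"
    try:
        return _extract_score(llm.generate_response(prompt))
    except Exception:
        logger.warning("Scoring failed for a document; assigning neutral score")
        return NEUTRAL_SCORE


def rerank(llm: Any, query: str, docs: list, top_k: int,
           max_concurrency: Optional[int] = None) -> list:
    """Score docs against query; parallelize LLM calls when max_concurrency > 1."""
    if max_concurrency is None or max_concurrency <= 1:
        # Current behavior: one synchronous LLM call per document, in order.
        scores = [_score_one(llm, query, doc) for doc in docs]
    else:
        scores = [NEUTRAL_SCORE] * len(docs)
        with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
            # One job per (index, doc) pair; results are mapped back by index.
            futures = {pool.submit(_score_one, llm, query, doc): i
                       for i, doc in enumerate(docs)}
            for fut in as_completed(futures):
                scores[futures[fut]] = fut.result()  # _score_one never raises
    # Attach scores, sort descending, apply top_k — same as the sequential path.
    for doc, score in zip(docs, scores):
        doc["rerank_score"] = score
    return sorted(docs, key=lambda d: d["rerank_score"], reverse=True)[:top_k]
```

Because both branches produce the same `scores` list (only the execution order differs), output shape and tie-breaking stay identical between sequential and concurrent modes.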

Benefits:

  • Reranking latency drops from roughly n × llm_latency to roughly ceil(n / max_concurrency) × llm_latency — approaching a single call's latency at sufficient concurrency
  • Backwards compatible: same public API, same output shape, same default behavior
  • Safer than batched single-prompt strategies — no changes to prompt contract or JSON parsing behavior across providers
  • Per-document failure handling preserved: failed calls fall back to neutral score 0.5 with a warning log

Files Changed:

  • mem0/configs/rerankers/llm.py — add max_concurrency field
  • mem0/reranker/llm_reranker.py — add parallel scoring path via ThreadPoolExecutor
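The config change could look something like this dataclass-based sketch (the actual LLMRerankerConfig in mem0/configs/rerankers/llm.py may be pydantic-based; only the new field and its default matter here):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMRerankerConfig:
    """Simplified stand-in for mem0's reranker config.

    max_concurrency: maximum number of parallel LLM scoring calls.
    None or <= 1 preserves the current fully sequential behavior.
    """
    max_concurrency: Optional[int] = None
```

The None default is what makes the change backwards compatible: existing configs deserialize unchanged and keep the sequential path.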

If this approach sounds reasonable, I’d be happy to open a PR implementing it along with tests (using a mocked llm.generate_response) to verify behavior for:

  • sequential vs concurrent modes,
  • correct mapping of scores to docs,
  • and error handling per document.
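The mocking approach for such tests could look like this (a unittest.mock sketch; the real reranker interface is assumed, not shown — the point is that a MagicMock can stand in for llm.generate_response and let the test control every per-document score deterministically):

```python
from unittest.mock import MagicMock

# Stand-in for the provider client; each call returns a canned score string,
# so tests can verify that scores map back to the right documents.
llm = MagicMock()
llm.generate_response.side_effect = ["0.9", "0.2", "0.7"]

# Each (query, doc) scoring call is independent, so results can be checked
# by mapping call order back to document indices.
scores = [float(llm.generate_response(f"doc {i}")) for i in range(3)]
```

A side_effect list also makes it easy to inject an exception for one document to exercise the neutral-score fallback path.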

Motivation, pitch

I’m using Mem0 in a setting where search calls often return a moderate to large number of candidate memories (e.g. 20–100) that are passed through the LLM-based reranker. Right now, LLMReranker.rerank calls the LLM once per document sequentially, so reranking latency grows roughly linearly with the number of candidates. This becomes a noticeable bottleneck in end-to-end response time, especially when the upstream LLM provider has non-trivial latency per call or when the system needs to handle many concurrent users.

Because each reranking operation is logically independent (it just scores (query, doc_text) pairs), this seems like a good place to introduce safe concurrency with a configurable limit, improving performance and scalability without changing the public API or overall behavior of Mem0’s search pipeline.
