perf: optimize algorithms and data structures for improved performance #2456
PR Description
This PR introduces a set of performance-focused changes across the Ragas codebase. The goal is to reduce unnecessary overhead in evaluation and sampling paths while keeping all outputs fully consistent with the existing behavior. Most updates replace repeated or quadratic operations with linear or cached implementations, resulting in noticeably faster runs on larger datasets.
Key Optimizations
Critical Impact Changes
1. Average Precision Calculation (O(n²) → O(n))
Files Modified:
src/ragas/metrics/_context_precision.py
src/ragas/metrics/collections/context_precision/metric.py

The previous implementation recalculated cumulative sums inside a loop. This PR replaces it with a single-pass cumulative sum approach. This brings down the time cost for average precision calculations, especially when the number of retrieved contexts is large.
Before:
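A minimal sketch of the quadratic pattern described above (the function name and signature are illustrative, not lifted from the Ragas source):

```python
def average_precision(verdicts: list[int]) -> float:
    total_relevant = sum(verdicts)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    for i, verdict in enumerate(verdicts):
        if verdict:
            # Recomputes the cumulative sum from scratch at every position,
            # so the loop does O(n) work per element and O(n^2) overall.
            score += sum(verdicts[: i + 1]) / (i + 1)
    return score / total_relevant
```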
After:
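The same sketch rewritten with a running counter, which is roughly the single-pass shape this change describes:

```python
def average_precision(verdicts: list[int]) -> float:
    total_relevant = sum(verdicts)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for i, verdict in enumerate(verdicts):
        if verdict:
            hits += 1  # running cumulative sum, O(1) per step
            score += hits / (i + 1)
    return score / total_relevant
```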
2. Node Lookup Optimization (O(n) → O(1))
Files Modified:
src/ragas/testset/graph.py

Repeated linear scans over graph nodes caused a noticeable slowdown during test set generation. A dedicated _node_id_cache is added to support O(1) lookups. The cache reconstructs itself automatically after deserialization to avoid stale state.
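For illustration, the caching approach looks roughly like this; only the _node_id_cache name comes from the PR, while the class and method names below are hypothetical:

```python
from typing import Dict, List, Optional


class Node:
    def __init__(self, node_id: str, properties: Optional[dict] = None):
        self.id = node_id
        self.properties = properties or {}


class KnowledgeGraph:
    def __init__(self, nodes: Optional[List[Node]] = None):
        self.nodes: List[Node] = nodes or []
        self._node_id_cache: Dict[str, Node] = {}

    def _rebuild_node_id_cache(self) -> None:
        # Rebuilt on demand, e.g. after deserialization leaves the cache empty.
        self._node_id_cache = {node.id: node for node in self.nodes}

    def get_node_by_id(self, node_id: str) -> Optional[Node]:
        if len(self._node_id_cache) != len(self.nodes):
            self._rebuild_node_id_cache()
        return self._node_id_cache.get(node_id)  # O(1) dict lookup
```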
3. Stratified Sampling Optimization (O(n²) → O(n))
Files Modified:
src/ragas/dataset_schema.py

The sampling loop previously rebuilt sets and lists on each iteration. The updated code computes the shortage once, determines the remaining indices, and uses random.sample() to fetch all missing items in one step. This reduces overhead for large datasets.

Before:
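An illustrative sketch of the per-iteration rebuild (hypothetical names, not the actual dataset_schema.py code):

```python
import random


def top_up_sample_indices(selected: list[int], n_rows: int, target_size: int) -> list[int]:
    while len(selected) < target_size:
        # Sets and lists are rebuilt on every pass through the loop.
        remaining = list(set(range(n_rows)) - set(selected))
        selected.append(random.choice(remaining))
    return selected
```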
After:
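And the single-draw shape described above, again as a sketch:

```python
import random


def top_up_sample_indices(selected: list[int], n_rows: int, target_size: int) -> list[int]:
    shortage = target_size - len(selected)  # computed once
    if shortage > 0:
        remaining = list(set(range(n_rows)) - set(selected))
        selected.extend(random.sample(remaining, shortage))  # one draw for all missing items
    return selected
```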
High Impact Changes
4. Vectorized Hamming Distance
Files Modified:
src/ragas/optimizers/utils.py

Distance computation is now implemented using scipy.spatial.distance utilities instead of nested Python loops. This shifts work to optimized C-backed functions and simplifies the code. The new version also ensures a symmetric distance matrix.

Before:
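A sketch of the nested-loop pattern, with an illustrative signature:

```python
import numpy as np


def hamming_distance_matrix(vectors: np.ndarray) -> np.ndarray:
    n = len(vectors)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Pairwise comparison in a double Python loop.
            distances[i, j] = np.sum(vectors[i] != vectors[j])
    return distances
```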
After:
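A sketch of the vectorized equivalent: pdist with the "hamming" metric returns the fraction of differing positions, so the result is scaled by the vector length to keep raw counts, and squareform expands it into a symmetric matrix:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def hamming_distance_matrix(vectors: np.ndarray) -> np.ndarray:
    condensed = pdist(vectors, metric="hamming") * vectors.shape[1]
    return squareform(condensed)  # symmetric matrix with a zero diagonal
```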
5. Persona Lookup Optimization (O(n) → O(1))
Files Modified:
src/ragas/testset/persona.py

A _name_cache lookup table is added and initialized automatically. This avoids repeated linear scans when resolving persona entries and keeps compatibility with Pydantic’s initialization flow.
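A rough sketch of the lookup table, assuming Pydantic v2; only the _name_cache name comes from the PR, and the model fields and methods shown here are hypothetical:

```python
from typing import Dict, List, Optional

from pydantic import BaseModel, PrivateAttr


class Persona(BaseModel):
    name: str
    role_description: str


class PersonaList(BaseModel):
    personas: List[Persona] = []
    # Private attribute, so the cache stays out of validation and serialization.
    _name_cache: Dict[str, Persona] = PrivateAttr(default_factory=dict)

    def model_post_init(self, __context) -> None:
        # Built once, right after Pydantic finishes initializing the model.
        self._name_cache = {p.name: p for p in self.personas}

    def get(self, name: str) -> Optional[Persona]:
        return self._name_cache.get(name)  # O(1) instead of scanning the list
```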
Medium Impact Changes
6. Batch Creation Cleanup
Files Modified:
src/ragas/dataset_schema.py

Avoids evaluating the same slice twice by storing it in a variable before reuse. This slightly improves batch-related operations and makes the code easier to follow.
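As a small illustration of the pattern (hypothetical helper, not the actual dataset_schema.py code):

```python
def create_batches(samples, batch_size):
    batches = []
    for start in range(0, len(samples), batch_size):
        # Bind the slice once and reuse it, instead of slicing twice
        # (once for the emptiness check and once for the append).
        batch = samples[start : start + batch_size]
        if batch:
            batches.append(batch)
    return batches
```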
7. Streamlined LLM Type Checking
Files Modified:
src/ragas/llms/base.py

Replaces a looped type check with a tuple-based isinstance() call. While not a major performance change, it simplifies the logic and reduces overhead for repeated checks.

Before:
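A sketch of the looped check (illustrative names):

```python
def is_supported_llm(llm, supported_types) -> bool:
    # Checks each candidate type individually.
    for llm_type in supported_types:
        if isinstance(llm, llm_type):
            return True
    return False
```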
After:
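And the tuple-based form, relying on isinstance() accepting a tuple of types:

```python
def is_supported_llm(llm, supported_types) -> bool:
    # isinstance() accepts a tuple of types, so the loop collapses to one call.
    return isinstance(llm, tuple(supported_types))
```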
8. Counter Usage Simplification
Files Modified:
src/ragas/metrics/base.py

Replaces a multi-step process to find the most common element with Counter.most_common(1). This avoids unnecessary intermediate structures.
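For illustration, the replacement pattern looks roughly like this (assumes a non-empty input):

```python
from collections import Counter


def majority_value(values):
    # most_common(1) returns [(element, count)] for the single top element.
    return Counter(values).most_common(1)[0][0]
```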
Design Notes
Cache Management
Both node and persona lookup caches rebuild automatically when needed, keeping lookup operations efficient without requiring callers to manage state.
Backward Compatibility
All optimizations preserve existing behavior, and test suites should pass without any required changes.
Dependencies
scipy is used for the vectorized distance calculations. It is already part of the project dependencies.