feat: Add RAG evaluation module to LISA SDK#956
Merged
gingerknight merged 6 commits into develop on Apr 21, 2026
Conversation
bedanley previously approved these changes (Apr 14, 2026)
bedanley approved these changes (Apr 21, 2026)
Summary
Built a RAG evaluation framework for the LISA SDK that lets you measure retrieval quality across different backends. Give it a golden dataset (queries plus expected documents), point it at the RAG setup, and it produces Precision@k, Recall@k, and NDCG@k scores.
It works with Bedrock Knowledge Bases and LISA's own RAG backends (OpenSearch, PGVector). Run it from the CLI and get a comparison table showing which backend retrieves better. This is useful for tuning chunking strategies or comparing embedding models. The core module lives in lisa-sdk/lisapy/evaluation/. Point it at a YAML config with backends and documents, give it a JSONL file with test queries, and it runs the evaluation.
Main evaluation engine:
Evaluators for Bedrock KB (the retrieve() API) and the LISA API (similarity_search()). Each pulls results, deduplicates chunks back to documents, computes IR metrics, and stores per-query details.
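To make the retrieval step concrete, here is a minimal sketch of how a Bedrock KB evaluator can pull results and collapse chunk hits back to documents. The boto3 bedrock-agent-runtime retrieve() call is the standard AWS API; the function name, its parameters, and the S3-URI-based document mapping are illustrative assumptions rather than the PR's exact code.

```python
# Hedged sketch: the Bedrock KB retrieval step. boto3's bedrock-agent-runtime
# retrieve() is the real AWS API; the helper name and chunk-to-document mapping
# are assumptions for illustration.
import boto3

def retrieve_document_ids(knowledge_base_id: str, query: str, k: int = 5) -> list[str]:
    """Query a Bedrock Knowledge Base and collapse chunk hits to document URIs."""
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": k}},
    )
    doc_ids: list[str] = []
    for result in response["retrievalResults"]:
        # Each hit is a chunk; use its S3 URI as the parent document identifier.
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "")
        if uri and uri not in doc_ids:  # keep only the first occurrence
            doc_ids.append(uri)
    return doc_ids
```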
Metrics implementation:
Standard IR metrics with document-level deduplication. If a document appears in 5 chunks, only its first occurrence counts for ranking purposes.
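As a reference for how these metrics behave after deduplication, here is a minimal sketch assuming binary relevance (a document is either in the expected set or not). The function names and signatures are illustrative, not the module's actual API.

```python
# Hedged sketch of Precision@k, Recall@k, and NDCG@k with document-level dedup.
import math

def dedupe(doc_ids: list[str]) -> list[str]:
    """Keep only the first occurrence of each document (chunk hits collapse)."""
    seen, ordered = set(), []
    for doc in doc_ids:
        if doc not in seen:
            seen.add(doc)
            ordered.append(doc)
    return ordered

def precision_recall_ndcg(retrieved: list[str], expected: set[str], k: int = 5):
    ranked = dedupe(retrieved)[:k]
    hits = [1 if doc in expected else 0 for doc in ranked]
    precision = sum(hits) / k
    recall = sum(hits) / len(expected) if expected else 0.0
    # Binary-gain DCG; the ideal ranking places all expected docs first.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), k)))
    ndcg = dcg / ideal if ideal else 0.0
    return precision, recall, ndcg
```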
Config and datasets:
Pydantic-validated YAML configs define backends and document mappings. JSONL datasets specify queries, expected docs, relevance grades, and query types. The config loader merges document names with S3 bucket prefixes per backend.
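For orientation, here is a hedged sketch of what the Pydantic config models and a JSONL dataset line could look like. All field names (backends, s3_prefix, expected_docs, relevance, query_type, and so on) are assumptions for illustration; the real schema lives in lisa-sdk/lisapy/evaluation/.

```python
# Hedged sketch of a Pydantic-validated config and a JSONL dataset record.
# Field names are illustrative assumptions, not the module's actual schema.
import json
import yaml
from pydantic import BaseModel

class BackendConfig(BaseModel):
    name: str              # e.g. "Bedrock KB" or "OpenSearch"
    type: str              # which evaluator to use for this backend
    s3_prefix: str = ""    # merged with document names to form per-backend doc IDs

class EvalConfig(BaseModel):
    backends: list[BackendConfig]
    documents: list[str]   # document names shared across backends
    k: int = 5

# Validate a YAML config (file name taken from the CLI example below).
with open("eval_config.yaml") as f:
    config = EvalConfig(**yaml.safe_load(f))

# One JSONL line per test query: text, expected docs, relevance grade, query type.
line = '{"query": "How do I rotate API tokens?", "expected_docs": ["auth-guide.pdf"], "relevance": 3, "query_type": "semantic"}'
record = json.loads(line)
```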
Authentication Enhancements
Moved authentication helpers from the integration test utilities into an SDK module, making them available for general use. Added a setup_authentication() helper that combines management key retrieval and DynamoDB token registration. Enhanced get_management_key() with multi-pattern secret name resolution.
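As an illustration of the multi-pattern resolution idea, here is a minimal sketch using the standard boto3 Secrets Manager API. The candidate secret-name patterns and the function signature are assumptions, not the SDK's exact implementation, and the DynamoDB token registration step is omitted.

```python
# Hedged sketch of multi-pattern secret name resolution for the management key.
# The Secrets Manager calls are standard boto3; the name patterns are assumptions.
import boto3
from botocore.exceptions import ClientError

def get_management_key(deployment_name: str) -> str:
    """Try several secret-name patterns and return the first one that resolves."""
    client = boto3.client("secretsmanager")
    candidates = [
        f"{deployment_name}-management-key",
        f"{deployment_name}ManagementKeySecret",  # hypothetical alternate pattern
    ]
    for name in candidates:
        try:
            return client.get_secret_value(SecretId=name)["SecretString"]
        except ClientError:
            continue
    raise RuntimeError(f"No management key secret found for {deployment_name}")
```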
CLI:
python -m lisapy.evaluation --config eval_config.yaml runs everything and prints formatted tables with metrics and comparisons.
Tests:
35 unit tests covering all modules, with mocked boto3 and LISA SDK calls. Example configs with sanitized data live in test/integration/rag/eval_datasets/.
Docs:
Full guide at lib/docs/config/rag-evaluation.md covering setup, config format, metrics explanation, and how to interpret results.
Example Output
$ python -m lisapy.evaluation --config eval_config.yaml --dataset golden-dataset.jsonl

RAG Evaluation — Precision@k, Recall@k, NDCG@k
Golden dataset: 12 queries, k=5
Query types: {'semantic': 8, 'keyword': 3, 'exact': 1}

Running Bedrock KB evaluation...
======================================================================
Bedrock KB — Evaluation Results (k=5)
======================================================================
Precision@5:  0.867
Recall@5:     0.917
NDCG@5:       0.891

Running OpenSearch evaluation...
======================================================================
OpenSearch — Evaluation Results (k=5)
======================================================================
Precision@5:  0.800
Recall@5:     0.833
NDCG@5:       0.824

======================================================================
Cross-Backend Comparison (k=5)
======================================================================
Metric           Bedrock KB    OpenSearch
---------------  ------------  -----------
Precision@5      0.867         0.800
Recall@5         0.917         0.833
NDCG@5           0.891         0.824

Pairwise Deltas:
Comparison                    P@5      R@5      NDCG@5
----------------------------  -------  -------  -------
OpenSearch vs Bedrock KB      -0.067   -0.084   -0.067

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.