feat: Add RAG evaluation module to LISA SDK#956
Merged
gingerknight merged 6 commits into develop on Apr 21, 2026
Conversation
bedanley previously approved these changes (Apr 14, 2026)
bedanley approved these changes (Apr 21, 2026)
Summary
Built a RAG evaluation framework for the LISA SDK that lets you measure retrieval quality across different backends. Give it a golden dataset (queries plus expected documents), point it at the RAG setup, and it produces Precision@k, Recall@k, and NDCG@k scores.
It works with Bedrock Knowledge Bases and LISA's own RAG backends (OpenSearch, PGVector). Run it from the CLI and get a comparison table showing which backend retrieves better. This is useful for tuning chunking strategies or comparing embedding models. The core module lives in lisa-sdk/lisapy/evaluation/. Point it at a YAML config with backends and documents, give it a JSONL file with test queries, and it runs the evaluation.
Main evaluation engine:
Evaluators for Bedrock KB (the retrieve() API) and the LISA API (similarity_search()). Each pulls results, deduplicates chunks back to documents, computes IR metrics, and stores per-query details.
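To make the retrieval step concrete, here is a minimal sketch of how a Bedrock KB evaluator can pull results and collapse chunk hits back to documents. The boto3 bedrock-agent-runtime retrieve() call is the standard AWS API; the function name, its parameters, and the S3-URI-based document mapping are illustrative assumptions rather than the PR's exact code.

```python
# Hedged sketch: the Bedrock KB retrieval step. boto3's bedrock-agent-runtime
# retrieve() is the real AWS API; the helper name and chunk-to-document mapping
# are assumptions for illustration.
import boto3

def retrieve_document_ids(knowledge_base_id: str, query: str, k: int = 5) -> list[str]:
    """Query a Bedrock Knowledge Base and collapse chunk hits to document URIs."""
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": k}},
    )
    doc_ids: list[str] = []
    for result in response["retrievalResults"]:
        # Each hit is a chunk; use its S3 URI as the parent document identifier.
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "")
        if uri and uri not in doc_ids:  # keep only the first occurrence
            doc_ids.append(uri)
    return doc_ids
```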
Metrics implementation:
Standard IR metrics with document-level deduplication. If a document appears in 5 chunks, only its first occurrence counts for ranking purposes.
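As a reference for how these metrics behave after deduplication, here is a minimal sketch assuming binary relevance (a document is either in the expected set or not). The function names and signatures are illustrative, not the module's actual API.

```python
# Hedged sketch of Precision@k, Recall@k, and NDCG@k with document-level dedup.
import math

def dedupe(doc_ids: list[str]) -> list[str]:
    """Keep only the first occurrence of each document (chunk hits collapse)."""
    seen, ordered = set(), []
    for doc in doc_ids:
        if doc not in seen:
            seen.add(doc)
            ordered.append(doc)
    return ordered

def precision_recall_ndcg(retrieved: list[str], expected: set[str], k: int = 5):
    ranked = dedupe(retrieved)[:k]
    hits = [1 if doc in expected else 0 for doc in ranked]
    precision = sum(hits) / k
    recall = sum(hits) / len(expected) if expected else 0.0
    # Binary-gain DCG; the ideal ranking places all expected docs first.
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(expected), k)))
    ndcg = dcg / ideal if ideal else 0.0
    return precision, recall, ndcg
```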
Config and datasets:
Pydantic-validated YAML configs define backends and document mappings. JSONL datasets specify queries, expected docs, relevance grades, and query types. The config loader merges document names with S3 bucket prefixes per backend.
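For orientation, here is a hedged sketch of what the Pydantic config models and a JSONL dataset line could look like. All field names (backends, s3_prefix, expected_docs, relevance, query_type, and so on) are assumptions for illustration; the real schema lives in lisa-sdk/lisapy/evaluation/.

```python
# Hedged sketch of a Pydantic-validated config and a JSONL dataset record.
# Field names are illustrative assumptions, not the module's actual schema.
import json
import yaml
from pydantic import BaseModel

class BackendConfig(BaseModel):
    name: str              # e.g. "Bedrock KB" or "OpenSearch"
    type: str              # which evaluator to use for this backend
    s3_prefix: str = ""    # merged with document names to form per-backend doc IDs

class EvalConfig(BaseModel):
    backends: list[BackendConfig]
    documents: list[str]   # document names shared across backends
    k: int = 5

# Validate a YAML config (file name taken from the CLI example below).
with open("eval_config.yaml") as f:
    config = EvalConfig(**yaml.safe_load(f))

# One JSONL line per test query: text, expected docs, relevance grade, query type.
line = '{"query": "How do I rotate API tokens?", "expected_docs": ["auth-guide.pdf"], "relevance": 3, "query_type": "semantic"}'
record = json.loads(line)
```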
Authentication Enhancements
Moved authentication helpers from the integration test utilities into an SDK module, making them available for general use. Added a setup_authentication() helper that combines management key retrieval and DynamoDB token registration. Enhanced get_management_key() with multi-pattern secret name resolution.
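As an illustration of the multi-pattern resolution idea, here is a minimal sketch using the standard boto3 Secrets Manager API. The candidate secret-name patterns and the function signature are assumptions, not the SDK's exact implementation, and the DynamoDB token registration step is omitted.

```python
# Hedged sketch of multi-pattern secret name resolution for the management key.
# The Secrets Manager calls are standard boto3; the name patterns are assumptions.
import boto3
from botocore.exceptions import ClientError

def get_management_key(deployment_name: str) -> str:
    """Try several secret-name patterns and return the first one that resolves."""
    client = boto3.client("secretsmanager")
    candidates = [
        f"{deployment_name}-management-key",
        f"{deployment_name}ManagementKeySecret",  # hypothetical alternate pattern
    ]
    for name in candidates:
        try:
            return client.get_secret_value(SecretId=name)["SecretString"]
        except ClientError:
            continue
    raise RuntimeError(f"No management key secret found for {deployment_name}")
```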
CLI:
python -m lisapy.evaluation --config eval_config.yaml runs everything and prints formatted tables with metrics and comparisons.
Tests:
35 unit tests covering all modules, with mocked boto3 and LISA SDK calls. Example configs with sanitized data live in test/integration/rag/eval_datasets/.
Docs:
Full guide at lib/docs/config/rag-evaluation.md covering setup, config format, metrics explanation, and how to interpret results.
Example Output
$ python -m lisapy.evaluation --config eval_config.yaml --dataset golden-dataset.jsonl

RAG Evaluation — Precision@k, Recall@k, NDCG@k
Golden dataset: 12 queries, k=5
Query types: {'semantic': 8, 'keyword': 3, 'exact': 1}

Running Bedrock KB evaluation...
======================================================================
Bedrock KB — Evaluation Results (k=5)
======================================================================
Precision@5:  0.867
Recall@5:     0.917
NDCG@5:       0.891

Running OpenSearch evaluation...
======================================================================
OpenSearch — Evaluation Results (k=5)
======================================================================
Precision@5:  0.800
Recall@5:     0.833
NDCG@5:       0.824

======================================================================
Cross-Backend Comparison (k=5)
======================================================================
Metric           Bedrock KB    OpenSearch
---------------  ------------  -----------
Precision@5      0.867         0.800
Recall@5         0.917         0.833
NDCG@5           0.891         0.824

Pairwise Deltas:
Comparison                    P@5      R@5      NDCG@5
----------------------------  -------  -------  -------
OpenSearch vs Bedrock KB      -0.067   -0.084   -0.067

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.