
feat: Add RAG evaluation module to LISA SDK#956

Merged
gingerknight merged 6 commits into develop from feat/rag-evaluation on Apr 21, 2026

Conversation

@gingerknight (Contributor)

Summary

Built a RAG evaluation framework for the LISA SDK that lets you measure retrieval quality across different backends. Give it a golden dataset (queries + expected documents), point it at the RAG setup, and it produces Precision@k, Recall@k, and NDCG@k scores.

Works with Bedrock Knowledge Bases and LISA's own RAG backends (OpenSearch, PGVector). Run it from the CLI to get a comparison table showing which backend retrieves better, which is useful for tuning chunking strategies or comparing embedding models. The core module lives in lisa-sdk/lisapy/evaluation/. Point it at a YAML config that defines backends and documents, give it a JSONL file of test queries, and it runs the evaluation.

  • Outputs aggregate metrics, per-query breakdowns, and cross-backend comparisons with deltas.

Main evaluation engine:

Evaluators for Bedrock KB (retrieve() API) and LISA API (similarity_search()). Each pulls results, deduplicates chunks back to documents, computes IR metrics, stores per-query details.
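The chunk-to-document deduplication step might look like the following sketch. This is not the module's actual code, and the source_document field name is an assumption about the chunk schema:

```python
def chunks_to_documents(chunks):
    """Map retrieved chunks back to their parent documents,
    keeping only the first (highest-ranked) occurrence of each.

    `chunks` is an ordered list of retrieval results; the
    "source_document" key is a hypothetical field name.
    """
    seen, docs = set(), []
    for chunk in chunks:
        doc = chunk["source_document"]
        if doc not in seen:
            seen.add(doc)
            docs.append(doc)
    return docs
```

The resulting ordered document list is what the IR metrics are computed over.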

Metrics implementation:

Standard IR metrics with document-level deduplication. If a document appears in 5 chunks, we only count its first occurrence for ranking purposes.
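The PR doesn't show the metric implementations; as a reference point, the three standard IR metrics over an already-deduplicated ranked document list can be sketched like this (binary relevance; the dataset's relevance grades would feed a graded-gain NDCG variant instead):

```python
import math

def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for d in ranked_docs[:k] if d in relevant)
    return hits / k

def recall_at_k(ranked_docs, relevant, k):
    """Fraction of the relevant documents found in the top k."""
    hits = sum(1 for d in ranked_docs[:k] if d in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked_docs, relevant, k):
    """Binary-relevance NDCG@k: DCG of the actual ranking divided by
    the DCG of an ideal ranking with all relevant docs ranked first."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_docs[:k]) if d in relevant)
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0
```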

Config and datasets:

Pydantic-validated YAML configs define backends and document mappings. JSONL datasets specify queries, expected docs, relevance grades, and query types. The config loader merges document names with S3 bucket prefixes per backend.
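A sketch of what the two inputs might look like. All field names here are illustrative; the actual schema is whatever the Pydantic models in lisa-sdk/lisapy/evaluation/ define:

```yaml
# eval_config.yaml (hypothetical field names)
backends:
  - name: bedrock-kb
    type: bedrock
    knowledge_base_id: KB12345
  - name: opensearch
    type: lisa
    repository_id: my-opensearch-repo
documents:
  - name: user-guide.pdf
    s3_prefix: docs/
```

```jsonl
{"query": "How do I rotate the management key?", "expected_docs": ["user-guide.pdf"], "relevance": {"user-guide.pdf": 3}, "query_type": "semantic"}
```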

Authentication Enhancements

Moved authentication helpers from integration test utilities into SDK module, making them available for general use. Added setup_authentication() helper that combines management key retrieval and DynamoDB token registration. Enhanced get_management_key() with multi-pattern secret name resolution.
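Multi-pattern secret name resolution amounts to trying candidate secret names in order until one resolves. A minimal sketch, with illustrative patterns (not LISA's actual ones) and the secret lookup abstracted behind a callable instead of a real Secrets Manager client:

```python
def get_management_key(deployment, fetch_secret):
    """Resolve the management key by trying several naming patterns.

    `fetch_secret(name)` returns the secret string or raises KeyError;
    in practice it would wrap a Secrets Manager lookup. The candidate
    patterns below are hypothetical.
    """
    candidates = [
        f"{deployment}-lisa-management-key",
        f"{deployment}/LISA/management-key",
    ]
    for name in candidates:
        try:
            return fetch_secret(name)
        except KeyError:
            continue
    raise KeyError(f"no management key secret found for {deployment!r}")
```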

CLI:

`python -m lisapy.evaluation --config eval_config.yaml` runs everything and prints formatted tables with metrics and comparisons.

Tests:

35 unit tests covering all modules. Mock boto3 and LISA SDK calls. Example configs with sanitized data in test/integration/rag/eval_datasets/.

Docs:

Full guide at lib/docs/config/rag-evaluation.md — setup, config format, metrics explanation, how to interpret results.

Example Output

```
$ python -m lisapy.evaluation --config eval_config.yaml --dataset golden-dataset.jsonl

RAG Evaluation — Precision@k, Recall@k, NDCG@k
Golden dataset: 12 queries, k=5
Query types: {'semantic': 8, 'keyword': 3, 'exact': 1}

Running Bedrock KB evaluation...
======================================================================
  Bedrock KB — Evaluation Results (k=5)
======================================================================
  Precision@5:  0.867
  Recall@5:     0.917
  NDCG@5:       0.891

Running OpenSearch evaluation...
======================================================================
  OpenSearch — Evaluation Results (k=5)
======================================================================
  Precision@5:  0.800
  Recall@5:     0.833
  NDCG@5:       0.824

======================================================================
  Cross-Backend Comparison (k=5)
======================================================================
  Metric          Bedrock KB   OpenSearch
  --------------  -----------  -----------
  Precision@5     0.867        0.800
  Recall@5        0.917        0.833
  NDCG@5          0.891        0.824

  Pairwise Deltas:
  Comparison                  P@5      R@5      NDCG@5
  --------------------------  -------  -------  -------
  OpenSearch vs Bedrock KB    -0.067   -0.084   -0.067
```
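The pairwise deltas above are just candidate-minus-baseline differences per metric. A minimal sketch (not the module's actual code) that reproduces the rounding in the table:

```python
def pairwise_deltas(baseline, candidate):
    """Per-metric deltas (candidate minus baseline), rounded to 3
    decimal places as in the comparison table. Both arguments are
    dicts mapping metric name to score and share the same keys."""
    return {m: round(candidate[m] - baseline[m], 3) for m in baseline}
```

Negative values mean the candidate backend scored lower than the baseline on that metric.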

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@gingerknight gingerknight marked this pull request as ready for review April 14, 2026 19:07
bedanley
bedanley previously approved these changes Apr 14, 2026
@gingerknight gingerknight marked this pull request as draft April 16, 2026 17:16
@gingerknight gingerknight marked this pull request as ready for review April 16, 2026 20:14
@gingerknight gingerknight merged commit 8007a93 into develop Apr 21, 2026
8 checks passed
@gingerknight gingerknight deleted the feat/rag-evaluation branch April 21, 2026 17:30
