Comprehensive retrieval quality evaluation for LISA RAG backends using precision, recall, and NDCG metrics.
The RAG evaluation suite measures how well your RAG system retrieves relevant documents for user queries. It supports:
- Bedrock Knowledge Bases - Evaluate AWS Bedrock KB retrieval quality
- LISA API Backends - Evaluate OpenSearch, PGVector, or other LISA-hosted vector stores
- Cross-Backend Comparison - Compare multiple backends side-by-side
- Multiple Metrics - Precision@k, Recall@k, NDCG@k (Normalized Discounted Cumulative Gain)
Prerequisites:
- LISA deployment with RAG enabled (OpenSearch, PGVector, or Bedrock KB)
- AWS credentials configured with access to:
  - AWS Secrets Manager (for LISA management keys)
  - DynamoDB (for token registration)
  - Bedrock (if evaluating Bedrock KB)
- Python environment with the LISA SDK installed:

  ```bash
  source .venv/bin/activate  # Activate LISA venv
  ```
Setup:

1. Create your config file:

   ```bash
   cd test/integration/rag/eval_datasets
   cp eval_config.example.yaml eval_config.yaml
   ```

2. Edit `eval_config.yaml` with your deployment details:
   - AWS region
   - API Gateway URLs
   - Knowledge Base IDs
   - S3 bucket paths
   - Repository and collection IDs

3. Create your golden dataset:

   ```bash
   cp golden-dataset.example.jsonl golden-dataset.jsonl
   ```

4. Edit `golden-dataset.jsonl` with your test queries (see Golden Dataset Format)
Run the evaluation:

```bash
# From repo root
python -m lisapy.evaluation \
    --config test/integration/rag/eval_datasets/eval_config.yaml \
    --dataset test/integration/rag/eval_datasets/golden-dataset.jsonl
```

With verbose logging:

```bash
python -m lisapy.evaluation \
    --config test/integration/rag/eval_datasets/eval_config.yaml \
    --dataset test/integration/rag/eval_datasets/golden-dataset.jsonl \
    --verbose
```

Example `eval_config.yaml`:

```yaml
region: us-east-1  # AWS region
k: 5               # Evaluate top-k results

# Document registry: short names used in golden dataset
documents:
  doc1: "path/to/document1.pdf"
  doc2: "path/to/document2.pdf"

backends:
  # Bedrock Knowledge Base
  bedrock_kb:
    - name: "Bedrock KB Production"
      knowledge_base_id: "ABCDEFGHIJ"
      s3_bucket: "s3://kb-data-bucket"

  # LISA API backends (OpenSearch, PGVector)
  lisa_api:
    - name: "OpenSearch Production"
      api_url: "https://lisa-rest-api-endpoint/STAGE"
      deployment_name: "your-deployment-name"
      repo_id: "opensearch-repo"
      collection_id: "default"
      s3_bucket: "s3://docs-bucket"
```

Key Fields:
| Field | Description |
|---|---|
| `region` | AWS region for your LISA deployment and Bedrock KB |
| `k` | Number of top results to evaluate (Precision@k, Recall@k, NDCG@k) |
| `documents` | Short name → filename mapping; these keys are referenced in the golden dataset |
| `s3_bucket` | S3 bucket prefix, combined with `documents` filenames to build full URIs |
| `api_url` | LISA API Gateway URL (find in CloudFormation outputs or AWS Console) |
| `deployment_name` | LISA deployment name used for authentication |
| `knowledge_base_id` | Bedrock Knowledge Base ID (find in Bedrock console) |
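To make the `s3_bucket` + `documents` relationship concrete, here is a minimal sketch of how full URIs could be derived from those two fields (illustrative only; `build_doc_uris` is a hypothetical helper, not part of the LISA SDK):

```python
def build_doc_uris(s3_bucket: str, documents: dict) -> dict:
    """Join the bucket prefix with each registered document filename."""
    prefix = s3_bucket.rstrip("/")
    return {key: f"{prefix}/{filename}" for key, filename in documents.items()}

uris = build_doc_uris(
    "s3://docs-bucket",
    {"doc1": "path/to/document1.pdf", "doc2": "path/to/document2.pdf"},
)
print(uris["doc1"])  # s3://docs-bucket/path/to/document1.pdf
```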
Finding Your Values:
- API Gateway URL: CloudFormation → Your LISA stack → Outputs → `RestApiUri` or `ApiUri`
- Deployment Name: The value you used for `deploymentName` in `config-custom.yaml`
- Repository ID: Check LISA UI → RAG Repositories, or via API: `GET /repository`
- Knowledge Base ID: AWS Console → Bedrock → Knowledge bases
You can evaluate just one backend by configuring only that section.

OpenSearch Only:

```yaml
backends:
  lisa_api:
    - name: "OpenSearch"
      # ... config
```

Bedrock KB Only:

```yaml
backends:
  bedrock_kb:
    - name: "Bedrock KB"
      # ... config
```

Configure multiple backends to get a comparison report:

```yaml
backends:
  bedrock_kb:
    - name: "Bedrock KB"
      # ...
  lisa_api:
    - name: "OpenSearch"
      # ...
    - name: "PGVector"
      # ...
```

This generates individual reports plus a cross-backend comparison table.
The golden dataset is a JSONL file (one JSON object per line) with your test queries and expected results.
```json
{
  "query": "Your search query text",
  "expected": ["doc1", "doc2"],
  "relevance": {"doc1": 3, "doc2": 2},
  "type": "semantic"
}
```

Fields:
| Field | Required | Description |
|---|---|---|
| `query` | ✓ | The search query text |
| `expected` | ✓ | List of document keys that should be retrieved (from `documents` in config) |
| `relevance` | ✓ | Relevance scores for each expected document (1-3, where 3 = most relevant) |
| `type` | Optional | Query type for breakdown analysis (e.g., "semantic", "keyword", "exact") |
Example entries:

```jsonl
{"query": "How to optimize neural networks?", "expected": ["nn_paper", "opt_guide"], "relevance": {"nn_paper": 3, "opt_guide": 2}, "type": "semantic"}
{"query": "machine learning regularization techniques", "expected": ["regularization_paper"], "relevance": {"regularization_paper": 3}, "type": "keyword"}
{"query": "report.pdf", "expected": ["report"], "relevance": {"report": 3}, "type": "exact"}
```

Best practices:

1. Diverse Query Types:
   - Semantic: conceptual questions that require understanding meaning
   - Keyword: direct keyword matches
   - Exact: filename or precise phrase searches

2. Relevance Scores:
   - 3 = Highly relevant, directly answers the query
   - 2 = Moderately relevant, provides useful context
   - 1 = Marginally relevant, tangentially related

3. Coverage:
   - Test both common and edge-case queries
   - Include queries with 0, 1, and multiple expected documents
   - Cover different document types and topics
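A golden dataset can be sanity-checked before a run with a short script. This is a sketch, not part of lisapy; `validate_dataset` and its signature are hypothetical:

```python
import json

REQUIRED = ("query", "expected", "relevance")

def validate_dataset(lines, known_docs):
    """Check golden-dataset JSONL lines against the config's document registry."""
    errors = []
    for i, line in enumerate(lines, 1):
        entry = json.loads(line)
        # Every entry needs query, expected, and relevance
        for field in REQUIRED:
            if field not in entry:
                errors.append(f"line {i}: missing required field '{field}'")
        # Expected document keys must exist in the config's `documents` mapping
        for doc in entry.get("expected", []):
            if doc not in known_docs:
                errors.append(f"line {i}: unknown document key '{doc}'")
        # Relevance scores are 1-3
        for doc, score in entry.get("relevance", {}).items():
            if score not in (1, 2, 3):
                errors.append(f"line {i}: relevance for '{doc}' must be 1-3")
    return errors

sample = ['{"query": "q", "expected": ["doc1"], "relevance": {"doc1": 3}, "type": "semantic"}']
print(validate_dataset(sample, {"doc1"}))  # []
```

Running this against `golden-dataset.jsonl` before a long evaluation catches typos in document keys early.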
Authentication uses AWS Secrets Manager for management keys:

1. The evaluation tool fetches the management key from Secrets Manager, trying these name patterns:
   - `{deployment_name}-lisa-management-key`
   - `{deployment_name}-management-key`
   - `lisa-{deployment_name}-management-key`

2. A token is registered in the DynamoDB table `{deployment_name}-LISAApiBaseTokenTable`

3. Authenticated requests send both `Api-Key` and `Authorization` headers
Required IAM Permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:REGION:ACCOUNT:secret:*management-key*"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:REGION:ACCOUNT:table/*-LISAApiBaseTokenTable"
    }
  ]
}
```

Bedrock KB evaluation uses standard AWS SDK authentication (the boto3 default credential chain).
Required IAM Permissions:
```json
{
  "Effect": "Allow",
  "Action": ["bedrock:Retrieve"],
  "Resource": "arn:aws:bedrock:REGION:ACCOUNT:knowledge-base/*"
}
```

Metrics:

Precision@k:
- Measures: What fraction of retrieved documents are relevant?
- Formula: (Relevant Retrieved) / k
- Range: 0.0 to 1.0 (higher is better)
- Example: If k=5 and 3 retrieved docs are relevant → Precision@5 = 0.6
Recall@k:
- Measures: What fraction of relevant documents were retrieved?
- Formula: (Relevant Retrieved) / (Total Relevant)
- Range: 0.0 to 1.0 (higher is better)
- Example: If 3 relevant docs exist and 2 were retrieved → Recall = 0.67
NDCG@k (Normalized Discounted Cumulative Gain):
- Measures: Ranking quality (relevant docs should rank higher)
- Penalizes relevant documents that appear lower in results
- Range: 0.0 to 1.0 (higher is better)
- Perfect score (1.0) = all relevant docs retrieved in order of relevance
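The three metrics above can be sketched in a few lines of Python. This is an illustrative implementation under the standard definitions, not the lisapy code; the function names are hypothetical:

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """Normalized discounted cumulative gain: rewards relevant docs ranked early."""
    dcg = sum(
        relevance.get(d, 0) / math.log2(i + 2)  # positions discounted by log2(rank+1)
        for i, d in enumerate(retrieved[:k])
    )
    # Ideal DCG: relevance scores sorted best-first
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["doc1", "docX", "doc2", "docY", "docZ"]
relevance = {"doc1": 3, "doc2": 2}
print(round(precision_at_k(retrieved, relevance, 5), 2))  # 0.4
print(round(recall_at_k(retrieved, relevance, 5), 2))     # 1.0
```

Here NDCG@5 is below 1.0 even though both relevant documents were retrieved, because `doc2` sits at rank 3 instead of rank 2.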
Example output:

```
======================================================================
 OpenSearch — Evaluation Results (k=5)
======================================================================

Precision@5: 0.742
Recall@5:    0.856
NDCG@5:      0.821

By Query Type:

Type          Count    P@5      R@5      NDCG@5
------------  -----  -------  -------  --------
semantic         42    0.714    0.833     0.798
keyword          15    0.800    0.900     0.867
exact             8    0.775    0.875     0.845

Per-Query Breakdown:

Type        Query                                              P      R     NDCG   Retrieved
----------  -----------------------------------------------  -----  -----  ------  ------------------------------
semantic    How to optimize neural networks?                  1.00   1.00   1.00   nn_paper.pdf, opt_guide.pdf
keyword     machine learning regularization                   0.80   0.80   0.92   regularization_paper.pdf, ...
...
```
When evaluating multiple backends:
```
======================================================================
 Cross-Backend Comparison (k=5)
======================================================================

Metric           OpenSearch    PGVector      Bedrock KB
---------------  ------------  ------------  ------------
precision@5      0.742         0.698         0.755
recall@5         0.856         0.812         0.867
ndcg@5           0.821         0.784         0.835

Pairwise Deltas:

Comparison                    P@5       R@5       NDCG@5
----------------------------  --------  --------  --------
PGVector vs OpenSearch        -0.044    -0.044    -0.037
Bedrock KB vs OpenSearch      +0.013    +0.011    +0.014
Bedrock KB vs PGVector        +0.057    +0.055    +0.051
```
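The pairwise deltas are simply per-metric differences between two backends' aggregate scores. A minimal sketch (illustrative; `pairwise_delta` is a hypothetical helper, not part of lisapy):

```python
def pairwise_delta(metrics_a, metrics_b):
    """Per-metric difference, a minus b, matching the report's sign convention."""
    return {m: round(metrics_a[m] - metrics_b[m], 3) for m in metrics_a}

opensearch = {"precision@5": 0.742, "recall@5": 0.856, "ndcg@5": 0.821}
pgvector   = {"precision@5": 0.698, "recall@5": 0.812, "ndcg@5": 0.784}
print(pairwise_delta(pgvector, opensearch))
# {'precision@5': -0.044, 'recall@5': -0.044, 'ndcg@5': -0.037}
```

A negative value means the first backend scored lower on that metric, as in the "PGVector vs OpenSearch" row above.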
Error: `FileNotFoundError`
- Fix: Use absolute paths or run from the repo root

Error: `ValidationError: region field required`
- Fix: Ensure `region:` is set in your config file

Error: `ValidationError: documents field required`
- Fix: Define at least one document in the `documents:` section

Error: `Repository not found`
- Fix: Verify `repo_id` matches an existing repository. List repos:

  ```bash
  curl -H "Authorization: YOUR_TOKEN" https://YOUR-API-URL/repository
  ```

Error: `Bedrock knowledge base not found`
- Fix: Verify `knowledge_base_id` is correct. List KBs:

  ```bash
  aws bedrock-agent list-knowledge-bases
  ```

Error: `S3 object not found`
- Fix: Documents must exist under the `s3_bucket` prefix combined with the filenames in `documents`. Verify with:

  ```bash
  aws s3 ls s3://your-bucket/ --recursive
  ```