Technical reference for developers working on the LLM Evaluation Platform.
Read this section if you're changing eval_mcp code (tools, server, viewer) and want to run your changes against your own IDE.
End users install via uvx --from llm-evaluation-system eval-mcp. Do not iterate against that: it runs the published package from uvx's cache, not your working tree. For development, use these two layers:
- pytest against narrow, deterministic logic (regex, parsing, validation) — milliseconds, catches a specific class of regression cheaply. Not a substitute for end-to-end coverage because the tools spawn real subprocesses, call Bedrock, and write user dirs — mocks give false green builds.
- Claude Code / Desktop pointed at your local build — the primary integration test. This is the ground truth: you exercise the exact same path users will get when you publish. Plan for this to be the main way you verify changes.
git clone https://github.com/awslabs/llm-evaluation-system.git
cd llm-evaluation-system
uv venv
uv pip install -e .
-e installs the package in editable mode: edits in eval_mcp/ take effect on the next MCP restart without reinstalling.
For pure functions (regex parsing, validation, config munging), a handler-level pytest is a cheap safety net. The official MCP servers repo (github.com/modelcontextprotocol/servers) imports handlers directly — no server boot, no transport mocking. Same pattern here:
# tests/test_run_eval.py
import pytest

from eval_mcp.tools.run_eval import _validate_providers

@pytest.mark.asyncio  # assumes pytest-asyncio; redundant if asyncio_mode = "auto"
async def test_invalid_bedrock_id_fails_validation():
    result = await _validate_providers(["bedrock/nonexistent-model"])
    assert result["valid"] is False
    assert "Invalid model ID" in result["failed_providers"][0]["error"]
Run with .venv/bin/pytest tests/. This is useful for pinning down specific regressions after you hit them, not as proof that a feature works end-to-end.
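For the regex and parsing logic mentioned above, a test can be smaller still because nothing needs to be awaited. The sketch below is illustrative only: is_well_formed_provider and its pattern are hypothetical and are not part of eval_mcp, whose real validation rules may differ.

```python
import re

# Hypothetical pattern: "bedrock/" followed by a vendor.model style ID.
# The real rules in eval_mcp may be stricter or looser than this.
_PROVIDER_RE = re.compile(r"^bedrock/[a-z0-9-]+\.[a-z0-9.:-]+$")


def is_well_formed_provider(provider: str) -> bool:
    """Return True if the string looks like a Bedrock provider ID."""
    return _PROVIDER_RE.fullmatch(provider) is not None


def test_accepts_vendor_dot_model():
    assert is_well_formed_provider("bedrock/anthropic.claude-3-haiku-20240307-v1:0")


def test_rejects_id_without_vendor_prefix():
    assert not is_well_formed_provider("bedrock/nonexistent-model")


def test_rejects_other_schemes():
    assert not is_well_formed_provider("openai/gpt-4o")
```

Tests of this shape run in milliseconds and need no AWS credentials, which is what makes them suitable as a regression net rather than as end-to-end proof.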
Mental model: as a maintainer you want your IDE running your code, not PyPI's. Published users get the uvx install; you get the editable install. This divergence is the point — it's how you see unreleased changes before shipping them. Same pattern as any Python package: pip install -e . locally, published version for everyone else.
Configure it once: edit ~/.claude.json and change the eval MCP entry so command points at your editable install's binary. Leave it this way for as long as you're developing on the MCP. No swap-back-when-done step.
"eval": {
  "type": "stdio",
  "command": "/absolute/path/to/llm-evaluation-system/.venv/bin/eval-mcp",
  "env": {}
}
Your dev loop after that:
- Edit code in eval_mcp/*.py.
- /mcp → reconnect eval. That restarts the subprocess, which picks up your edits via the editable install. No reinstall, no IDE restart.
- Call the tool and verify behavior.
That's the entire loop. Reconnect is your "refresh from local." You don't need a dual-entry eval + eval-dev setup — one entry, pointed at local, is the simplest thing that works and matches Anthropic's MCP quickstart pattern for developing servers.
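If you would rather script the one-time config change than edit the file by hand, something like the following could work. It is a sketch under assumptions: the function name is made up, and the top-level mcpServers key is an assumption about how your ~/.claude.json is laid out, so check your own file before relying on it.

```python
import json


def point_eval_at_local(config: dict, binary: str, server: str = "eval") -> dict:
    """Return a copy of the config with the server's command set to `binary`."""
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    servers = updated.setdefault("mcpServers", {})  # assumed top-level key
    entry = servers.setdefault(server, {"type": "stdio", "env": {}})
    entry["command"] = binary
    entry.pop("args", None)  # drop leftover uvx arguments, if any
    return updated


if __name__ == "__main__":
    published = {
        "mcpServers": {
            "eval": {
                "type": "stdio",
                "command": "uvx",
                "args": ["--from", "llm-evaluation-system", "eval-mcp"],
                "env": {},
            }
        }
    }
    local = point_eval_at_local(
        published, "/absolute/path/to/llm-evaluation-system/.venv/bin/eval-mcp"
    )
    print(json.dumps(local["mcpServers"]["eval"], indent=2))
```

The function only transforms a dictionary and never writes to disk, so you can inspect the result before saving it yourself.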
When you ship a release, users on uvx --from llm-evaluation-system eval-mcp pick it up when they run uv cache clean. Your local setup is unaffected — you stay on whatever is in your working tree. If you need to sanity-check the published version behaves as expected (rare), swap the entry temporarily back to uvx --from llm-evaluation-system eval-mcp or run uvx --from llm-evaluation-system eval-mcp --help in a terminal.
The viewer is a pre-built Next.js export served by the Python viewer app. When you change frontend/ source:
cd frontend
npm install # first time only
npm run build:viewer
This compiles the frontend and copies the static output into eval_mcp/viewer_static/. The viewer picks it up on the next eval-mcp view.
.venv/bin/eval-mcp view # results viewer on :4001
.venv/bin/inspect eval <task.py> # run an eval directly via Inspect AI CLI
Releases are tag-triggered via GitHub Actions (.github/workflows/publish.yml). When you push a tag matching v*, CI rebuilds the viewer frontend, builds the Python wheel, verifies the tag matches pyproject.toml, and publishes to PyPI via trusted publishing (OIDC, no token juggling).
Release steps:
# 1. Bump version in pyproject.toml (patch/minor/major per SemVer)
# 2. Sync the lockfile
uv lock
# 3. Commit the version bump + lock + any release-note changes
git add pyproject.toml uv.lock
git commit -m "Release vX.Y.Z: <summary>"
# 4. Push + tag. CI takes over.
git push origin main
git tag vX.Y.Z
git push origin vX.Y.Z
Watch the release run:
gh run list --workflow=publish.yml --limit 1
gh run watch <run-id>
Verify from a clean environment after the run completes:
uvx --refresh --from 'llm-evaluation-system==X.Y.Z' eval-mcp --help
Release discipline:
- Users install via plain uvx --from llm-evaluation-system eval-mcp (cached per user). They won't pick up a new release until they explicitly run uv cache clean llm-evaluation-system. That gives some natural insulation, but assume any published version can reach users at any time.
- Every main-branch commit should be releasable. Land behind a CI gate (pytest + lint + type-check).
- Bump the SemVer appropriately: patch for bug fixes, minor for additive changes, major for breaking tool-signature changes.
- Do not instruct users to put @latest in their IDE config: it forces PyPI resolution on every MCP start (~20s cold start) and causes a "disconnected" state on first connect after every release. See README "Upgrading" for the correct upgrade path.
- uv publish locally is possible but not recommended: it bypasses the CI viewer rebuild and skips the tag-version-matches check. Only reach for it in emergencies.
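The SemVer rule in the list above can be written down as a small helper. This is a sketch for illustration; the names CHANGE_TO_PART and bump are hypothetical, and it handles plain MAJOR.MINOR.PATCH versions only (no pre-release or build suffixes).

```python
# Mapping taken from the release-discipline rule: which part each kind of
# change should bump. The keys are illustrative labels, not a fixed vocabulary.
CHANGE_TO_PART = {
    "bug fix": "patch",
    "additive change": "minor",
    "breaking tool-signature change": "major",
}


def bump(version: str, part: str) -> str:
    """Return `version` (MAJOR.MINOR.PATCH) with the given part incremented."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    for change, part in CHANGE_TO_PART.items():
        print(f"{change}: 1.4.2 -> {bump('1.4.2', part)}")
```

Lower-order parts reset to zero on a minor or major bump, which is the detail most often missed when editing pyproject.toml by hand.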
- Implement the handler in eval_mcp/tools/<name>.py as an async function.
- Register it in eval_mcp/server.py with a typed signature. The docstring is the tool description the LLM sees, so keep it specific about expected ID formats, required prerequisites, and failure modes.
- (Optional) Write a pytest case for any narrow deterministic logic (parsing, validation) so regressions get caught cheaply next time.
- Point Claude Code at your local build (see section 3) and exercise the tool end-to-end through the IDE before publishing.
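As a rough illustration of the first two steps, a handler might look like the sketch below. Everything in it is hypothetical: count_rows is not a real tool in this repo, the ID format is invented, and the registration call in eval_mcp/server.py is left out because it depends on how that file is set up. The point is the shape: an async function, a typed signature, and a docstring that spells out ID format and failure modes.

```python
import asyncio
import re

_DATASET_ID = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")  # hypothetical ID format


async def count_rows(dataset_id: str, limit: int = 100) -> dict:
    """Count rows in a stored dataset (hypothetical example tool).

    Args:
        dataset_id: Lowercase slug such as "support-tickets-v2".
        limit: Maximum number of rows to scan; must be positive.

    Returns {"ok": False, "error": ...} for a malformed ID or limit
    instead of raising, so the caller sees an actionable message.
    """
    if not _DATASET_ID.fullmatch(dataset_id):
        return {"ok": False, "error": f"Invalid dataset ID: {dataset_id!r}"}
    if limit <= 0:
        return {"ok": False, "error": "limit must be positive"}
    await asyncio.sleep(0)  # stand-in for real I/O (storage read, Bedrock call)
    return {"ok": True, "dataset_id": dataset_id, "rows_scanned": 0}


if __name__ == "__main__":
    print(asyncio.run(count_rows("support-tickets-v2")))
    print(asyncio.run(count_rows("Bad ID")))
```

The validation branch is the kind of narrow deterministic logic the optional pytest step is meant to cover.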
The repo root Dockerfile builds a slim container that runs eval-mcp serve as an HTTP MCP (for self-hosting on EC2/ECS/AgentCore). Local dev rarely needs this — use the editable install above.
docker build -t eval-mcp .
docker run -p 8002:8002 \
  -e AWS_REGION=us-west-2 \
  -v ~/.aws:/root/.aws:ro \
  eval-mcp
MCP listens at http://localhost:8002/mcp.
If you're changing the EKS web app (FastAPI + Next.js chat) and want hot reload for all services:
AWS_PROFILE=my-profile make dev # from repo root
Opens http://localhost:4001. Each service hot-reloads independently; see local/README.md for commands (make logs, make restart s=backend, etc.) and the architecture diagram.
                    ┌─────────────────────────────────────────┐
                    │     CloudFront + WAF → Internal ALB     │
                    └───────────────────┬─────────────────────┘
                                        │
         ┌──────────────────────────────┼──────────────────────────────┐
         │                              │                              │
         ▼                              ▼                              ▼
┌─────────────────┐            ┌─────────────────┐            ┌─────────────────┐
│    Frontend     │            │     Backend     │            │   MCP Servers   │
│    (Next.js)    │            │    (FastAPI)    │            │   - synthetic   │
│                 │            │  - Chat/Agent   │            │   - providers   │
└─────────────────┘            │  - Viewer Proxy │            │   - dataset     │
                               └────────┬────────┘            └─────────────────┘
                                        │
                 ┌──────────────────────┼──────────────────────┐
                 │                      │                      │
                 ▼                      ▼                      ▼
          ┌──────────────┐       ┌──────────────┐       ┌──────────────┐
          │    User A    │       │    User B    │       │    User C    │
          │ /data/users/ │       │ /data/users/ │       │ /data/users/ │
          │ - configs    │       │ - configs    │       │ - configs    │
          │ - datasets   │       │ - datasets   │       │ - datasets   │
          │ - viewer     │       │ - viewer     │       │ - viewer     │
          └──────────────┘       └──────────────┘       └──────────────┘
                 │                      │                      │
                 └──────────────────────┼──────────────────────┘
                                        │
                           ┌────────────▼────────────┐
                           │   JSON + EBS Storage    │
                           │   Periodic S3 backup    │
                           └─────────────────────────┘
Internet → CloudFront (HTTPS + WAF) → VPC Origin → Internal ALB → Pods
- Internal ALB: Not internet-facing. Only reachable via CloudFront VPC Origins.
- VPC Origins: CloudFront connects to ALB via AWS private network.
- WAF: Rate limiting (2000 req/5min per IP), AWS managed rules.
- Authentication: Cognito (native or external OIDC IdP) via oauth2-proxy.
- Secrets: Stored in AWS Secrets Manager, synced to K8s via External Secrets Operator.
- Database auth: IAM token authentication (no static passwords).
The repo hosts two deployable things that share some code:
- eval_mcp/: the MCP package published to PyPI as llm-evaluation-system. This is what the README's Install section wires up. Self-contained; no database or web app.
- backend/ + frontend/: the full EKS web app (FastAPI + Next.js chat UI + Cognito auth). ./deploy.sh is its entry point.
llm-evaluation-system/
├── eval_mcp/ # MCP package (published to PyPI)
│ ├── cli.py # `eval-mcp` CLI entry point
│ ├── server.py # FastMCP server registering all tools
│ ├── core/ # Shared utilities (bedrock client, storage, pricing, ...)
│ ├── tools/ # MCP tool handlers (run_eval, generate_qa, ...)
│ ├── bedrock_capture.py # OTel-based Bedrock call capture for agent evals
│ ├── s3_sync.py # Optional team-sharing via S3
│ ├── viewer.py # Local results viewer (FastAPI)
│ └── viewer_static/ # Pre-built Next.js static export (rebuilt via `npm run build:viewer`)
├── backend/ # EKS web app (not used by the MCP)
│ ├── api/ # FastAPI routes (chat, compare, auth)
│ └── core/ # Chat agent, database, mcp_client
├── frontend/ # Next.js source for both the web app AND the viewer static export
├── docker/ # Dockerfiles (web app + MCP serve mode)
├── helm/ # EKS Helm chart + external-secrets config
├── infra/
│ ├── data/ # Terraform: persistent layer (VPC, RDS, S3) for the web app
│ ├── platform/ # Terraform: compute layer (EKS, CloudFront, ALB) for the web app
│ └── modules/eval-logs-bucket/ # Terraform: optional S3 bucket for MCP team sharing
├── deploy.sh / destroy.sh # Web-app deployment
├── manage-users.sh # Web-app Cognito user admin
└── docs/DEVELOPMENT.md # This file
The EKS/web-app content below is the heavyweight deployment path. For the MCP, the section above ("Working on the MCP locally") is what you want.
Infrastructure is split into two Terraform layers with separate state:
Data layer (infra/data), persistent resources that survive ./destroy.sh:
- VPC — 10.0.0.0/16 CIDR, 2 AZs, public/private/intra subnets, NAT gateway, S3 VPC endpoint
- RDS PostgreSQL — db.t3.micro, 20GB (auto-scales to 100GB), IAM auth enabled
- S3 Documents Bucket — User uploads, versioned
- S3 Backup Bucket — Periodic JSON backups, versioned
Platform layer (infra/platform), compute and networking resources destroyed and recreated by the scripts:
- EKS — Kubernetes 1.34, 2x t4g.medium managed node group (ARM64/Graviton)
- Karpenter — Auto-scaling with c/m/r/t instance types (arm64, medium-xlarge)
- ALB — Internal (non-internet-facing), targets: backend (8080), frontend (3000), oauth2-proxy (4180)
- CloudFront — HTTPS CDN with VPC Origin to internal ALB, HTTP/2+3
- WAF — Rate limiting (2000 req/5min per IP), AWS managed rules
- Cognito — User Pool with admin-only signup, optional external OIDC IdP
- CodeBuild — ARM64 image builds, source from S3
- ECR — Private container registry
- Bedrock Logging — Multi-region (us-west-2, us-east-1, us-east-2) CloudWatch logs
- Kubernetes resources — Namespace, ConfigMap, StorageClass, External Secrets, Pod Identity, TGBs
The deploy script reads data-layer outputs and passes them as -var flags to the platform layer. This avoids terraform_remote_state (which exposes entire state including secrets).
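To make the hand-off concrete, here is a sketch of the same idea in Python. It is not how deploy.sh is implemented (that is a shell script); var_flags is a hypothetical helper. It relies only on the documented shape of terraform output -json, where each output is an object with a value field, and it forwards only the names you list, so unrelated or sensitive outputs never reach the platform layer.

```python
import json


def var_flags(outputs_json: str, names: list[str]) -> list[str]:
    """Turn `terraform output -json` text into -var flags for selected names."""
    outputs = json.loads(outputs_json)
    flags = []
    for name in names:
        value = outputs[name]["value"]
        # Strings pass through unchanged; lists are re-encoded as JSON, which
        # is also valid list syntax for a -var expression.
        rendered = value if isinstance(value, str) else json.dumps(value)
        flags.append(f"-var={name}={rendered}")
    return flags


if __name__ == "__main__":
    sample = json.dumps(
        {
            "vpc_id": {"value": "vpc-123", "type": "string", "sensitive": False},
            "private_subnets": {
                "value": ["subnet-a", "subnet-b"],
                "type": ["list", "string"],
                "sensitive": False,
            },
            "db_password": {"value": "secret", "type": "string", "sensitive": True},
        }
    )
    # Only the named outputs are forwarded; db_password is never touched.
    for flag in var_flags(sample, ["vpc_id", "private_subnets"]):
        print(flag)
```

The explicit allow-list is the design choice that matters here: it is what distinguishes this approach from terraform_remote_state, which makes the whole state readable.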
# deploy.sh internally does:
VPC_ID=$(cd infra/data && terraform output -raw vpc_id)
# ... ~13 variables total
cd infra/platform && terraform apply -var="vpc_id=$VPC_ID" ...
To use an external IdP (Okta, Azure AD, etc.) instead of Cognito native auth, edit infra/platform/terraform.tfvars before deploying:
enable_oidc_idp = true
oidc_provider_name = "YourIdP"
oidc_client_id = "your-client-id"
oidc_client_secret_arn = "arn:aws:secretsmanager:..."
oidc_issuer_url = "https://your-idp.example.com"
helm/eval/
├── Chart.yaml # Chart metadata (depends on oauth2-proxy subchart)
├── values.yaml # Shared defaults (multi-environment)
├── values-aws.yaml # AWS EKS overrides
└── templates/
├── deployment.yaml # Backend + Frontend + 3 MCP sidecars + s3-backup
├── service.yaml # Backend, Frontend services
├── pvc.yaml # Backend EBS gp3 storage (200Gi)
├── hpa.yaml # Horizontal Pod Autoscaling
├── pdb.yaml # Pod Disruption Budgets
└── rbac.yaml # RBAC configuration
Helm --set values required for AWS:
- aws.region: AWS region
- aws.accountId: AWS account ID
- projectName: project name with region suffix (e.g., eval-managed-uswest2)
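The region suffix in projectName is the region with its hyphens removed. A small sketch of how the three values fit together follows; the helper names are made up, and the default prefix is taken from the eval-managed example above.

```python
def project_name(region: str, prefix: str = "eval-managed") -> str:
    """Append the region with hyphens removed, e.g. us-west-2 -> uswest2."""
    return f"{prefix}-{region.replace('-', '')}"


def helm_set_flags(region: str, account_id: str) -> list[str]:
    """Build the three --set flags listed above as an argv-style list."""
    values = {
        "aws.region": region,
        "aws.accountId": account_id,
        "projectName": project_name(region),
    }
    flags = []
    for key, value in values.items():
        flags += ["--set", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    print(" ".join(helm_set_flags("us-west-2", "123456789012")))
```

Deriving projectName from the region in one place avoids the two drifting apart when you deploy to a second region.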
| Variable | Default | Description |
|---|---|---|
| region | us-west-2 | AWS region |
| project_name | eval-managed | Resource name prefix |
The platform layer inherits region and project_name, plus:
| Variable | Default | Description |
|---|---|---|
| bedrock_cross_regions | ["us-east-1", "us-east-2"] | Regions for Bedrock cross-region inference logging |
| eks_cluster_version | 1.34 | Kubernetes version |
| cluster_admin_role_arns | [] | Additional IAM roles for EKS admin |
| enable_oidc_idp | false | Use external OIDC identity provider |
| oidc_provider_name | ExternalOIDC | OIDC provider name in Cognito |
| oidc_client_id | "" | OIDC client ID |
| oidc_client_secret_arn | "" | Secrets Manager ARN for OIDC client secret |
| oidc_issuer_url | "" | OIDC issuer URL |
| vpc_id | (required) | From data layer output |
| private_subnets | (required) | From data layer output |
| ... | | (13 data-layer variables total, passed by deploy.sh) |
| Path | Service | Description |
|---|---|---|
| /api/auth/* | frontend | NextAuth authentication |
| /api/* | backend | FastAPI endpoints |
| /viewer/* | backend | Per-user viewer proxy |
| /health | backend | Health check |
| /* | frontend | Next.js app |
Public routes (no auth): /, /_next/*, /favicon.ico — bypassed at ALB level.
Protected routes: /chat, /api/*, /viewer/* — through oauth2-proxy.
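The routing table and the public/protected split can be modelled in a few lines, which is handy when reasoning about where a new path will land. This is a sketch, not the ALB configuration: first-match ordering is an assumption (the more specific /api/auth/* has to be tested before /api/*), and treating every non-public path as protected is a simplification, since /health is not listed on either side above.

```python
from fnmatch import fnmatchcase

# (pattern, service) pairs in match order; first hit wins. The ordering is an
# assumption: /api/auth/* must be tested before the broader /api/*.
ROUTES = [
    ("/api/auth/*", "frontend"),
    ("/api/*", "backend"),
    ("/viewer/*", "backend"),
    ("/health", "backend"),
    ("/*", "frontend"),
]
PUBLIC = ["/", "/_next/*", "/favicon.ico"]


def route(path: str) -> str:
    """Return the service that handles `path`."""
    for pattern, service in ROUTES:
        if fnmatchcase(path, pattern):
            return service
    raise LookupError(path)


def needs_auth(path: str) -> bool:
    """True unless the path is one of the public, auth-bypassed routes."""
    return not any(fnmatchcase(path, pattern) for pattern in PUBLIC)


if __name__ == "__main__":
    for path in ["/", "/chat", "/api/auth/session", "/api/chat", "/viewer/u1/logs"]:
        print(path, route(path), "auth" if needs_auth(path) else "public")
```

Swapping the first two entries of ROUTES would send NextAuth traffic to the backend, which is why rule order deserves a second look whenever a path is added.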
For developers who want to run each phase individually instead of using ./deploy.sh:
cd infra/data
# Create terraform.tfvars with your region
echo 'region = "us-west-2"' > terraform.tfvars
terraform init
terraform apply
cd ../platform
# Create terraform.tfvars with any overrides (optional — defaults work for most setups)
# See infra/platform/variables.tf for all available options
echo 'region = "us-west-2"' > terraform.tfvars
terraform init
terraform apply \
  -var="vpc_id=$(cd ../data && terraform output -raw vpc_id)" \
  -var="private_subnets=$(cd ../data && terraform output -json private_subnets)" \
  # ... (see deploy.sh for all 13 variables)
aws eks update-kubeconfig --name eval-managed --region us-west-2
# Upload source and build images
rm -f /tmp/source.zip && zip -r /tmp/source.zip . -x "*.git*" -x "*/node_modules/*" -x "*/.next/*" -x "*.terraform*"
aws s3 cp /tmp/source.zip s3://$(cd infra/platform && terraform output -raw source_bucket)/source.zip
aws codebuild start-build --project-name $(cd infra/platform && terraform output -raw codebuild_project)
# Check build status
aws codebuild list-builds-for-project \
--project-name eval-managed-image-build \
--query 'ids[0]' --output text | \
xargs -I {} aws codebuild batch-get-builds --ids {} \
--query 'builds[0].buildStatus' --output text
# Deploy with Helm
export AWS_REGION=us-west-2
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION_SUFFIX=$(echo $AWS_REGION | tr -d '-')
helm upgrade --install eval ./helm/eval -n eval-managed \
  -f ./helm/eval/values-aws.yaml \
  --set aws.region=$AWS_REGION \
  --set aws.accountId=$AWS_ACCOUNT_ID \
  --set projectName=eval-managed-${REGION_SUFFIX}
After code changes:
# Upload and rebuild
rm -f /tmp/source.zip && zip -r /tmp/source.zip . -x "*.git*" -x "*/node_modules/*" -x "*/.next/*" -x "*.terraform*"
aws s3 cp /tmp/source.zip s3://$(cd infra/platform && terraform output -raw source_bucket)/source.zip
aws codebuild start-build --project-name eval-managed-image-build
# After build completes, restart pods to pull new images
kubectl rollout restart deployment/backend -n eval-managed
kubectl rollout restart deployment/frontend -n eval-managed
kubectl describe pod -l app=backend -n eval-managed
kubectl logs -l app=backend -n eval-managed --previous
# See what values are being used
helm get values eval -n eval-managed
# See rendered templates
helm template eval ./helm/eval -f ./helm/eval/values-aws.yaml \
  --set aws.region=$AWS_REGION --set aws.accountId=$AWS_ACCOUNT_ID
The backend uses a single-pod architecture with EBS gp3 storage (200Gi). A sidecar backs up the JSON databases to S3 every 15 minutes. Brief downtime (~30s) occurs during deploys.
# Check disk usage
kubectl exec -n eval-managed deployment/backend -- df -h /data
# Expand storage (no downtime, applies automatically)
kubectl patch pvc backend-data -n eval-managed \
  -p '{"spec":{"resources":{"requests":{"storage":"400Gi"}}}}'
# Check backup status
kubectl logs -n eval-managed -l app=backend -c s3-backup
Each layer has independent state in its own directory:
# Data layer state
cd infra/data && terraform state list
# Platform layer state
cd infra/platform && terraform state list
- Dockerfile changes
- New Python dependencies (pyproject.toml)