Production-ready starter for extracting structured data from documents using AWS and Claude. Automatically processes invoices, receipts, and contracts uploaded to S3, extracts key information via Claude, stores results in DynamoDB, and sends email notifications.
Built as a reference implementation by Three Moons Network — an AI consulting practice helping small businesses automate with production-grade systems.
```
                  ┌─────────────────────────────────────────┐
                  │                AWS Cloud                │
                  │                                         │
User uploads ───▶ │  S3 Bucket                              │
   document       │      │                                  │
                  │      ▼  (event notification)            │
                  │  Lambda Processor                       │
                  │      │                                  │
                  │      ├──▶ Claude API (Extraction)       │
                  │      │                                  │
                  │      ├──▶ DynamoDB (Store Results)      │
                  │      │                                  │
                  │      └──▶ SES (Email Notification)      │
                  │                                         │
                  │  CloudWatch (Logs + Alarms)             │
                  │                                         │
                  └─────────────────────────────────────────┘
```
Upload a document (invoice, receipt, or contract) to S3. The Lambda function automatically:
- Detects document type (invoice, receipt, or contract)
- Reads document content from S3
- Passes content to Claude for structured extraction
- Stores extraction results in DynamoDB with timestamp and status
- Sends email notification with extracted data or error details
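The per-document flow can be sketched in Python. `extract_with_claude` exists in `handler.py`; `detect_document_type` and the injected `extract` callable are illustrative names, and S3 reads plus the SES notification are omitted:

```python
import time
import uuid
from datetime import datetime, timezone


def detect_document_type(key: str) -> str:
    """Hypothetical S3-key-prefix heuristic (content inspection omitted)."""
    name = key.rsplit("/", 1)[-1].lower()
    for doc_type in ("invoice", "receipt", "contract"):
        if name.startswith(doc_type):
            return doc_type
    return "invoice"  # illustrative default


def process_record(record: dict, extract) -> dict:
    """Build the DynamoDB item for one S3 event record.

    `extract(content, doc_type)` stands in for extract_with_claude;
    reading the object from S3 and sending the SES email are elided.
    """
    key = record["s3"]["object"]["key"]
    doc_type = detect_document_type(key)
    start = time.monotonic()
    extracted = extract("<document text>", doc_type)
    return {
        "document_id": str(uuid.uuid4()),
        "document_type": doc_type,
        "s3_key": key,
        "extracted_data": extracted,
        "extraction_status": "success",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "processing_time_ms": int((time.monotonic() - start) * 1000),
    }
```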
| Type | Extracts | Example |
|---|---|---|
| invoice | Vendor, amount, date, line items, payment terms | AWS invoice, SaaS subscription bill |
| receipt | Vendor, amount, date, category, payment method | Coffee shop receipt, grocery store receipt |
| contract | Parties, dates, key terms, obligations, renewal | NDA, service agreement, lease |
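As a sketch of how the invoice fields above could be described to Claude, an `EXTRACTION_SCHEMAS` entry might look like this (the actual structure in `handler.py` may differ):

```python
# Hypothetical shape of one EXTRACTION_SCHEMAS entry; the real dict in
# handler.py may differ. Claude is asked to return JSON matching it.
INVOICE_SCHEMA = {
    "vendor": "string",
    "invoice_number": "string",
    "date": "YYYY-MM-DD",
    "due_date": "YYYY-MM-DD",
    "amount": "decimal string",
    "currency": "ISO 4217 code",
    "line_items": [
        {"description": "string", "quantity": "number",
         "unit_price": "decimal string", "total": "decimal string"}
    ],
    "total": "decimal string",
    "payment_terms": "string",
}
```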
Input: PDF or image of a company invoice

Output (DynamoDB record):

```json
{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "document_type": "invoice",
  "s3_key": "uploads/invoice-2024-01.pdf",
  "extracted_data": {
    "vendor": "ACME Corp",
    "invoice_number": "INV-2024-001",
    "date": "2024-01-15",
    "due_date": "2024-02-15",
    "amount": "1500.00",
    "currency": "USD",
    "line_items": [
      {"description": "Consulting Services", "quantity": 10, "unit_price": "100.00", "total": "1000.00"},
      {"description": "License Fee", "quantity": 1, "unit_price": "500.00", "total": "500.00"}
    ],
    "total": "1500.00",
    "payment_terms": "Net 30"
  },
  "extraction_status": "success",
  "timestamp": "2024-01-15T14:30:00Z",
  "processing_time_ms": 2340
}
```

Prerequisites:

- AWS account with CLI configured
- Terraform >= 1.5
- Python 3.11+
- Anthropic API key (console.anthropic.com)
- SES verified email address (for notifications)
```bash
git clone git@github.com:Three-Moons-Network/ai-document-processor.git
cd ai-document-processor
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your API key, region, and SES email.
./scripts/deploy.sh
```

Or apply Terraform manually:

```bash
cd terraform
terraform init
terraform plan -out=tfplan
terraform apply tfplan
```

Terraform outputs the S3 bucket name. Upload documents there:
```bash
S3_BUCKET=$(terraform output -raw s3_bucket_name)

# Upload an invoice
aws s3 cp invoice.pdf "s3://$S3_BUCKET/uploads/invoice-2024-01.pdf"

# Check DynamoDB for results
aws dynamodb scan --table-name ai-document-processor-dev-documents
```

To tear down all resources:

```bash
terraform destroy
```

Project layout:

```
├── src/
│   └── handler.py             # Lambda handler — S3 trigger, Claude extraction, DynamoDB write, SES notify
├── tests/
│   └── test_handler.py        # Unit tests with mocked AWS/Anthropic services
├── terraform/
│   ├── main.tf                # All infra: S3, Lambda, DynamoDB, SES, IAM, CloudWatch
│   ├── outputs.tf             # Bucket name, table name, function ARN
│   ├── backend.tf             # Remote state config (commented for local use)
│   └── terraform.tfvars.example
├── scripts/
│   └── deploy.sh              # Build Lambda zip package
├── .github/workflows/
│   └── ci.yml                 # Test, lint, TF validate, package
├── requirements.txt           # Runtime: anthropic, boto3
└── requirements-dev.txt       # Dev: pytest, ruff, moto
```
| Resource | Purpose |
|---|---|
| S3 Bucket | Document upload landing zone (versioning enabled) |
| Lambda (Python 3.11) | Document processor, 512MB / 60s defaults |
| DynamoDB Table | Stores extraction results with TTL support |
| SES Email Identity | Verified sender for notifications |
| CloudWatch Log Groups | Lambda logs + metrics |
| CloudWatch Alarms | Errors > 5 in 5min, p99 latency > 80% timeout, DynamoDB throttles |
| IAM Role + Policy | Least-privilege: S3 read, DynamoDB write, SES send, logs |
All resources tagged with Project, Environment, ManagedBy, and Owner for cost tracking and governance.
GitHub Actions runs on every push/PR to main:

- Test — `pytest` with mocked S3/DynamoDB/SES/Anthropic (no credentials needed)
- Lint — `ruff format --check` + `ruff check`
- Terraform Validate — `terraform fmt -check`, `init -backend=false`, `validate`
- Package — builds a `lambda.zip` artifact on main branch merges
Add a new document type:

- Add the type to `DOCUMENT_TYPES` in `handler.py`
- Add an extraction schema to `EXTRACTION_SCHEMAS`
- Add a system prompt to `SYSTEM_PROMPTS`
- Update tests in `test_handler.py`

Document type auto-detection uses S3 key prefix heuristics plus content inspection.
**Switch models:** set `anthropic_model` in tfvars:

```bash
terraform plan -var="anthropic_model=claude-opus-4-20250514" -out=tfplan
```

**Add custom extraction fields:** edit the schema JSON in `EXTRACTION_SCHEMAS`. Claude adapts automatically based on the schema structure.

**Customize SES notifications:** modify the `send_notification()` function in `handler.py` to change the email template, recipients, or routing logic.
For low-volume document processing (< 100 documents/month):
| Component | Estimated Monthly Cost |
|---|---|
| Lambda | ~$0 (free tier: 1M requests, 400K GB-seconds) |
| S3 | ~$0 (free tier: 5GB storage + 20K GET/2K PUT) |
| DynamoDB | ~$0 (free tier: 25 write, 25 read units/sec) |
| SES | ~$0.10 (after free tier: $0.10 per 1,000 emails) |
| CloudWatch | ~$0.50 (log storage) |
| Anthropic API | Usage-based (~$3/M input tokens, ~$15/M output tokens for Sonnet) |
Total infrastructure: effectively free. Main cost is Anthropic API usage based on document complexity.
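A quick back-of-envelope for the Anthropic spend at the Sonnet rates in the table, assuming roughly 2,000 input and 500 output tokens per document (figures that vary with document size and schema complexity):

```python
def monthly_claude_cost(docs, input_tokens=2000, output_tokens=500,
                        input_rate=3.0, output_rate=15.0):
    """Estimate monthly Anthropic spend in USD.

    Rates are $/million tokens (Sonnet figures from the table above);
    per-document token counts are rough assumptions.
    """
    input_cost = docs * input_tokens / 1e6 * input_rate
    output_cost = docs * output_tokens / 1e6 * output_rate
    return input_cost + output_cost


# 100 documents/month: ~$0.60 input + ~$0.75 output ≈ $1.35
print(round(monthly_claude_cost(100), 2))
```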
```bash
# Set up
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

# Test handler locally (requires API key)
export ANTHROPIC_API_KEY="sk-ant-..."
python -c "
from src.handler import extract_with_claude
result = extract_with_claude('Sample invoice text', 'invoice')
print(result)
"
```

**Lambda timeout during PDF processing:**
- Increase `lambda_timeout` in terraform.tfvars
- Consider pre-processing PDFs to extract text before upload

**SES emails not delivered:**

- Verify the sender address in `ses_sender_email` is verified in SES
- Check SES is in production mode (not sandbox)
- Review CloudWatch logs for `send_notification` errors

**Claude extraction returning incomplete data:**

- Increase `max_tokens` in terraform.tfvars
- Check document image/text quality (OCR may fail on poor scans)
- Review CloudWatch logs for Claude API errors

**DynamoDB write throttling:**

- Terraform uses PAY_PER_REQUEST billing (auto-scales)
- If still throttling, switch to provisioned capacity in `main.tf`
License: MIT
Charles Harvey (linuxlsr) — Three Moons Network LLC