AI Document Processor

Production-ready starter for extracting structured data from documents using AWS and Claude. Automatically processes invoices, receipts, and contracts uploaded to S3, extracts key information via Claude, stores results in DynamoDB, and sends email notifications.

Built as a reference implementation by Three Moons Network — an AI consulting practice helping small businesses automate with production-grade systems.

Architecture

                    ┌─────────────────────────────────────────┐
                    │              AWS Cloud                  │
                    │                                         │
  User uploads ───▶ │  S3 Bucket                              │
  document          │     │                                   │
                    │     ▼ (event notification)              │
                    │  Lambda Processor                       │
                    │     │                                   │
                    │     ├──▶ Claude API (Extraction)        │
                    │     │                                   │
                    │     ├──▶ DynamoDB (Store Results)       │
                    │     │                                   │
                    │     └──▶ SES (Email Notification)       │
                    │                                         │
                    │  CloudWatch (Logs + Alarms)             │
                    │                                         │
                    └─────────────────────────────────────────┘

What It Does

Upload a document (invoice, receipt, or contract) to S3. The Lambda function automatically:

  1. Detects document type (invoice, receipt, or contract)
  2. Reads document content from S3
  3. Passes content to Claude for structured extraction
  4. Stores extraction results in DynamoDB with timestamp and status
  5. Sends email notification with extracted data or error details
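
Step 1's type detection can be sketched as a pure helper. This is a minimal illustration assuming the S3-key-prefix heuristic described under Customization; the function name and fallback default are hypothetical, and the real handler.py also inspects document content:

```python
# Sketch of document-type detection from the S3 key (hypothetical helper;
# the actual handler additionally falls back to content inspection).
DOCUMENT_TYPES = ("invoice", "receipt", "contract")

def detect_document_type(s3_key: str) -> str:
    """Guess the document type from the uploaded object's key."""
    filename = s3_key.rsplit("/", 1)[-1].lower()
    for doc_type in DOCUMENT_TYPES:
        if doc_type in filename:
            return doc_type
    return "invoice"  # assumed default when no prefix matches
```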

Supported Document Types

| Type     | Extracts                                        | Example                                    |
|----------|-------------------------------------------------|--------------------------------------------|
| invoice  | Vendor, amount, date, line items, payment terms | AWS invoice, SaaS subscription bill        |
| receipt  | Vendor, amount, date, category, payment method  | Coffee shop receipt, grocery store receipt |
| contract | Parties, dates, key terms, obligations, renewal | NDA, service agreement, lease              |

Example Extraction (Invoice)

Input: PDF or image of a company invoice

Output (DynamoDB record):

{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "document_type": "invoice",
  "s3_key": "uploads/invoice-2024-01.pdf",
  "extracted_data": {
    "vendor": "ACME Corp",
    "invoice_number": "INV-2024-001",
    "date": "2024-01-15",
    "due_date": "2024-02-15",
    "amount": "1500.00",
    "currency": "USD",
    "line_items": [
      {"description": "Consulting Services", "quantity": 10, "unit_price": "100.00", "total": "1000.00"},
      {"description": "License Fee", "quantity": 1, "unit_price": "500.00", "total": "500.00"}
    ],
    "total": "1500.00",
    "payment_terms": "Net 30"
  },
  "extraction_status": "success",
  "timestamp": "2024-01-15T14:30:00Z",
  "processing_time_ms": 2340
}

Quick Start

Prerequisites

  • AWS account with CLI configured
  • Terraform >= 1.5
  • Python 3.11+
  • Anthropic API key (console.anthropic.com)
  • SES verified email address (for notifications)

1. Clone and configure

git clone git@github.com:Three-Moons-Network/ai-document-processor.git
cd ai-document-processor
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your API key, region, and SES email

2. Build the Lambda package

./scripts/deploy.sh

3. Deploy infrastructure

cd terraform
terraform init
terraform plan -out=tfplan
terraform apply tfplan

Terraform outputs the S3 bucket name. Upload documents there:

S3_BUCKET=$(terraform output -raw s3_bucket_name)

# Upload an invoice
aws s3 cp invoice.pdf "s3://$S3_BUCKET/uploads/invoice-2024-01.pdf"

# Check DynamoDB for results
aws dynamodb scan --table-name ai-document-processor-dev-documents

4. Tear down

terraform destroy

Project Structure

├── src/
│   └── handler.py            # Lambda handler — S3 trigger, Claude extraction, DynamoDB write, SES notify
├── tests/
│   └── test_handler.py       # Unit tests with mocked AWS/Anthropic services
├── terraform/
│   ├── main.tf               # All infra: S3, Lambda, DynamoDB, SES, IAM, CloudWatch
│   ├── outputs.tf            # Bucket name, table name, function ARN
│   ├── backend.tf            # Remote state config (commented for local use)
│   └── terraform.tfvars.example
├── scripts/
│   └── deploy.sh             # Build Lambda zip package
├── .github/workflows/
│   └── ci.yml                # Test, lint, TF validate, package
├── requirements.txt          # Runtime: anthropic, boto3
└── requirements-dev.txt      # Dev: pytest, ruff, moto

Infrastructure Details

| Resource              | Purpose                                                                   |
|-----------------------|---------------------------------------------------------------------------|
| S3 Bucket             | Document upload landing zone (versioning enabled)                         |
| Lambda (Python 3.11)  | Document processor, 512 MB memory / 60 s timeout defaults                 |
| DynamoDB Table        | Stores extraction results with TTL support                                |
| SES Email Identity    | Verified sender for notifications                                         |
| CloudWatch Log Groups | Lambda logs + metrics                                                     |
| CloudWatch Alarms     | Errors > 5 in 5 min, p99 latency > 80% of timeout, DynamoDB throttles     |
| IAM Role + Policy     | Least-privilege: S3 read, DynamoDB write, SES send, logs                  |

All resources tagged with Project, Environment, ManagedBy, and Owner for cost tracking and governance.

CI/CD

GitHub Actions runs on every push/PR to main:

  • Test — pytest with mocked S3/DynamoDB/SES/Anthropic (no credentials needed)
  • Lint — ruff format --check + ruff check
  • Terraform Validate — fmt -check, init -backend=false, validate
  • Package — builds lambda.zip artifact on main branch merges

Customization

Add a new document type:

  1. Add the type to DOCUMENT_TYPES in handler.py
  2. Add an extraction schema to EXTRACTION_SCHEMAS
  3. Add a system prompt to SYSTEM_PROMPTS
  4. Update tests in test_handler.py

Document type auto-detection uses S3 key prefix heuristics plus content inspection, so registered types are picked up without further wiring.
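
As a sketch, the registration steps amount to extending three dictionaries in handler.py. The dictionary names come from the steps above; the "purchase_order" type and its fields are hypothetical, and the existing entries shown are abbreviated stand-ins for the real ones:

```python
# Abbreviated stand-ins for the structures in handler.py.
DOCUMENT_TYPES = ["invoice", "receipt", "contract"]
EXTRACTION_SCHEMAS = {
    "invoice": {"vendor": "string", "amount": "string", "date": "YYYY-MM-DD"},
}
SYSTEM_PROMPTS = {
    "invoice": "Extract invoice fields as JSON matching the given schema.",
}

# Step 1: register the new type (hypothetical example)
DOCUMENT_TYPES.append("purchase_order")

# Step 2: describe what Claude should extract (fields are illustrative)
EXTRACTION_SCHEMAS["purchase_order"] = {
    "po_number": "string",
    "vendor": "string",
    "ship_date": "YYYY-MM-DD",
    "total": "decimal string",
}

# Step 3: tell Claude how to behave for this type
SYSTEM_PROMPTS["purchase_order"] = (
    "You extract purchase order data. Return only JSON matching the schema."
)
```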

Switch models:

Set anthropic_model in tfvars:

terraform plan -var="anthropic_model=claude-opus-4-20250514" -out=tfplan

Add custom extraction fields:

Edit the schema JSON in EXTRACTION_SCHEMAS. Claude adapts automatically based on the schema structure.

Customize SES notifications:

Modify the send_notification() function in handler.py to change email template, recipients, or routing logic.
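
One way to structure the customization, sketched under assumptions: split a pure build_notification() helper (easy to unit-test and to swap for an HTML template) from the boto3 SES send. The function signatures and field names here are illustrative, not the repo's actual code:

```python
def build_notification(record: dict) -> tuple[str, str]:
    """Build (subject, body) from an extraction record. Pure function,
    so templates can be customized and tested without touching AWS."""
    status = record.get("extraction_status", "unknown")
    subject = (
        f"[{status}] {record.get('document_type', 'document')}: "
        f"{record.get('s3_key', '')}"
    )
    lines = [f"{k}: {v}" for k, v in record.get("extracted_data", {}).items()]
    return subject, "\n".join(lines) or "No fields extracted."

def send_notification(record: dict, sender: str, recipient: str) -> None:
    """Send the notification via SES (imported lazily so the pure helper
    can be tested without AWS dependencies installed)."""
    import boto3

    subject, body = build_notification(record)
    boto3.client("ses").send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
```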

Cost Estimate

For low-volume document processing (< 100 documents/month):

| Component     | Estimated Monthly Cost                                            |
|---------------|-------------------------------------------------------------------|
| Lambda        | ~$0 (free tier: 1M requests, 400K GB-seconds)                     |
| S3            | ~$0 (free tier: 5 GB storage + 20K GET / 2K PUT)                  |
| DynamoDB      | ~$0 (free tier: 25 write, 25 read capacity units/sec)             |
| SES           | ~$0.10 (after free tier: $0.10 per 1,000 emails)                  |
| CloudWatch    | ~$0.50 (log storage)                                              |
| Anthropic API | Usage-based (~$3/M input tokens, ~$15/M output tokens for Sonnet) |

Total infrastructure: effectively free. Main cost is Anthropic API usage based on document complexity.
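
To make the API line item concrete, a back-of-envelope calculation using the Sonnet rates quoted above. The per-document token counts are assumptions (a one-page invoice might run ~2,000 input / ~500 output tokens); actual usage varies with document size:

```python
# Rough per-document Claude cost at the Sonnet rates quoted above.
INPUT_RATE_PER_M = 3.00    # USD per million input tokens
OUTPUT_RATE_PER_M = 15.00  # USD per million output tokens

def estimate_claude_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated USD cost for one extraction call."""
    return (
        input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    ) / 1_000_000

# Hypothetical one-page invoice: ~2,000 input tokens, ~500 output tokens
per_doc = estimate_claude_cost(2_000, 500)  # ≈ $0.0135 per document
monthly = per_doc * 100                     # ≈ $1.35 at 100 documents/month
```

At the low volumes this starter targets, API spend stays around a dollar a month, consistent with the "effectively free" infrastructure estimate above.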

Local Development

# Set up
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

# Test handler locally (requires API key)
export ANTHROPIC_API_KEY="sk-ant-..."
python -c "
from src.handler import extract_with_claude
result = extract_with_claude('Sample invoice text', 'invoice')
print(result)
"

Troubleshooting

Lambda timeout during PDF processing:

  • Increase lambda_timeout in terraform.tfvars
  • Consider pre-processing PDFs to extract text before upload

SES emails not delivered:

  • Verify sender email in ses_sender_email is verified in SES
  • Check SES is in production mode (not sandbox)
  • Review CloudWatch logs for send_notification errors

Claude extraction returning incomplete data:

  • Increase max_tokens in terraform.tfvars
  • Check document image/text quality (OCR may fail on poor scans)
  • Review CloudWatch logs for Claude API errors

DynamoDB write throttling:

  • Terraform uses PAY_PER_REQUEST billing (auto-scales)
  • If still throttling, switch to provisioned capacity in main.tf

License

MIT

Author

Charles Harvey (linuxlsr) — Three Moons Network LLC
