AI Document Processor

Production-ready starter for extracting structured data from documents using AWS and Claude. Automatically processes invoices, receipts, and contracts uploaded to S3, extracts key information via Claude, stores results in DynamoDB, and sends email notifications.

Built as a reference implementation by Three Moons Network — an AI consulting practice helping small businesses automate with production-grade systems.

Architecture

                    ┌─────────────────────────────────────────┐
                    │              AWS Cloud                  │
                    │                                         │
  User uploads ───▶ │  S3 Bucket                              │
  document          │     │                                   │
                    │     ▼ (event notification)              │
                    │  Lambda Processor                       │
                    │     │                                   │
                    │     ├──▶ Claude API (Extraction)        │
                    │     │                                   │
                    │     ├──▶ DynamoDB (Store Results)       │
                    │     │                                   │
                    │     └──▶ SES (Email Notification)       │
                    │                                         │
                    │  CloudWatch (Logs + Alarms)             │
                    │                                         │
                    └─────────────────────────────────────────┘

What It Does

Upload a document (invoice, receipt, or contract) to S3. The Lambda function automatically:

  1. Detects document type (invoice, receipt, or contract)
  2. Reads document content from S3
  3. Passes content to Claude for structured extraction
  4. Stores extraction results in DynamoDB with timestamp and status
  5. Sends email notification with extracted data or error details
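
Step 1's type detection can be sketched as a pure helper. This is a minimal illustration assuming the S3-key-prefix heuristic described under Customization; the function name and fallback default are hypothetical, and the real handler.py also inspects document content:

```python
# Sketch of document-type detection from the S3 key (hypothetical helper;
# the actual handler additionally falls back to content inspection).
DOCUMENT_TYPES = ("invoice", "receipt", "contract")

def detect_document_type(s3_key: str) -> str:
    """Guess the document type from the uploaded object's key."""
    filename = s3_key.rsplit("/", 1)[-1].lower()
    for doc_type in DOCUMENT_TYPES:
        if doc_type in filename:
            return doc_type
    return "invoice"  # assumed default when no prefix matches
```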

Supported Document Types

| Type     | Extracts                                        | Example                                    |
|----------|-------------------------------------------------|--------------------------------------------|
| invoice  | Vendor, amount, date, line items, payment terms | AWS invoice, SaaS subscription bill        |
| receipt  | Vendor, amount, date, category, payment method  | Coffee shop receipt, grocery store receipt |
| contract | Parties, dates, key terms, obligations, renewal | NDA, service agreement, lease              |

Example Extraction (Invoice)

Input: PDF or image of a company invoice

Output (DynamoDB record):

{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "document_type": "invoice",
  "s3_key": "uploads/invoice-2024-01.pdf",
  "extracted_data": {
    "vendor": "ACME Corp",
    "invoice_number": "INV-2024-001",
    "date": "2024-01-15",
    "due_date": "2024-02-15",
    "amount": "1500.00",
    "currency": "USD",
    "line_items": [
      {"description": "Consulting Services", "quantity": 10, "unit_price": "100.00", "total": "1000.00"},
      {"description": "License Fee", "quantity": 1, "unit_price": "500.00", "total": "500.00"}
    ],
    "total": "1500.00",
    "payment_terms": "Net 30"
  },
  "extraction_status": "success",
  "timestamp": "2024-01-15T14:30:00Z",
  "processing_time_ms": 2340
}

Quick Start

Prerequisites

  • AWS account with CLI configured
  • Terraform >= 1.5
  • Python 3.11+
  • Anthropic API key (console.anthropic.com)
  • SES verified email address (for notifications)

1. Clone and configure

git clone git@github.com:Three-Moons-Network/ai-document-processor.git
cd ai-document-processor
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# Edit terraform.tfvars with your API key, region, and SES email

2. Build the Lambda package

./scripts/deploy.sh

3. Deploy infrastructure

cd terraform
terraform init
terraform plan -out=tfplan
terraform apply tfplan

Terraform outputs the S3 bucket name. Upload documents there:

S3_BUCKET=$(terraform output -raw s3_bucket_name)

# Upload an invoice
aws s3 cp invoice.pdf "s3://$S3_BUCKET/uploads/invoice-2024-01.pdf"

# Check DynamoDB for results
aws dynamodb scan --table-name ai-document-processor-dev-documents

4. Tear down

terraform destroy

Project Structure

├── src/
│   └── handler.py            # Lambda handler — S3 trigger, Claude extraction, DynamoDB write, SES notify
├── tests/
│   └── test_handler.py       # Unit tests with mocked AWS/Anthropic services
├── terraform/
│   ├── main.tf               # All infra: S3, Lambda, DynamoDB, SES, IAM, CloudWatch
│   ├── outputs.tf            # Bucket name, table name, function ARN
│   ├── backend.tf            # Remote state config (commented for local use)
│   └── terraform.tfvars.example
├── scripts/
│   └── deploy.sh             # Build Lambda zip package
├── .github/workflows/
│   └── ci.yml                # Test, lint, TF validate, package
├── requirements.txt          # Runtime: anthropic, boto3
└── requirements-dev.txt      # Dev: pytest, ruff, moto

Infrastructure Details

| Resource              | Purpose                                                                   |
|-----------------------|---------------------------------------------------------------------------|
| S3 Bucket             | Document upload landing zone (versioning enabled)                         |
| Lambda (Python 3.11)  | Document processor, 512 MB memory / 60 s timeout defaults                 |
| DynamoDB Table        | Stores extraction results with TTL support                                |
| SES Email Identity    | Verified sender for notifications                                         |
| CloudWatch Log Groups | Lambda logs + metrics                                                     |
| CloudWatch Alarms     | Errors > 5 in 5 min, p99 latency > 80% of timeout, DynamoDB throttles     |
| IAM Role + Policy     | Least-privilege: S3 read, DynamoDB write, SES send, logs                  |

All resources tagged with Project, Environment, ManagedBy, and Owner for cost tracking and governance.

CI/CD

GitHub Actions runs on every push/PR to main:

  • Test — pytest with mocked S3/DynamoDB/SES/Anthropic (no credentials needed)
  • Lint — ruff format --check + ruff check
  • Terraform Validate — fmt -check, init -backend=false, validate
  • Package — builds lambda.zip artifact on main branch merges

Customization

Add a new document type:

  1. Add the type to DOCUMENT_TYPES in handler.py
  2. Add an extraction schema to EXTRACTION_SCHEMAS
  3. Add a system prompt to SYSTEM_PROMPTS
  4. Update tests in test_handler.py

Document type auto-detection uses S3 key prefix heuristics plus content inspection, so registered types are picked up without further wiring.
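
As a sketch, the registration steps amount to extending three dictionaries in handler.py. The dictionary names come from the steps above; the "purchase_order" type and its fields are hypothetical, and the existing entries shown are abbreviated stand-ins for the real ones:

```python
# Abbreviated stand-ins for the structures in handler.py.
DOCUMENT_TYPES = ["invoice", "receipt", "contract"]
EXTRACTION_SCHEMAS = {
    "invoice": {"vendor": "string", "amount": "string", "date": "YYYY-MM-DD"},
}
SYSTEM_PROMPTS = {
    "invoice": "Extract invoice fields as JSON matching the given schema.",
}

# Step 1: register the new type (hypothetical example)
DOCUMENT_TYPES.append("purchase_order")

# Step 2: describe what Claude should extract (fields are illustrative)
EXTRACTION_SCHEMAS["purchase_order"] = {
    "po_number": "string",
    "vendor": "string",
    "ship_date": "YYYY-MM-DD",
    "total": "decimal string",
}

# Step 3: tell Claude how to behave for this type
SYSTEM_PROMPTS["purchase_order"] = (
    "You extract purchase order data. Return only JSON matching the schema."
)
```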

Switch models:

Set anthropic_model in tfvars:

terraform plan -var="anthropic_model=claude-opus-4-20250514" -out=tfplan

Add custom extraction fields:

Edit the schema JSON in EXTRACTION_SCHEMAS. Claude adapts automatically based on the schema structure.

Customize SES notifications:

Modify the send_notification() function in handler.py to change email template, recipients, or routing logic.
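
One way to structure the customization, sketched under assumptions: split a pure build_notification() helper (easy to unit-test and to swap for an HTML template) from the boto3 SES send. The function signatures and field names here are illustrative, not the repo's actual code:

```python
def build_notification(record: dict) -> tuple[str, str]:
    """Build (subject, body) from an extraction record. Pure function,
    so templates can be customized and tested without touching AWS."""
    status = record.get("extraction_status", "unknown")
    subject = (
        f"[{status}] {record.get('document_type', 'document')}: "
        f"{record.get('s3_key', '')}"
    )
    lines = [f"{k}: {v}" for k, v in record.get("extracted_data", {}).items()]
    return subject, "\n".join(lines) or "No fields extracted."

def send_notification(record: dict, sender: str, recipient: str) -> None:
    """Send the notification via SES (imported lazily so the pure helper
    can be tested without AWS dependencies installed)."""
    import boto3

    subject, body = build_notification(record)
    boto3.client("ses").send_email(
        Source=sender,
        Destination={"ToAddresses": [recipient]},
        Message={
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body}},
        },
    )
```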

Cost Estimate

For low-volume document processing (< 100 documents/month):

| Component     | Estimated Monthly Cost                                            |
|---------------|-------------------------------------------------------------------|
| Lambda        | ~$0 (free tier: 1M requests, 400K GB-seconds)                     |
| S3            | ~$0 (free tier: 5 GB storage + 20K GET / 2K PUT)                  |
| DynamoDB      | ~$0 (free tier: 25 write, 25 read capacity units/sec)             |
| SES           | ~$0.10 (after free tier: $0.10 per 1,000 emails)                  |
| CloudWatch    | ~$0.50 (log storage)                                              |
| Anthropic API | Usage-based (~$3/M input tokens, ~$15/M output tokens for Sonnet) |

Total infrastructure: effectively free. Main cost is Anthropic API usage based on document complexity.
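
To make the API line item concrete, a back-of-envelope calculation using the Sonnet rates quoted above. The per-document token counts are assumptions (a one-page invoice might run ~2,000 input / ~500 output tokens); actual usage varies with document size:

```python
# Rough per-document Claude cost at the Sonnet rates quoted above.
INPUT_RATE_PER_M = 3.00    # USD per million input tokens
OUTPUT_RATE_PER_M = 15.00  # USD per million output tokens

def estimate_claude_cost(input_tokens: int, output_tokens: int) -> float:
    """Return estimated USD cost for one extraction call."""
    return (
        input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    ) / 1_000_000

# Hypothetical one-page invoice: ~2,000 input tokens, ~500 output tokens
per_doc = estimate_claude_cost(2_000, 500)  # ≈ $0.0135 per document
monthly = per_doc * 100                     # ≈ $1.35 at 100 documents/month
```

At the low volumes this starter targets, API spend stays around a dollar a month, consistent with the "effectively free" infrastructure estimate above.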

Local Development

# Set up
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt -r requirements-dev.txt

# Run tests
pytest tests/ -v

# Lint
ruff check src/ tests/
ruff format src/ tests/

# Test handler locally (requires API key)
export ANTHROPIC_API_KEY="sk-ant-..."
python -c "
from src.handler import extract_with_claude
result = extract_with_claude('Sample invoice text', 'invoice')
print(result)
"

Troubleshooting

Lambda timeout during PDF processing:

  • Increase lambda_timeout in terraform.tfvars
  • Consider pre-processing PDFs to extract text before upload

SES emails not delivered:

  • Verify sender email in ses_sender_email is verified in SES
  • Check SES is in production mode (not sandbox)
  • Review CloudWatch logs for send_notification errors

Claude extraction returning incomplete data:

  • Increase max_tokens in terraform.tfvars
  • Check document image/text quality (OCR may fail on poor scans)
  • Review CloudWatch logs for Claude API errors

DynamoDB write throttling:

  • Terraform uses PAY_PER_REQUEST billing (auto-scales)
  • If still throttling, switch to provisioned capacity in main.tf

License

MIT

Author

Charles Harvey (linuxlsr) — Three Moons Network LLC
