Cloud Deployment Guide

This guide covers deploying the lakehouse stack on AWS using managed services. The same patterns apply to other cloud providers (GCP, Azure) with equivalent services.

Architecture Comparison

Local Component	AWS Equivalent	Notes
PostgreSQL 16	Amazon RDS PostgreSQL	Managed, multi-AZ available
SeaweedFS	Amazon S3	Native S3, no emulation needed
Spark 4.x	Amazon EMR	Runs Spark 3.5 on emr-7.0.0; or EMR Serverless for on-demand
Kafka	Amazon MSK	Or MSK Serverless
Docker containers	ECS/EKS	Optional for custom workloads

AWS Architecture

                    ┌──────────────────────────────────────────┐
                    │                Amazon VPC                │
                    │                                          │
┌───────────┐       │  ┌────────────────┐  ┌────────────────┐  │
│  Client   │───────┼──│  EMR Cluster   │  │  MSK Cluster   │  │
└───────────┘       │  │  (Spark 3.5)   │  │  (Kafka)       │  │
                    │  └───────┬────────┘  └───────┬────────┘  │
                    │          │                   │           │
                    │          ▼                   ▼           │
                    │  ┌────────────────────────────────────┐  │
                    │  │       Amazon S3 (Data Lake)        │  │
                    │  │     s3://your-bucket/warehouse     │  │
                    │  └────────────────────────────────────┘  │
                    │                                          │
                    │  ┌────────────────┐                      │
                    │  │ RDS PostgreSQL │  (Iceberg Catalog)   │
                    │  └────────────────┘                      │
                    └──────────────────────────────────────────┘

Cost Estimates (US regions, on-demand)

Service	Configuration	Monthly Cost
RDS PostgreSQL	db.t3.micro, 20GB	~$15
S3	100GB storage + requests	~$3
EMR	1 master + 2 core (m5.xlarge), 8hr/day	~$200
MSK	kafka.t3.small, 3 brokers	~$150
Development Total		~$370/month
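
The development total is the sum of the line items above. A minimal sketch for redoing that arithmetic when a line item changes; the function name `monthly_total` is hypothetical, and the inputs are the rounded table figures rather than live AWS prices.

```bash
# Sum whole-dollar monthly estimates passed as arguments.
# The figures used below are the rounded values from the cost table.
monthly_total() {
  total=0
  for item in "$@"; do
    total=$((total + item))
  done
  echo "$total"
}

# RDS + S3 + EMR + MSK
monthly_total 15 3 200 150   # prints 368, which the table rounds to ~$370
```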

Cost Optimization Tips:

- Use EMR Serverless for sporadic workloads (pay per use)
- Use Spot instances for EMR workers (60-80% savings)
- Use MSK Serverless for low-throughput streaming
- Schedule EMR clusters to shut down outside work hours
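
One way to approach the scheduled shutdown is a small script run from cron or another scheduler at the end of the day. The sketch below assumes cluster names share a common prefix (such as `lakehouse`); the function name `stop_active_clusters` is hypothetical.

```bash
# Terminate every active EMR cluster whose name starts with the given prefix.
stop_active_clusters() {
  prefix="$1"
  ids=$(aws emr list-clusters --active \
    --query "Clusters[?starts_with(Name, '${prefix}')].Id" --output text)
  if [ -z "$ids" ]; then
    echo "no active clusters"
    return 0
  fi
  # Intentionally unquoted: each whitespace-separated ID becomes its own argument.
  aws emr terminate-clusters --cluster-ids $ids
}
```

Calling `stop_active_clusters lakehouse` from a weekday-evening cron entry would cover the cluster created later in this guide. Termination discards anything stored only on the cluster, so write results to S3 first.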

Setup Instructions

Prerequisites

# Install AWS CLI v2 (Linux x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# Configure credentials
aws configure
# Enter: AWS Access Key ID, Secret Access Key, Region (e.g., us-west-2)

# Install Terraform (optional, for IaC)
brew tap hashicorp/tap && brew install hashicorp/tap/terraform  # macOS
# or see https://terraform.io/downloads
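
Before starting, it can help to confirm that the command-line tools used in this guide are installed. A minimal sketch; `missing_tools` is a hypothetical helper name, and `psql` is included because later steps use it.

```bash
# Print the name of every tool in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: run the credential check only when nothing is missing.
# [ -z "$(missing_tools aws psql)" ] && aws sts get-caller-identity
```

`aws sts get-caller-identity` prints the account and identity behind the configured credentials, which is a quick way to confirm that `aws configure` worked.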

1. Create S3 Bucket (Data Lake)

# Create bucket
aws s3 mb s3://your-lakehouse-bucket --region us-west-2

# Create a warehouse/ prefix marker (optional: S3 has no real directories)
aws s3api put-object --bucket your-lakehouse-bucket --key warehouse/

2. Create RDS PostgreSQL (Iceberg Catalog)

# Create DB subnet group (use your VPC subnets)
aws rds create-db-subnet-group \
  --db-subnet-group-name lakehouse-db-subnet \
  --db-subnet-group-description "Lakehouse DB subnets" \
  --subnet-ids subnet-xxx subnet-yyy

# Create RDS instance
aws rds create-db-instance \
  --db-instance-identifier lakehouse-catalog \
  --db-instance-class db.t3.micro \
  --engine postgres \
  --engine-version 16 \
  --master-username lakehouse \
  --master-user-password YourSecurePassword123 \
  --allocated-storage 20 \
  --db-subnet-group-name lakehouse-db-subnet \
  --vpc-security-group-ids sg-xxx \
  --no-publicly-accessible

# Wait for instance to be available
aws rds wait db-instance-available --db-instance-identifier lakehouse-catalog

# Get endpoint
aws rds describe-db-instances \
  --db-instance-identifier lakehouse-catalog \
  --query 'DBInstances[0].Endpoint.Address' --output text

Create the Iceberg catalog database:

# Connect via bastion or VPN; psql prompts for the password
psql -h <rds-endpoint> -U lakehouse -d postgres -c "CREATE DATABASE iceberg_catalog;"

3. Create EMR Cluster (Spark)

Create a bootstrap script that installs PyIceberg on every node:

#!/bin/bash
# Save as bootstrap.sh, then upload to s3://your-lakehouse-bucket/scripts/bootstrap.sh
sudo pip3 install pyiceberg

# Upload bootstrap script
aws s3 cp bootstrap.sh s3://your-lakehouse-bucket/scripts/

# Create EMR cluster with Spark 3.5 + Iceberg
# Note: emr-7.0.0 ships Spark 3.5 rather than Spark 4.x; pair it with Iceberg 1.4+
aws emr create-cluster \
  --name "lakehouse-cluster" \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hadoop Name=Livy \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes SubnetId=subnet-xxx,KeyName=your-key \
  --bootstrap-actions Path=s3://your-lakehouse-bucket/scripts/bootstrap.sh \
  --configurations '[
    {
      "Classification": "iceberg-defaults",
      "Properties": {
        "iceberg.enabled": "true"
      }
    },
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.iceberg.type": "jdbc",
        "spark.sql.catalog.iceberg.uri": "jdbc:postgresql://<rds-endpoint>:5432/iceberg_catalog",
        "spark.sql.catalog.iceberg.jdbc.user": "lakehouse",
        "spark.sql.catalog.iceberg.jdbc.password": "YourSecurePassword123",
        "spark.sql.catalog.iceberg.warehouse": "s3://your-lakehouse-bucket/warehouse"
      }
    }
  ]'
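
Cluster creation returns a cluster ID immediately, while provisioning takes several minutes. A minimal sketch for waiting until the cluster is running and then looking up the master node's address; `master_dns` is a hypothetical helper name.

```bash
# Block until the cluster is running, then print the master node's DNS name.
# $1 is the cluster ID returned by `aws emr create-cluster` (for example j-XXXXX).
master_dns() {
  aws emr wait cluster-running --cluster-id "$1" &&
    aws emr describe-cluster --cluster-id "$1" \
      --query 'Cluster.MasterPublicDnsName' --output text
}
```

For a cluster in a private subnet, the value returned in `MasterPublicDnsName` is the private DNS name, so it is reachable only from inside the VPC (for example through the bastion or VPN mentioned earlier).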

4. Create MSK Cluster (Kafka) - Optional

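# Create MSK configuration (kafka-config.properties, broker-config.json, and encryption-config.json are local files you supply)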
aws kafka create-configuration \
  --name "lakehouse-kafka-config" \
  --kafka-versions "3.6.0" \
  --server-properties file://kafka-config.properties

# Create MSK cluster
aws kafka create-cluster \
  --cluster-name "lakehouse-kafka" \
  --broker-node-group-info file://broker-config.json \
  --kafka-version "3.6.0" \
  --number-of-broker-nodes 3 \
  --encryption-info file://encryption-config.json

For simpler setups, consider MSK Serverless:

aws kafka create-cluster-v2 \
  --cluster-name "lakehouse-kafka-serverless" \
  --serverless '{
    "vpcConfigs": [{
      "subnetIds": ["subnet-xxx", "subnet-yyy"],
      "securityGroupIds": ["sg-xxx"]
    }],
    "clientAuthentication": {
      "sasl": { "iam": { "enabled": true } }
    }
  }'
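
Once the cluster is active, clients need its bootstrap broker string. The sketch below wraps `aws kafka get-bootstrap-brokers` and picks the response field for the chosen authentication mode; the function name `bootstrap_brokers` and the mode names `plaintext`, `tls`, and `iam` are hypothetical.

```bash
# Print the bootstrap broker string for a cluster.
# $1 = cluster ARN, $2 = auth mode: plaintext | tls | iam (default: tls)
bootstrap_brokers() {
  case "${2:-tls}" in
    plaintext) field=BootstrapBrokerString ;;
    tls)       field=BootstrapBrokerStringTls ;;
    iam)       field=BootstrapBrokerStringSaslIam ;;
    *) echo "unknown auth mode: $2" >&2; return 1 ;;
  esac
  aws kafka get-bootstrap-brokers --cluster-arn "$1" \
    --query "$field" --output text
}
```

The serverless cluster above enables only IAM authentication, so `iam` is the mode to use there.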

Configuration Files

spark-defaults.conf (for AWS)

# Iceberg Catalog (JDBC to RDS)
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type=jdbc
spark.sql.catalog.iceberg.uri=jdbc:postgresql://<rds-endpoint>:5432/iceberg_catalog
spark.sql.catalog.iceberg.jdbc.user=lakehouse
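# Spark reads this file as plain properties, so replace ${POSTGRES_PASSWORD} with the real value at deploy time
spark.sql.catalog.iceberg.jdbc.password=${POSTGRES_PASSWORD}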
spark.sql.catalog.iceberg.warehouse=s3://your-lakehouse-bucket/warehouse

# S3 Configuration (applies to s3a:// paths; on EMR, s3:// paths go through EMRFS)
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain

# Performance tuning
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
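
As far as I can tell, Spark does not substitute `${POSTGRES_PASSWORD}` when it loads spark-defaults.conf, so the placeholder has to be filled in before the file is deployed. A minimal sketch that does this with POSIX awk, treating the file above as a template; the function name `render_template` and the file name `spark-defaults.conf.template` are hypothetical.

```bash
# Read a template on stdin and replace each ${NAME} token with the value of the
# environment variable NAME. Unset variables render as an empty string.
render_template() {
  awk '
    {
      line = $0
      out = ""
      # Splice values in with substr() so characters such as & or \ in a
      # value are copied literally rather than reinterpreted.
      while (match(line, /\$\{[A-Za-z_][A-Za-z0-9_]*\}/)) {
        name = substr(line, RSTART + 2, RLENGTH - 3)
        out = out substr(line, 1, RSTART - 1) ENVIRON[name]
        line = substr(line, RSTART + RLENGTH)
      }
      print out line
    }'
}

# Example (POSTGRES_PASSWORD must be exported first):
# render_template < spark-defaults.conf.template > spark-defaults.conf
```

The rendered file contains the password in plain text, so keep it out of version control and restrict its permissions.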

.env (for AWS)

# PostgreSQL (RDS)
POSTGRES_USER=lakehouse
POSTGRES_PASSWORD=YourSecurePassword123
POSTGRES_HOST=<rds-endpoint>  # Full hostname from describe-db-instances (already ends in .rds.amazonaws.com)
POSTGRES_PORT=5432

# S3 (native AWS)
S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
S3_ACCESS_KEY=  # Leave empty, use IAM roles
S3_SECRET_KEY=  # Leave empty, use IAM roles
S3_BUCKET=your-lakehouse-bucket
S3_WAREHOUSE=s3://your-lakehouse-bucket/warehouse

# Iceberg
ICEBERG_CATALOG_URI=jdbc:postgresql://${POSTGRES_HOST}:${POSTGRES_PORT}/iceberg_catalog
ICEBERG_WAREHOUSE=${S3_WAREHOUSE}

# Kafka (MSK): port 9092 is plaintext; use 9094 for TLS or 9098 for IAM auth (MSK Serverless)
KAFKA_BOOTSTRAP_SERVERS=<msk-bootstrap-servers>:9092

EMR Serverless (Recommended for Development)

EMR Serverless is more cost-effective for sporadic workloads:

# Create EMR Serverless application
aws emr-serverless create-application \
  --name lakehouse-spark \
  --release-label emr-7.0.0 \
  --type SPARK \
  --initial-capacity '{
    "DRIVER": {
      "workerCount": 1,
      "workerConfiguration": {
        "cpu": "2vCPU",
        "memory": "4GB"
      }
    },
    "EXECUTOR": {
      "workerCount": 2,
      "workerConfiguration": {
        "cpu": "2vCPU",
        "memory": "4GB"
      }
    }
  }'

# Submit a job
aws emr-serverless start-job-run \
  --application-id <app-id> \
  --execution-role-arn arn:aws:iam::xxx:role/EMRServerlessRole \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://your-lakehouse-bucket/scripts/my-job.py",
      "sparkSubmitParameters": "--conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog"
    }
  }'
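
The job submission above passes only the catalog class. A job that uses the JDBC catalog also needs the catalog type, URI, and warehouse settings from spark-defaults.conf. The sketch below builds them into a single string suitable for `sparkSubmitParameters`; the function name `iceberg_submit_params` is hypothetical, and the JDBC user and password are deliberately left out so they can be supplied separately.

```bash
# Build the --conf flags for the Iceberg JDBC catalog as one string.
# $1 = JDBC URI, $2 = warehouse location
iceberg_submit_params() {
  prefix="spark.sql.catalog.iceberg"
  printf '%s' "--conf ${prefix}=org.apache.iceberg.spark.SparkCatalog"
  printf ' %s' \
    "--conf ${prefix}.type=jdbc" \
    "--conf ${prefix}.uri=$1" \
    "--conf ${prefix}.warehouse=$2"
  printf '\n'
}

# Example:
# iceberg_submit_params "jdbc:postgresql://<rds-endpoint>:5432/iceberg_catalog" \
#   "s3://your-lakehouse-bucket/warehouse"
```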

IAM Roles and Policies

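EMR Service Role Policy

Attach this policy to the role your Spark jobs run under: the EC2 instance profile (EMR_EC2_DefaultRole when using --use-default-roles) for EMR on EC2, or the job execution role for EMR Serverless. The rds-db:connect statement only matters if IAM database authentication is enabled on the RDS instance; the password-based setup above does not use it.
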
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-lakehouse-bucket",
        "arn:aws:s3:::your-lakehouse-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "rds-db:connect"
      ],
      "Resource": [
        "arn:aws:rds-db:us-west-2:*:dbuser:*/lakehouse"
      ]
    }
  ]
}
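
To avoid hand-editing the bucket name, the S3 part of the policy can be generated. A minimal sketch; `s3_policy` is a hypothetical helper name and it emits only the S3 statement from the policy above.

```bash
# Print the S3 statement of the policy for a given bucket name.
s3_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::$1", "arn:aws:s3:::$1/*"]
    }
  ]
}
EOF
}

# Example:
# s3_policy your-lakehouse-bucket > policy.json
# aws iam put-role-policy --role-name EMR_EC2_DefaultRole \
#   --policy-name lakehouse-s3 --policy-document file://policy.json
```

The policy name `lakehouse-s3` in the example is a placeholder.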

Networking Considerations

VPC Setup

- Private Subnets: Run EMR, RDS, and MSK in private subnets
- NAT Gateway: Required for EMR nodes to download packages
- VPC Endpoints: Create a gateway endpoint for S3 so data-lake traffic bypasses the NAT gateway
- Security Groups:
  - EMR → RDS: Allow port 5432
  - EMR → MSK: Allow port 9092 (plaintext), 9094 (TLS), or 9098 (IAM auth), matching the cluster's client authentication
  - EMR → S3: Via VPC endpoint

# Create S3 VPC endpoint (saves NAT costs)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxx \
  --service-name com.amazonaws.us-west-2.s3 \
  --route-table-ids rtb-xxx
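
The security group rules listed above can be sketched with one small helper per rule. The function name `allow_from_group` and the group IDs in the example are hypothetical placeholders.

```bash
# Allow inbound TCP on one port from a source security group.
# $1 = target group (RDS or MSK), $2 = port, $3 = source group (EMR nodes)
allow_from_group() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port "$2" --source-group "$3"
}

# Example:
# allow_from_group sg-rds 5432 sg-emr   # EMR → RDS
# allow_from_group sg-msk 9092 sg-emr   # EMR → MSK
```

Referencing the EMR security group as the source, rather than a CIDR range, keeps the rule valid as cluster nodes come and go.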

Migrating from Local to AWS

Export local Iceberg tables (if needed):

# Snapshot local tables
spark-submit scripts/export-tables.py --output s3://your-lakehouse-bucket/migration/

Update configuration:
- Change .env to use RDS endpoint
- Change S3 endpoint to AWS S3
- Update spark-defaults.conf

Test connectivity:

# From EMR master node
psql -h <rds-endpoint> -U lakehouse -d iceberg_catalog -c "SELECT 1;"
aws s3 ls s3://your-lakehouse-bucket/warehouse/

Monitoring

CloudWatch Dashboards

Key metrics to monitor:

- EMR: AppsRunning, CoreNodesRunning, HDFSUtilization
- RDS: CPUUtilization, FreeStorageSpace, DatabaseConnections
- MSK: KafkaDataLogsDiskUsed, UnderReplicatedPartitions
- S3: BucketSizeBytes, NumberOfObjects
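
Any of these metrics can back a CloudWatch alarm. As an illustration, the sketch below creates an alarm on the catalog database's FreeStorageSpace; the function name `rds_low_storage_alarm`, the alarm naming scheme, and the 5-minute period are assumptions.

```bash
# Create a CloudWatch alarm that fires when an RDS instance's free storage
# drops below a threshold. $1 = DB instance identifier, $2 = threshold in bytes.
rds_low_storage_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "$1-low-storage" \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold "$2" \
    --comparison-operator LessThanThreshold
}

# Example: alarm when less than 2 GiB remains on the catalog instance.
# rds_low_storage_alarm lakehouse-catalog 2147483648
```

Without an `--alarm-actions` argument the alarm only changes state; add an SNS topic ARN there to receive notifications.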

Logging

# EMR logs → S3
aws emr create-cluster ... --log-uri s3://your-lakehouse-bucket/emr-logs/

# Enable RDS enhanced monitoring
aws rds modify-db-instance \
  --db-instance-identifier lakehouse-catalog \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::xxx:role/rds-monitoring-role

Cleanup

# Terminate EMR cluster
aws emr terminate-clusters --cluster-ids j-XXXXX

# Delete RDS instance (careful!)
aws rds delete-db-instance \
  --db-instance-identifier lakehouse-catalog \
  --skip-final-snapshot

# Delete MSK cluster
aws kafka delete-cluster --cluster-arn arn:aws:kafka:...

# Empty and delete S3 bucket
aws s3 rm s3://your-lakehouse-bucket --recursive
aws s3 rb s3://your-lakehouse-bucket

Next Steps

- See the terraform/ directory for Infrastructure as Code templates
- Consider AWS Glue Data Catalog as an alternative to the JDBC catalog
- For production, set up CI/CD with CodePipeline or GitHub Actions
