This guide covers deploying the lakehouse stack on AWS using managed services. The same patterns apply to other cloud providers (GCP, Azure) with equivalent services.
| Local Component | AWS Equivalent | Notes |
|---|---|---|
| PostgreSQL 16 | Amazon RDS PostgreSQL | Managed, multi-AZ available |
| SeaweedFS | Amazon S3 | Native S3, no emulation needed |
| Spark 4.x | Amazon EMR | emr-7.0.0 ships Spark 3.5 (see the cluster setup); or EMR Serverless for on-demand |
| Kafka | Amazon MSK | Or MSK Serverless |
| Docker containers | ECS/EKS | Optional for custom workloads |

                    ┌─────────────────────────────────────┐
                    │             Amazon VPC              │
                    │                                     │
┌───────────┐       │  ┌─────────────┐  ┌─────────────┐   │
│  Client   │───────┼──│ EMR Cluster │  │ MSK Cluster │   │
└───────────┘       │  │ (Spark 3.5) │  │   (Kafka)   │   │
                    │  └──────┬──────┘  └──────┬──────┘   │
                    │         │                │          │
                    │         ▼                ▼          │
                    │  ┌──────────────────────────────┐   │
                    │  │    Amazon S3 (Data Lake)     │   │
                    │  │  s3://your-bucket/warehouse  │   │
                    │  └──────────────────────────────┘   │
                    │                                     │
                    │  ┌──────────────────────────────┐   │
                    │  │        RDS PostgreSQL        │   │
                    │  │      (Iceberg Catalog)       │   │
                    │  └──────────────────────────────┘   │
                    └─────────────────────────────────────┘

| Service | Configuration | Monthly Cost |
|---|---|---|
| RDS PostgreSQL | db.t3.micro, 20GB | ~$15 |
| S3 | 100GB storage + requests | ~$3 |
| EMR | 1 master + 2 core (m5.xlarge), 8hr/day | ~$200 |
| MSK | kafka.t3.small, 3 brokers | ~$150 |
| Development Total | | ~$370 |

Cost Optimization Tips:
- Use EMR Serverless for sporadic workloads (pay per use)
- Use Spot instances for EMR workers (60-80% savings)
- Use MSK Serverless for low-throughput streaming
- Schedule EMR clusters to shut down outside work hours
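
As a rough check on the estimate above, the sketch below sums the table's line items and shows how two of the tips might change the EMR figure. It assumes that EMR cost scales linearly with cluster hours and that the two core nodes account for two thirds of the EMR line; neither assumption comes from the table, and the function names are illustrative.

```python
# Monthly estimates (USD) copied from the cost table above.
BASELINE = {
    "RDS PostgreSQL": 15,
    "S3": 3,
    "EMR": 200,
    "MSK": 150,
}


def total(costs):
    """Sum all line items."""
    return sum(costs.values())


def emr_with_spot(emr_cost, spot_savings, worker_share=2 / 3):
    """EMR line item if the worker share of the bill gets a Spot discount.

    worker_share is an assumption: 2 of the 3 m5.xlarge nodes are core nodes.
    """
    on_demand_part = emr_cost * (1 - worker_share)
    spot_part = emr_cost * worker_share * (1 - spot_savings)
    return on_demand_part + spot_part


def emr_with_hours(emr_cost, hours_per_day, baseline_hours=8):
    """EMR line item at a different daily runtime, assuming linear scaling."""
    return emr_cost * hours_per_day / baseline_hours


if __name__ == "__main__":
    print("baseline total:", total(BASELINE))
    for savings in (0.6, 0.8):
        estimate = round(emr_with_spot(BASELINE["EMR"], savings))
        print(f"EMR with {savings:.0%} Spot savings: ~${estimate}")
    print("EMR at 4 hr/day:", emr_with_hours(BASELINE["EMR"], 4))
```

The baseline sums to $368, which the table rounds to ~$370. Treat the Spot and runtime figures as order-of-magnitude guides only, since actual Spot prices and EMR surcharges vary by region and over time.
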
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install
# Configure credentials
aws configure
# Enter: AWS Access Key ID, Secret Access Key, Region (e.g., us-west-2)
# Install Terraform (optional, for IaC)
brew install terraform # macOS
# or see https://terraform.io/downloads

# Create bucket
aws s3 mb s3://your-lakehouse-bucket --region us-west-2
# Create warehouse directory structure
aws s3api put-object --bucket your-lakehouse-bucket --key warehouse/

# Create DB subnet group (use your VPC subnets)
aws rds create-db-subnet-group \
--db-subnet-group-name lakehouse-db-subnet \
--db-subnet-group-description "Lakehouse DB subnets" \
--subnet-ids subnet-xxx subnet-yyy
# Create RDS instance
aws rds create-db-instance \
--db-instance-identifier lakehouse-catalog \
--db-instance-class db.t3.micro \
--engine postgres \
--engine-version 16 \
--master-username lakehouse \
--master-user-password YourSecurePassword123 \
--allocated-storage 20 \
--db-subnet-group-name lakehouse-db-subnet \
--vpc-security-group-ids sg-xxx \
--no-publicly-accessible
# Wait for instance to be available
aws rds wait db-instance-available --db-instance-identifier lakehouse-catalog
# Get endpoint
aws rds describe-db-instances \
--db-instance-identifier lakehouse-catalog \
--query 'DBInstances[0].Endpoint.Address' --output text

Create the Iceberg catalog database:

# Connect via bastion or VPN
psql -h <rds-endpoint> -U lakehouse -d postgres
CREATE DATABASE iceberg_catalog;
\q

Create a bootstrap script for Iceberg support:

#!/bin/bash
# Save as bootstrap.sh (uploaded to s3://your-lakehouse-bucket/scripts/bootstrap.sh below)
sudo pip3 install pyiceberg

# Upload bootstrap script
aws s3 cp bootstrap.sh s3://your-lakehouse-bucket/scripts/
# Create EMR cluster with Spark 3.5 + Iceberg
# Note: EMR doesn't yet support Spark 4.x, use 3.5 with Iceberg 1.4+
aws emr create-cluster \
--name "lakehouse-cluster" \
--release-label emr-7.0.0 \
--applications Name=Spark Name=Hadoop Name=Livy \
--instance-type m5.xlarge \
--instance-count 3 \
--use-default-roles \
--ec2-attributes SubnetId=subnet-xxx,KeyName=your-key \
--bootstrap-actions Path=s3://your-lakehouse-bucket/scripts/bootstrap.sh \
--configurations '[
{
"Classification": "iceberg-defaults",
"Properties": {
"iceberg.enabled": "true"
}
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.sql.catalog.iceberg": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.iceberg.type": "jdbc",
"spark.sql.catalog.iceberg.uri": "jdbc:postgresql://<rds-endpoint>:5432/iceberg_catalog",
"spark.sql.catalog.iceberg.jdbc.user": "lakehouse",
"spark.sql.catalog.iceberg.jdbc.password": "YourSecurePassword123",
"spark.sql.catalog.iceberg.warehouse": "s3://your-lakehouse-bucket/warehouse"
}
}
]'

# Create MSK configuration
aws kafka create-configuration \
--name "lakehouse-kafka-config" \
--kafka-versions "3.6.0" \
--server-properties file://kafka-config.properties
# Create MSK cluster
aws kafka create-cluster \
--cluster-name "lakehouse-kafka" \
--broker-node-group-info file://broker-config.json \
--kafka-version "3.6.0" \
--number-of-broker-nodes 3 \
--encryption-info file://encryption-config.json

For simpler setups, consider MSK Serverless:

aws kafka create-cluster-v2 \
--cluster-name "lakehouse-kafka-serverless" \
--serverless '{
"vpcConfigs": [{
"subnetIds": ["subnet-xxx", "subnet-yyy"],
"securityGroupIds": ["sg-xxx"]
}],
"clientAuthentication": {
"sasl": { "iam": { "enabled": true } }
}
}'

# Iceberg Catalog (JDBC to RDS)
spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type=jdbc
spark.sql.catalog.iceberg.uri=jdbc:postgresql://<rds-endpoint>:5432/iceberg_catalog
spark.sql.catalog.iceberg.jdbc.user=lakehouse
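# Note: Spark does not expand ${POSTGRES_PASSWORD} in spark-defaults.conf; substitute the real value at deploy time
spark.sql.catalog.iceberg.jdbc.password=${POSTGRES_PASSWORD}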
spark.sql.catalog.iceberg.warehouse=s3://your-lakehouse-bucket/warehouse
# S3 Configuration
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
# Performance tuning
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true

# PostgreSQL (RDS)
POSTGRES_USER=lakehouse
POSTGRES_PASSWORD=YourSecurePassword123
# Use the full endpoint address returned by describe-db-instances
POSTGRES_HOST=<rds-endpoint>
POSTGRES_PORT=5432
# S3 (native AWS)
S3_ENDPOINT=https://s3.us-west-2.amazonaws.com
S3_ACCESS_KEY= # Leave empty, use IAM roles
S3_SECRET_KEY= # Leave empty, use IAM roles
S3_BUCKET=your-lakehouse-bucket
S3_WAREHOUSE=s3://your-lakehouse-bucket/warehouse
# Iceberg
ICEBERG_CATALOG_URI=jdbc:postgresql://${POSTGRES_HOST}:${POSTGRES_PORT}/iceberg_catalog
ICEBERG_WAREHOUSE=${S3_WAREHOUSE}
# Kafka (MSK)
KAFKA_BOOTSTRAP_SERVERS=<msk-bootstrap-servers>:9092

EMR Serverless is more cost-effective for sporadic workloads:

# Create EMR Serverless application
aws emr-serverless create-application \
--name lakehouse-spark \
--release-label emr-7.0.0 \
--type SPARK \
--initial-capacity '{
"DRIVER": {
"workerCount": 1,
"workerConfiguration": {
"cpu": "2vCPU",
"memory": "4GB"
}
},
"EXECUTOR": {
"workerCount": 2,
"workerConfiguration": {
"cpu": "2vCPU",
"memory": "4GB"
}
}
}'
# Submit a job
aws emr-serverless start-job-run \
--application-id <app-id> \
--execution-role-arn arn:aws:iam::xxx:role/EMRServerlessRole \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://your-lakehouse-bucket/scripts/my-job.py",
"sparkSubmitParameters": "--conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog"
}
}'

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-lakehouse-bucket",
"arn:aws:s3:::your-lakehouse-bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"rds-db:connect"
],
"Resource": [
"arn:aws:rds-db:us-west-2:*:dbuser:*/lakehouse"
]
}
]
}

- Private Subnets: Run EMR, RDS, and MSK in private subnets
- NAT Gateway: Required for EMR nodes to download packages
- VPC Endpoints: Create endpoints for S3 to avoid NAT costs
- Security Groups:
  - EMR → RDS: Allow port 5432
  - EMR → MSK: Allow port 9092 (plaintext); MSK listens on 9094 for TLS and 9098 for IAM authentication
  - EMR → S3: Via VPC endpoint
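
The security group rules above can be written down as data before they are applied. The sketch below builds ingress entries in the `IpPermissions` shape that boto3's `authorize_security_group_ingress` accepts. The group IDs are placeholders and the helper name is illustrative; the code only builds dictionaries and makes no AWS calls.

```python
import json


def ingress_from_group(port, source_group_id, description):
    """Allow TCP traffic on one port from members of another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [
            {"GroupId": source_group_id, "Description": description}
        ],
    }


# Placeholder: the security group attached to the EMR nodes.
EMR_SG = "sg-emr-placeholder"

# Rules keyed by the (placeholder) security group that receives the traffic.
RULES = {
    "sg-rds-placeholder": [ingress_from_group(5432, EMR_SG, "EMR to RDS PostgreSQL")],
    "sg-msk-placeholder": [ingress_from_group(9092, EMR_SG, "EMR to MSK brokers")],
}

if __name__ == "__main__":
    print(json.dumps(RULES, indent=2))
```

Each list could then be passed as `IpPermissions` together with the target `GroupId` to `authorize_security_group_ingress`. Referencing the EMR security group rather than a CIDR range keeps the rules valid when node IP addresses change.
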
# Create S3 VPC endpoint (saves NAT costs)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-xxx \
--service-name com.amazonaws.us-west-2.s3 \
--route-table-ids rtb-xxx

- Export local Iceberg tables (if needed):

      # Snapshot local tables
      spark-submit scripts/export-tables.py --output s3://your-bucket/migration/

- Update configuration:
  - Change `.env` to use RDS endpoint
  - Change S3 endpoint to AWS S3
  - Update spark-defaults.conf

- Test connectivity:

      # From EMR master node
      psql -h <rds-endpoint> -U lakehouse -d iceberg_catalog -c "SELECT 1;"
      aws s3 ls s3://your-lakehouse-bucket/warehouse/

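
The configuration update listed above could be scripted so that `.env` remains the single source of settings. The sketch below parses `.env`-style text and renders the Iceberg catalog lines for `spark-defaults.conf`. The variable and property names follow the examples in this guide; the parser and function names are illustrative, the sample values are placeholders, and only simple `KEY=VALUE` lines with `${VAR}` references are handled.

```python
import re


def parse_env(text):
    """Parse simple KEY=VALUE lines, expanding ${VAR} from earlier keys."""
    env = {}
    for raw in text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Replace ${NAME} with a previously defined value (empty if unknown).
        value = re.sub(r"\$\{(\w+)\}", lambda m: env.get(m.group(1), ""), value)
        env[key.strip()] = value.strip()
    return env


def spark_catalog_lines(env, catalog="iceberg"):
    """Render the JDBC catalog properties used in this guide."""
    prefix = f"spark.sql.catalog.{catalog}"
    return [
        f"{prefix}=org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type=jdbc",
        f"{prefix}.uri={env['ICEBERG_CATALOG_URI']}",
        f"{prefix}.jdbc.user={env['POSTGRES_USER']}",
        f"{prefix}.jdbc.password={env['POSTGRES_PASSWORD']}",
        f"{prefix}.warehouse={env['ICEBERG_WAREHOUSE']}",
    ]


# Placeholder values for illustration only.
SAMPLE_ENV = """
POSTGRES_USER=lakehouse
POSTGRES_PASSWORD=example-password
POSTGRES_HOST=catalog.example.internal
S3_WAREHOUSE=s3://your-lakehouse-bucket/warehouse
ICEBERG_CATALOG_URI=jdbc:postgresql://${POSTGRES_HOST}:5432/iceberg_catalog
ICEBERG_WAREHOUSE=${S3_WAREHOUSE}
"""

if __name__ == "__main__":
    print("\n".join(spark_catalog_lines(parse_env(SAMPLE_ENV))))
```

Rendering the file at deploy time also sidesteps the question of whether Spark expands variable references inside `spark-defaults.conf`, because the written file contains literal values.
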
Key metrics to monitor:
- EMR: `AppsRunning`, `CoreNodesRunning`, `HDFSUtilization`
- RDS: `CPUUtilization`, `FreeStorageSpace`, `DatabaseConnections`
- MSK: `KafkaDataLogsDiskUsed`, `UnderReplicatedPartitions`
- S3: `BucketSizeBytes`, `NumberOfObjects`

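
One way to act on these metrics is a CloudWatch alarm per metric. The sketch below builds keyword arguments in the shape that boto3's `put_metric_alarm` expects for a few of the RDS and EMR metrics. The thresholds, alarm names, and cluster ID are placeholder assumptions rather than recommendations from this guide, and no AWS call is made.

```python
def alarm(name, namespace, metric, dimensions, threshold, comparison):
    """Build kwargs for a 5-minute average alarm evaluated over three periods."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold),
        "ComparisonOperator": comparison,
    }


RDS_DIMS = {"DBInstanceIdentifier": "lakehouse-catalog"}
EMR_DIMS = {"JobFlowId": "j-XXXXX"}  # placeholder cluster ID

# Thresholds are illustrative assumptions; tune them for your workload.
ALARMS = [
    alarm("rds-cpu-high", "AWS/RDS", "CPUUtilization",
          RDS_DIMS, 80, "GreaterThanThreshold"),
    # FreeStorageSpace is reported in bytes; 2 GiB of the 20 GB volume.
    alarm("rds-storage-low", "AWS/RDS", "FreeStorageSpace",
          RDS_DIMS, 2 * 1024**3, "LessThanThreshold"),
    alarm("emr-hdfs-high", "AWS/ElasticMapReduce", "HDFSUtilization",
          EMR_DIMS, 80, "GreaterThanThreshold"),
]

if __name__ == "__main__":
    for kwargs in ALARMS:
        print(kwargs["AlarmName"], kwargs["MetricName"], kwargs["Threshold"])
```

Each entry could be applied with `put_metric_alarm(**kwargs)` on a boto3 CloudWatch client, typically with an `AlarmActions` entry added so that the alarm notifies someone.
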
# EMR logs → S3
aws emr create-cluster ... --log-uri s3://your-bucket/emr-logs/
# Enable RDS enhanced monitoring
aws rds modify-db-instance \
--db-instance-identifier lakehouse-catalog \
--monitoring-interval 60 \
--monitoring-role-arn arn:aws:iam::xxx:role/rds-monitoring-role

# Terminate EMR cluster
aws emr terminate-clusters --cluster-ids j-XXXXX
# Delete RDS instance (careful!)
aws rds delete-db-instance \
--db-instance-identifier lakehouse-catalog \
--skip-final-snapshot
# Delete MSK cluster
aws kafka delete-cluster --cluster-arn arn:aws:kafka:...
# Empty and delete S3 bucket
aws s3 rm s3://your-lakehouse-bucket --recursive
aws s3 rb s3://your-lakehouse-bucket

- See the `terraform/` directory for Infrastructure as Code templates
- Consider AWS Glue Data Catalog as an alternative to the JDBC catalog
- For production, implement proper CI/CD with CodePipeline or GitHub Actions
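
For the Glue Data Catalog option, only the catalog properties would change; the rest of the Spark configuration can stay as it is. The sketch below renders those properties using Iceberg's `GlueCatalog` and `S3FileIO` classes. The catalog name and bucket follow this guide, the function name is illustrative, and the cluster's IAM role is assumed to have Glue permissions, which the policy shown in this guide does not grant.

```python
def glue_catalog_conf(warehouse, catalog="iceberg"):
    """Spark properties for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def as_properties(conf):
    """Render a dict as spark-defaults.conf style lines."""
    return "\n".join(f"{key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    conf = glue_catalog_conf("s3://your-lakehouse-bucket/warehouse")
    print(as_properties(conf))
```

With this variant the `type`, `uri`, and `jdbc.*` properties are dropped, so no database password needs to reach the cluster configuration.
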