This document provides comprehensive instructions for setting up and managing the CommNG infrastructure on AWS using Terraform and GitHub Actions.
- Architecture Overview
- Prerequisites
- Terraform Setup
- GitHub Actions Setup
- Deployment Guide
- Monitoring and Troubleshooting
The CommNG application is deployed on AWS using:
- ECS Fargate: Serverless container orchestration for `server` (Node.js 24) and `web` (Next.js 15)
- Application Load Balancer (ALB): Routes traffic to the appropriate service
  - `/api/*` and `/trpc/*` → Server service
  - All other routes → Web service
- Auto Scaling: Automatically scales based on CPU, memory, and request count
  - Min: 1 task per service
  - Max: 10 tasks per service
  - Target: 70% CPU, 80% memory, 1000 requests/target
- ECR: Docker image registry
- RDS PostgreSQL: Database (db.t3.micro, 20GB)
- ElastiCache Valkey: Redis-compatible cache
- S3: File storage
- CloudWatch: Logs and monitoring
ALB Path-Based Routing:
- `/api/*` and `/trpc/*` → Server (Node.js, port 3000)
- `/*` (default) → Web (Next.js, port 3001)
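As a toy illustration (not the actual ALB implementation), the path-based rules above can be modeled as a small shell function:

```shell
# Toy model of the ALB listener rules: first-match on path prefix.
route() {
  case "$1" in
    /api/*|/trpc/*) echo "server:3000" ;;  # API and tRPC traffic → server service
    *)              echo "web:3001" ;;     # everything else → web service
  esac
}

route /api/health    # server:3000
route /dashboard     # web:3001
```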
tRPC Configuration:
- Web: Uses `NEXT_PUBLIC_API_BASE_URL` to construct the tRPC endpoint.
  - Local: `http://localhost:3000/api/trpc`
  - Prod: `http://<alb-dns>/api/trpc` (routed by the ALB to the server)
- Server: Listens on `/api/trpc` at port 3000.
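For illustration, here is how the tRPC endpoint is derived from `NEXT_PUBLIC_API_BASE_URL` — a sketch of the convention described above, not the actual client code in the web app:

```shell
# Base URL differs per environment; /api/trpc is appended to reach the server.
NEXT_PUBLIC_API_BASE_URL="http://localhost:3000"    # local; prod uses http://<alb-dns>
TRPC_ENDPOINT="${NEXT_PUBLIC_API_BASE_URL}/api/trpc"
echo "$TRPC_ENDPOINT"   # http://localhost:3000/api/trpc
```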
Fargate Tasks (Lowest Configuration):
- CPU: 0.25 vCPU (256 CPU units)
- Memory: 512 MB
- Cost-effective for variable traffic patterns
Auto-scaling Behavior:
- Scales up quickly (60s cooldown) when load increases
- Scales down slowly (300s cooldown) to prevent flapping
- Multiple metrics (CPU, memory, requests) trigger scaling
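The target-tracking behavior can be pictured with a toy decision function. This is purely illustrative — the real evaluation is done by CloudWatch alarms and Application Auto Scaling, and the 63% scale-in threshold below is an assumption, not a configured value:

```shell
# Toy sketch: scale out above the 70% CPU target; scale in with headroom below it.
scale_decision() {
  cpu="$1"               # current average CPU utilization (percent)
  if [ "$cpu" -gt 70 ]; then
    echo "scale-out"     # above target → add tasks (60s cooldown)
  elif [ "$cpu" -lt 63 ]; then
    echo "scale-in"      # well below target → remove tasks (300s cooldown)
  else
    echo "steady"
  fi
}

scale_decision 85   # scale-out
```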
- Terraform (>= 1.5.0)

  ```bash
  brew install terraform
  ```

- AWS CLI

  ```bash
  brew install awscli
  aws --version
  ```

- Docker

  ```bash
  brew install --cask docker
  ```
- AWS Account Setup

  - Active AWS account with appropriate permissions
  - IAM user with programmatic access

  Create an IAM user with these managed policies:

  - `AmazonEC2ContainerRegistryFullAccess`
  - `AmazonECS_FullAccess`
  - `AmazonRDSFullAccess`
  - `AmazonElastiCacheFullAccess`
  - `AmazonS3FullAccess`
  - `IAMFullAccess` (for creating roles)
  - `AmazonVPCFullAccess`
  - `ElasticLoadBalancingFullAccess`
  - `CloudWatchLogsFullAccess`
  - `SecretsManagerReadWrite`
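Attaching the managed policies listed above can be scripted. The sketch below only prints the policy ARNs — the `aws iam attach-user-policy` call is left commented out, and the user name is hypothetical:

```shell
IAM_USER="comm-ng-deployer"   # hypothetical user name; substitute your own

for policy in \
  AmazonEC2ContainerRegistryFullAccess \
  AmazonECS_FullAccess \
  AmazonRDSFullAccess \
  AmazonElastiCacheFullAccess \
  AmazonS3FullAccess \
  IAMFullAccess \
  AmazonVPCFullAccess \
  ElasticLoadBalancingFullAccess \
  CloudWatchLogsFullAccess \
  SecretsManagerReadWrite
do
  # AWS managed policies live under the aws:policy namespace
  arn="arn:aws:iam::aws:policy/${policy}"
  echo "would attach: ${arn}"
  # aws iam attach-user-policy --user-name "$IAM_USER" --policy-arn "$arn"
done
```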
- Configure AWS Credentials

  ```bash
  aws configure
  ```

  Enter:
  - AWS Access Key ID
  - AWS Secret Access Key
  - Default region: `us-east-1`
  - Default output format: `json`
1. Navigate to the Infrastructure Directory

   ```bash
   cd infra
   ```

2. Initialize Terraform

   This downloads provider plugins and sets up the backend:

   ```bash
   terraform init
   ```
```text
infra/
├── provider.tf                   # Terraform & AWS provider configuration
├── variables.tf                  # All configurable variables with descriptions
├── locals.tf                     # Local values and computed variables
├── data.tf                       # Data sources (VPC, subnets, etc.)
├── networking.tf                 # Security groups, ALB, target groups
├── database.tf                   # RDS PostgreSQL and ElastiCache
├── secrets.tf                    # Secrets Manager secrets
├── storage.tf                    # S3 buckets, ECR repositories
├── ecs.tf                        # ECS cluster, services, task definitions
├── iam.tf                        # IAM roles and policies
├── monitoring.tf                 # CloudWatch logs and EventBridge
├── scheduler.tf                  # Infrastructure scheduler Lambda
├── outputs.tf                    # Output values
├── terraform.tfvars              # Checked-in dev defaults (can be copied)
├── terraform.tfvars.dev.example  # Dev template for new environments
└── terraform.tfvars.prod.example # Prod environment template
```
Before applying changes, always review the execution plan:

```bash
terraform plan
```

This shows:

- Resources to be created (green `+`)
- Resources to be modified (yellow `~`)
- Resources to be destroyed (red `-`)

Save a plan for later application:

```bash
terraform plan -out=tfplan
```

Option 1: Apply directly

```bash
terraform apply
```

Review the plan and type `yes` to confirm.

Option 2: Apply a saved plan

```bash
terraform apply tfplan
```

Auto-approve (use with caution):

```bash
terraform apply -auto-approve
```

If you have a custom domain, you can set up SSL/TLS using AWS Certificate Manager (ACM).
1. Configure Your Domain

   Edit `terraform.tfvars` and add your domain name:

   ```hcl
   domain_name = "dev.yourdomain.com"
   ```

2. Apply Terraform

   ```bash
   terraform apply
   ```

3. Get the DNS Validation CNAME Record

   Terraform will output the CNAME record needed for DNS validation:

   ```bash
   terraform output acm_certificate_validation_records
   ```

4. Add the CNAME to DNS

   Add the output CNAME record to your DNS provider (e.g., Route 53, GoDaddy, Cloudflare).

5. Wait for Validation

   AWS will automatically validate the certificate once the DNS record propagates.
1. Apply Infrastructure

   ```bash
   cd infra
   terraform init
   terraform plan
   terraform apply
   ```

2. Note Important Outputs

   After a successful apply, Terraform will output:

   - `alb_dns_name` - Your application URL
   - `ecr_server_repository_url` - Server ECR URL
   - `ecr_web_repository_url` - Web ECR URL
   - `db_instance_endpoint` - Database endpoint
   - `cache_endpoint` - Valkey/Redis endpoint
   - `ecs_cluster_name` - ECS cluster name
   - `ecs_server_service_name` - Server service name
   - `ecs_web_service_name` - Web service name
   - `vapid_keys_secret_arn` - VAPID keys secret ARN (needs manual population)

   Save these values for the GitHub Actions configuration.
3. Set Up VAPID Keys for Push Notifications

   Generate and store VAPID keys in AWS Secrets Manager:

   ```bash
   # Generate VAPID keys
   npx web-push generate-vapid-keys

   # Store in Secrets Manager
   aws secretsmanager put-secret-value \
     --secret-id dev/comm-ng/vapid-keys \
     --secret-string '{
       "publicKey": "YOUR_VAPID_PUBLIC_KEY",
       "privateKey": "YOUR_VAPID_PRIVATE_KEY",
       "contactEmail": "mailto:admin@yourdomain.com"
     }'
   ```

   See SECRETS-SETUP.md for detailed instructions.
4. Build and Push Initial Docker Images

   Before the ECS services can run, you need initial images in ECR:

   ```bash
   # Get ECR login
   aws ecr get-login-password --region us-east-1 | \
     docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.us-east-1.amazonaws.com

   # Build and push server image
   cd ../server
   docker build -t comm-ng-server .
   docker tag comm-ng-server:latest <ecr_server_repository_url>:latest
   docker push <ecr_server_repository_url>:latest

   # Build and push web image
   cd ../web
   docker build -t comm-ng-web .
   docker tag comm-ng-web:latest <ecr_web_repository_url>:latest
   docker push <ecr_web_repository_url>:latest
   ```
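The registry URL used above follows a fixed pattern. This sketch derives it from placeholder values (the account ID is not real), with the actual login/build/push commands left commented out:

```shell
AWS_ACCOUNT_ID="123456789012"   # placeholder; use your own account ID
AWS_REGION="us-east-1"
REPO="comm-ng-server"

# ECR registry hostname: <account-id>.dkr.ecr.<region>.amazonaws.com
ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
ECR_URL="${ECR_REGISTRY}/${REPO}"
echo "$ECR_URL"   # 123456789012.dkr.ecr.us-east-1.amazonaws.com/comm-ng-server

# aws ecr get-login-password --region "$AWS_REGION" \
#   | docker login --username AWS --password-stdin "$ECR_REGISTRY"
# docker build -t "$REPO" . && docker tag "$REPO:latest" "$ECR_URL:latest" && docker push "$ECR_URL:latest"
```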
List all resources:

```bash
terraform state list
```

Show specific resource details:

```bash
terraform state show aws_ecs_service.server
```

View outputs:

```bash
terraform output
terraform output alb_dns_name
```

Destroy the infrastructure:

```bash
terraform destroy
```

Review the destruction plan carefully before typing `yes`.
```bash
# Format Terraform files
terraform fmt

# Validate configuration
terraform validate

# Show current state
terraform show

# Refresh state from AWS
terraform refresh

# Target specific resource
terraform apply -target=aws_ecs_service.server

# View dependency graph
terraform graph | dot -Tpng > graph.png
```

The deployment workflows automatically manage application versions using semantic versioning:
- Main branch deployments: bump the minor version
  - Example: `1.0.5` → `1.1.0`
  - Use for: production releases, feature deployments
- Non-main branch deployments: bump the patch version
  - Example: `1.0.5` → `1.0.6`
  - Use for: development deployments, bug fixes, testing
When you trigger a deployment, the workflow:

1. Checks out your specified branch
2. Runs `npm version minor` (main) or `npm version patch` (other branches)
3. Updates `package.json` and `package-lock.json`
4. Commits with the message `chore(server|web): bump version to X.Y.Z [skip ci]`
5. Pushes the commit to your branch
6. Continues with the build and deployment

The `[skip ci]` tag prevents the commit from triggering another workflow run, and the new version is displayed in the deployment summary.
If you need to bump the major version or set a specific version:

```bash
# In server/ or web/ directory
npm version major    # 1.0.0 → 2.0.0
npm version 2.5.3    # Set to a specific version
git add package.json package-lock.json
git commit -m "chore: bump version to X.Y.Z"
git push
```

Then deploy normally - the workflow will bump from your new base version.
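The branch-based bump rule (minor on `main`, patch elsewhere) can be sketched in plain shell. This is illustrative only — the workflow actually delegates the bump to `npm version`:

```shell
# Bump a semver string according to the branch rule described above.
bump_version() {
  ver="$1"; branch="$2"
  major="${ver%%.*}"                  # text before the first dot
  rest="${ver#*.}"                    # text after the first dot
  minor="${rest%%.*}"; patch="${rest#*.}"
  if [ "$branch" = "main" ]; then
    echo "${major}.$((minor + 1)).0"        # minor bump; patch resets to 0
  else
    echo "${major}.${minor}.$((patch + 1))" # patch bump
  fi
}

bump_version 1.0.5 main        # 1.1.0
bump_version 1.0.5 feature/x   # 1.0.6
```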
You will need to create an IAM user for GitHub Actions with a policy like the following:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:ListImages"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeTaskDefinition",
        "ecs:RegisterTaskDefinition",
        "ecs:UpdateService",
        "ecs:DescribeServices"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": "*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "ecs-tasks.amazonaws.com"
        }
      }
    }
  ]
}
```

To add the repository secrets:

- Go to your repository on GitHub
- Navigate to Settings → Secrets and variables → Actions
- Click New repository secret
Add the following secrets:
| Secret Name | Value | Description |
|---|---|---|
| `AWS_ACCESS_KEY_ID` | Your AWS Access Key | IAM user access key |
| `AWS_SECRET_ACCESS_KEY` | Your AWS Secret Key | IAM user secret key |
| `DEPLOY_KEY` | SSH private key | SSH key for git operations (version bumping) |
The `DEPLOY_KEY` secret is required for the workflow to push version bumps back to the repository.
1. Generate an SSH Key Pair:

   ```bash
   ssh-keygen -t ed25519 -C "github-actions" -f gh-deploy-key -N ""
   ```

2. Add the Public Key to the Repository:

   - Go to Settings → Deploy keys
   - Click Add deploy key
   - Title: `GitHub Actions Deploy Key`
   - Key: paste the content of `gh-deploy-key.pub`
   - Check "Allow write access" (crucial for version bumping)
   - Click Add key

3. Add the Private Key to Secrets:

   - Go to Settings → Secrets and variables → Actions
   - Click New repository secret
   - Name: `DEPLOY_KEY`
   - Value: paste the content of `gh-deploy-key` (the private key)
   - Click Add secret
1. Go to Settings → Environments

2. Create three environments:

   - `dev`
   - `staging` (optional)
   - `production` (optional)

3. For each environment, configure:

   - Protection rules (optional):
     - Required reviewers for production
     - Wait timer
   - Environment secrets (if different from the repository secrets)
The workflows are located at:

- `.github/workflows/deploy-server.yml` - Deploys the Node.js backend
- `.github/workflows/deploy-web.yml` - Deploys the Next.js frontend

Both workflows:

- Trigger manually via `workflow_dispatch`
- Accept an `environment` input (dev/staging/production)
- Build Docker images
- Push to ECR
- Deploy to ECS with zero-downtime rolling updates
Before triggering a deployment, ensure:

- AWS Account: active, IAM user created, CLI configured.
- Local Tools: Terraform (>= 1.5.0), Docker, and Node.js 24+ installed.
- App Config:
  - Server listens on port 3000, Web on 3001.
  - `DATABASE_URL` and `REDIS_AUTH` handled via Secrets Manager.
  - `NODE_ENV=production` set in the task definitions.
- Terraform: `terraform plan` runs without errors.
- GitHub:
  - Secrets (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `DEPLOY_KEY`) configured.
  - Environments (`dev`, `staging`, `production`) created.
1. Navigate to the Actions tab in your GitHub repository

2. Deploy Server:

   - Select the "Deploy Server to ECS" workflow
   - Click "Run workflow"
   - Enter the branch to deploy from (e.g., `main`, `feature/ecs-deployment`)
   - Choose an environment (`dev`, `staging`, or `production`)
   - Click "Run workflow"
   - The workflow will automatically:
     - Bump the version (minor for `main`, patch for other branches)
     - Commit and push the version change
     - Build and deploy

3. Deploy Web:

   - Select the "Deploy Web to ECS" workflow
   - Click "Run workflow"
   - Enter the branch and select an environment
   - Click "Run workflow"
   - The version will be automatically bumped and committed

4. Monitor Deployment:

   - Click on the running workflow to see live logs
   - Each step shows progress
   - The final step shows a deployment summary with:
     - Service name
     - Cluster name
     - Image tag
     - Commit SHA
The GitHub Actions workflows perform:

1. Checkout code - Gets the latest code from the specified branch
2. Configure Git - Sets up git credentials for version commits
3. Bump version - Updates the `package.json` version:
   - Main branch: minor version bump (1.0.0 → 1.1.0)
   - Other branches: patch version bump (1.0.0 → 1.0.1)
4. Commit & push - Commits the version change with `[skip ci]` to avoid loops
5. Configure AWS - Authenticates with AWS using secrets
6. Login to ECR - Authenticates Docker with ECR
7. Build Docker image - Builds your application container
8. Tag images - Tags with the commit SHA and `latest`
9. Push to ECR - Uploads images to the container registry
10. Download task definition - Gets the current ECS task config
11. Update task definition - Inserts the new image reference
12. Deploy to ECS - Triggers a rolling update
13. Wait for stability - Ensures the deployment succeeds
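The "Tag images" step typically derives its tag from the commit SHA. A sketch under that assumption — the SHA below is a dummy value, and the docker commands are commented out:

```shell
GITHUB_SHA="0123456789abcdef0123456789abcdef01234567"   # dummy; GitHub Actions provides the real one
IMAGE_TAG="$(printf '%s' "$GITHUB_SHA" | cut -c1-7)"    # short SHA used as the image tag
echo "$IMAGE_TAG"   # 0123456

# docker tag comm-ng-server:latest "<ecr-url>:$IMAGE_TAG"
# docker tag comm-ng-server:latest "<ecr-url>:latest"
```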
ECS performs zero-downtime deployments:
- Launches new tasks with updated image
- Waits for new tasks to pass health checks
- Drains connections from old tasks
- Terminates old tasks
- Auto-scaling adjusts to traffic during deployment
After Terraform creates the infrastructure:
- Initial State: ECS services will fail to start because no images exist
- Fix: Run the GitHub Actions workflows OR manually push images (see "Build and Push Initial Docker Images" under Terraform Setup)
- Subsequent Deployments: Use GitHub Actions exclusively
ECS Service Status:
AWS Console → ECS → Clusters → dev-comm-ng-cluster → Services
View Logs:
AWS Console → CloudWatch → Log Groups
- /ecs/dev-comm-ng-server
- /ecs/dev-comm-ng-web
Load Balancer Health:
AWS Console → EC2 → Load Balancers → dev-comm-ng-alb → Target Groups
Check service status:

```bash
aws ecs describe-services \
  --cluster dev-comm-ng-cluster \
  --services dev-comm-ng-server-service dev-comm-ng-web-service
```

View recent logs (server):

```bash
aws logs tail /ecs/dev-comm-ng-server --follow --since 10m
```

View recent logs (web):

```bash
aws logs tail /ecs/dev-comm-ng-web --follow --since 10m
```

Check task status:

```bash
aws ecs list-tasks --cluster dev-comm-ng-cluster --service-name dev-comm-ng-server-service
```

Describe a specific task:

```bash
aws ecs describe-tasks \
  --cluster dev-comm-ng-cluster \
  --tasks <task-arn>
```

Symptoms: Tasks start but are marked unhealthy and terminated
Solution:

1. Check the application logs in CloudWatch
2. Verify the health check endpoint exists:
   - Server: `GET /health` should return 200
   - Web: `GET /` should return 200
3. Ensure the application listens on the correct port (3000 for server, 3001 for web)
4. Check that environment variables are set correctly
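A quick probe of those endpoints can be scripted. The helper below treats any 2xx status as healthy — an assumption mirroring typical ALB health-check settings — with the actual curl calls commented out:

```shell
# Return success for any 2xx HTTP status code.
is_healthy() { case "$1" in 2??) return 0 ;; *) return 1 ;; esac; }

# Usage against a running task (ports from the checklist above):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000/health)
# is_healthy "$code" && echo "server healthy"
is_healthy 200 && echo "200 is healthy"
```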
Symptoms: Task fails with ECR authentication error
Solution:
```bash
# Verify the image exists
aws ecr describe-images --repository-name comm-ng-server

# Ensure the ECS task execution role has ECR permissions
# (already configured in Terraform)
```

Symptoms: Application logs show database connection errors
Solution:
1. Verify RDS is running: `terraform state show aws_db_instance.dev_db_comm_ng`
2. Check that security groups allow traffic from the ECS tasks
3. Verify the `DATABASE_URL` secret is correctly configured
4. Test the connection from an ECS task:

   ```bash
   aws ecs execute-command \
     --cluster dev-comm-ng-cluster \
     --task <task-id> \
     --container server \
     --interactive \
     --command "/bin/sh"
   ```
Symptoms: Service doesn't scale despite high load
Solution:
1. Check the CloudWatch alarms for the scaling policies:

   ```bash
   aws application-autoscaling describe-scaling-activities \
     --service-namespace ecs \
     --resource-id service/dev-comm-ng-cluster/dev-comm-ng-server-service
   ```

2. Verify that metrics are being published
3. Check whether cooldown periods have been triggered recently
Symptoms: Load balancer returns 502 errors
Causes:
- No healthy targets in target group
- Application not responding on expected port
- Health check failing
Solution:
```bash
# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

# Review ALB access logs
# Enable ALB logging in Terraform if needed
```

After deployment, access your application:
```bash
# Get the ALB DNS name
terraform output alb_dns_name

# Test endpoints
curl http://<alb-dns-name>/
curl http://<alb-dns-name>/api/health
```

View estimated costs:
AWS Console → Billing → Bills
Key cost factors:
- ECS Fargate: Based on vCPU and memory per second
- RDS: db.t3.micro instance hours
- ElastiCache: Valkey storage and compute
- ALB: Per hour + data processed
- Data transfer: Outbound data
Cost optimization:
- Auto-scaling reduces costs during low traffic
- Consider Reserved Instances for production
- Review CloudWatch logs retention (currently 7 days)
1. Edit the task definitions in `infra/ecs.tf`
2. Run `terraform plan` to review the changes
3. Run `terraform apply`
4. ECS will automatically deploy the updated task definitions
To change auto-scaling limits:
```hcl
# In ecs.tf, modify:
resource "aws_appautoscaling_target" "server" {
  max_capacity = 20  # Increase max capacity
  min_capacity = 2   # Set minimum baseline
  # ...
}
```

Images are updated through GitHub Actions. Manual updates:
```bash
# Build a new image
docker build -t <ecr-url>:v2.0 .

# Push to ECR
docker push <ecr-url>:v2.0

# Update the ECS service
aws ecs update-service \
  --cluster dev-comm-ng-cluster \
  --service dev-comm-ng-server-service \
  --force-new-deployment
```

- Secrets Management:
  - Never commit AWS credentials
  - Use AWS Secrets Manager for sensitive data
  - Rotate credentials regularly
- Network Security:
  - ECS tasks run in the default VPC
  - Security groups restrict traffic
  - Consider moving to private subnets for production
- IAM Permissions:
  - Follow the principle of least privilege
  - Use separate IAM roles for different environments
  - Enable MFA for AWS console access
- Image Security:
  - Enable ECR image scanning (already configured)
  - Review scan results before deployment
  - Keep base images updated
Terraform:

```bash
cd infra
terraform init     # Initialize
terraform plan     # Preview changes
terraform apply    # Apply changes
terraform output   # View outputs
```

AWS CLI:

```bash
# View logs
aws logs tail /ecs/dev-comm-ng-server --follow --since 10m

# Force a new deployment
aws ecs update-service --cluster dev-comm-ng-cluster --service dev-comm-ng-server-service --force-new-deployment

# Check service status
aws ecs describe-services --cluster dev-comm-ng-cluster --services dev-comm-ng-server-service
```

Docker:

```bash
# Build locally
docker build -t comm-ng-server ./server

# Manual push (if needed)
aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag comm-ng-server:latest <ecr-url>:latest
docker push <ecr-url>:latest
```

- Terraform AWS Provider Documentation
- AWS ECS Best Practices
- GitHub Actions Documentation
- AWS CLI Reference
For infrastructure issues:
- Check this documentation
- Review CloudWatch logs
- Check AWS Service Health Dashboard
- Open an issue in the repository
Last Updated: November 2, 2025