Infrastructure Setup Guide

This document provides comprehensive instructions for setting up and managing the CommNG infrastructure on AWS using Terraform and GitHub Actions.

Architecture Overview

The CommNG application is deployed on AWS using:

  • ECS Fargate: Serverless container orchestration for server (Node.js 24) and web (Next.js 15)
  • Application Load Balancer (ALB): Routes traffic to appropriate services
    • /api/* and /trpc/* → Server service
    • All other routes → Web service
  • Auto Scaling: Automatically scales based on CPU, memory, and request count
    • Min: 1 task per service
    • Max: 10 tasks per service
    • Target: 70% CPU, 80% memory, 1000 requests/target
  • ECR: Docker image registry
  • RDS PostgreSQL: Database (db.t3.micro, 20GB)
  • ElastiCache Valkey: Redis-compatible cache
  • S3: File storage
  • CloudWatch: Logs and monitoring

Application Routing & Configuration

ALB Path-Based Routing:

  • /api/* and /trpc/* → Server (Node.js:3000)
  • /* (Default) → Web (Next.js:3001)

tRPC Configuration:

  • Web: Uses NEXT_PUBLIC_API_BASE_URL to construct the tRPC endpoint.
    • Local: http://localhost:3000/api/trpc
    • Prod: http://<alb-dns>/api/trpc (routed by ALB to server)
  • Server: Listens on /api/trpc at port 3000.
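A quick way to sanity-check this wiring is to build the endpoint the same way the web app does and probe the server. This is a sketch; the base URL is the local example above, and the /health path assumes the server's health check endpoint described later in this guide.

```shell
# Build the tRPC endpoint from the same variable the web app reads.
NEXT_PUBLIC_API_BASE_URL="http://localhost:3000"
TRPC_ENDPOINT="${NEXT_PUBLIC_API_BASE_URL}/api/trpc"
echo "tRPC endpoint: $TRPC_ENDPOINT"

# Probe the server directly (non-fatal if it is not running yet).
curl -fsS "${NEXT_PUBLIC_API_BASE_URL}/health" || echo "server not reachable"
```

In production, swap the base URL for `http://<alb-dns>` and the ALB's path rules deliver the same request to the server service.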

Resource Specifications

Fargate Tasks (smallest available configuration):

  • CPU: 0.25 vCPU (256 CPU units)
  • Memory: 512 MB
  • Cost-effective for variable traffic patterns

Auto-scaling Behavior:

  • Scales up quickly (60s cooldown) when load increases
  • Scales down slowly (300s cooldown) to prevent flapping
  • Multiple metrics (CPU, memory, requests) trigger scaling
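The registered limits and policies can be confirmed from the CLI. The cluster and service names below follow the dev naming used elsewhere in this guide; substitute yours from the Terraform outputs.

```shell
# Show the registered min/max capacity for the server service.
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/dev-comm-ng-cluster/dev-comm-ng-server-service

# Show the attached target-tracking policies (CPU, memory, request count).
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/dev-comm-ng-cluster/dev-comm-ng-server-service
```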

Prerequisites

Local Development

  1. Terraform (>= 1.5.0)

    brew install terraform
  2. AWS CLI

    brew install awscli
    aws --version
  3. Docker

    brew install --cask docker
  4. AWS Account Setup

    • Active AWS account with appropriate permissions
    • IAM user with programmatic access

Required AWS Permissions

Create an IAM user with these managed policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonECS_FullAccess
  • AmazonRDSFullAccess
  • AmazonElastiCacheFullAccess
  • AmazonS3FullAccess
  • IAMFullAccess (for creating roles)
  • AmazonVPCFullAccess
  • ElasticLoadBalancingFullAccess
  • CloudWatchLogsFullAccess
  • SecretsManagerReadWrite
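Attaching ten policies by hand in the console is tedious; a loop like this does it in one pass. The user name is a placeholder — pick one that fits your naming convention.

```shell
# Create the Terraform/deploy user (name is a placeholder) and attach
# each AWS-managed policy from the list above.
USER_NAME="comm-ng-terraform"
aws iam create-user --user-name "$USER_NAME"

for policy in \
  AmazonEC2ContainerRegistryFullAccess AmazonECS_FullAccess \
  AmazonRDSFullAccess AmazonElastiCacheFullAccess AmazonS3FullAccess \
  IAMFullAccess AmazonVPCFullAccess ElasticLoadBalancingFullAccess \
  CloudWatchLogsFullAccess SecretsManagerReadWrite; do
  aws iam attach-user-policy --user-name "$USER_NAME" \
    --policy-arn "arn:aws:iam::aws:policy/$policy"
done
```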

Terraform Setup

Initial Configuration

  1. Configure AWS Credentials

    aws configure

    Enter:

    • AWS Access Key ID
    • AWS Secret Access Key
    • Default region: us-east-1
    • Default output format: json
  2. Navigate to Infrastructure Directory

    cd infra
  3. Initialize Terraform

    This downloads provider plugins and sets up the backend:

    terraform init

File Structure

infra/
├── provider.tf          # Terraform & AWS provider configuration
├── variables.tf         # All configurable variables with descriptions
├── locals.tf            # Local values and computed variables
├── data.tf              # Data sources (VPC, subnets, etc.)
├── networking.tf        # Security groups, ALB, target groups
├── database.tf          # RDS PostgreSQL and ElastiCache
├── secrets.tf           # Secrets Manager secrets
├── storage.tf           # S3 buckets, ECR repositories
├── ecs.tf               # ECS cluster, services, task definitions
├── iam.tf               # IAM roles and policies
├── monitoring.tf        # CloudWatch logs and EventBridge
├── scheduler.tf         # Infrastructure scheduler Lambda
├── outputs.tf           # Output values
├── terraform.tfvars     # Checked-in dev defaults (can be copied)
├── terraform.tfvars.dev.example   # Dev template for new environments
└── terraform.tfvars.prod.example  # Prod environment template
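To stand up a new environment, start from the matching template and edit the values before running Terraform:

```shell
cd infra

# New dev environment: start from the dev template.
cp terraform.tfvars.dev.example terraform.tfvars

# For production, use the prod template instead:
# cp terraform.tfvars.prod.example terraform.tfvars
```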

Planning Changes

Before applying changes, always review the execution plan:

terraform plan

This shows:

  • Resources to be created (green +)
  • Resources to be modified (yellow ~)
  • Resources to be destroyed (red -)

Save a plan for later application:

terraform plan -out=tfplan

Applying Changes

Option 1: Apply directly

terraform apply

Review the plan and type yes to confirm.

Option 2: Apply a saved plan

terraform apply tfplan

Auto-approve (use with caution):

terraform apply -auto-approve

HTTPS & Domain Setup (ACM)

If you have a custom domain, you can set up SSL/TLS using AWS Certificate Manager (ACM).

  1. Configure Your Domain Edit terraform.tfvars and add your domain name:

    domain_name = "dev.yourdomain.com"
  2. Apply Terraform

    terraform apply
  3. Get DNS Validation CNAME Record Terraform will output the CNAME record needed for DNS validation:

    terraform output acm_certificate_validation_records
  4. Add CNAME to DNS Add the output CNAME record to your DNS provider (e.g., Route53, GoDaddy, Cloudflare).

  5. Wait for Validation AWS will automatically validate the certificate once the DNS record propagates.

Initial Deployment

  1. Apply Infrastructure

    cd infra
    terraform init
    terraform plan
    terraform apply
  2. Note Important Outputs

    After successful apply, Terraform will output:

    • alb_dns_name - Your application URL
    • ecr_server_repository_url - Server ECR URL
    • ecr_web_repository_url - Web ECR URL
    • db_instance_endpoint - Database endpoint
    • cache_endpoint - Valkey/Redis endpoint
    • ecs_cluster_name - ECS cluster name
    • ecs_server_service_name - Server service name
    • ecs_web_service_name - Web service name
    • vapid_keys_secret_arn - VAPID keys secret ARN (needs manual population)

    Save these values for GitHub Actions configuration.

  3. Set Up VAPID Keys for Push Notifications

    Generate and store VAPID keys in AWS Secrets Manager:

    # Generate VAPID keys
    npx web-push generate-vapid-keys
    
    # Store in Secrets Manager
    aws secretsmanager put-secret-value \
      --secret-id dev/comm-ng/vapid-keys \
      --secret-string '{
        "publicKey": "YOUR_VAPID_PUBLIC_KEY",
        "privateKey": "YOUR_VAPID_PRIVATE_KEY",
        "contactEmail": "mailto:admin@yourdomain.com"
      }'

    See SECRETS-SETUP.md for detailed instructions.

  4. Build and Push Initial Docker Images

    Before ECS services can run, you need initial images in ECR:

    # Get ECR login
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.us-east-1.amazonaws.com
    
    # Build and push server image
    cd ../server
    docker build -t comm-ng-server .
    docker tag comm-ng-server:latest <ecr_server_repository_url>:latest
    docker push <ecr_server_repository_url>:latest
    
    # Build and push web image
    cd ../web
    docker build -t comm-ng-web .
    docker tag comm-ng-web:latest <ecr_web_repository_url>:latest
    docker push <ecr_web_repository_url>:latest
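The login/build/tag/push sequence in step 4 can be wrapped in one short script. This sketch assumes the ECR repositories are named comm-ng-server and comm-ng-web (matching the image names above) and that it runs from the repository root.

```shell
# Build, tag, and push both images in one pass.
REGION="us-east-1"
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

# Authenticate Docker against ECR.
aws ecr get-login-password --region "$REGION" |
  docker login --username AWS --password-stdin "$REGISTRY"

for svc in server web; do
  docker build -t "comm-ng-${svc}" "./${svc}"
  docker tag "comm-ng-${svc}:latest" "${REGISTRY}/comm-ng-${svc}:latest"
  docker push "${REGISTRY}/comm-ng-${svc}:latest"
done
```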

Viewing State

List all resources:

terraform state list

Show specific resource details:

terraform state show aws_ecs_service.server

View outputs:

terraform output
terraform output alb_dns_name

Destroying Infrastructure

⚠️ Warning: This will delete ALL resources

terraform destroy

Review the destruction plan carefully before typing yes.

Common Terraform Commands

# Format Terraform files
terraform fmt

# Validate configuration
terraform validate

# Show current state
terraform show

# Refresh state from AWS
terraform refresh

# Target specific resource
terraform apply -target=aws_ecs_service.server

# View dependency graph
terraform graph | dot -Tpng > graph.png

Versioning Strategy

The deployment workflows automatically manage application versions using semantic versioning:

Version Bumping Rules

  • Main branch deployments: Bump minor version

    • Example: 1.0.5 → 1.1.0
    • Use for: Production releases, feature deployments
  • Non-main branch deployments: Bump patch version

    • Example: 1.0.5 → 1.0.6
    • Use for: Development deployments, bug fixes, testing

How It Works

  1. When you trigger a deployment, the workflow:

    • Checks out your specified branch
    • Runs npm version minor (main) or npm version patch (others)
    • Updates package.json and package-lock.json
    • Commits with message: chore(server|web): bump version to X.Y.Z [skip ci]
    • Pushes the commit to your branch
    • Continues with build and deployment
  2. The [skip ci] tag prevents the commit from triggering another workflow run

  3. Version is displayed in deployment summary
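You can preview both bump rules locally in a scratch directory; this only touches a temp directory, and --no-git-tag-version keeps npm from creating commits or tags.

```shell
# Try the two bump rules the workflow applies.
cd "$(mktemp -d)" && npm init -y >/dev/null
npm version 1.0.5 --no-git-tag-version
npm version minor --no-git-tag-version   # main branch rule:     1.0.5 -> 1.1.0
npm version patch --no-git-tag-version   # non-main branch rule: 1.1.0 -> 1.1.1
```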

Manual Version Management

If you need to bump major version or set a specific version:

# In server/ or web/ directory
npm version major      # 1.0.0 → 2.0.0
npm version 2.5.3      # Set to specific version

git add package.json package-lock.json
git commit -m "chore: bump version to X.Y.Z"
git push

Then deploy normally - the workflow will bump from your new base version.

GitHub Actions Setup

You will need to create an IAM user for GitHub Actions and attach a policy like the one below. (The JSON is shown as the output of aws iam get-policy-version; the policy document itself is the inner Document object.)

{
    "PolicyVersion": {
        "Document": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "ecr:GetAuthorizationToken",
                        "ecr:BatchCheckLayerAvailability",
                        "ecr:GetDownloadUrlForLayer",
                        "ecr:BatchGetImage",
                        "ecr:PutImage",
                        "ecr:InitiateLayerUpload",
                        "ecr:UploadLayerPart",
                        "ecr:CompleteLayerUpload",
                        "ecr:DescribeRepositories",
                        "ecr:ListImages"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "ecs:DescribeTaskDefinition",
                        "ecs:RegisterTaskDefinition",
                        "ecs:UpdateService",
                        "ecs:DescribeServices"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "iam:PassRole"
                    ],
                    "Resource": "*",
                    "Condition": {
                        "StringLike": {
                            "iam:PassedToService": "ecs-tasks.amazonaws.com"
                        }
                    }
                }
            ]
        },
        "VersionId": "v1",
        "IsDefaultVersion": true,
        "CreateDate": "2025-11-04T17:13:03+00:00"
    }
}
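Once the inner Document object above is saved to a file, the policy can be created and attached from the CLI. The policy and user names below are placeholders.

```shell
# Assumes the inner "Document" object above was saved as gh-actions-policy.json.
aws iam create-policy \
  --policy-name comm-ng-github-actions-deploy \
  --policy-document file://gh-actions-policy.json

# Attach it to the CI user (user and policy names are placeholders).
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
aws iam attach-user-policy \
  --user-name comm-ng-github-actions \
  --policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/comm-ng-github-actions-deploy"
```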

Step 1: Create GitHub Secrets

  1. Go to your repository on GitHub
  2. Navigate to Settings → Secrets and variables → Actions
  3. Click New repository secret

Add the following secrets:

Secret Name | Value | Description
--- | --- | ---
AWS_ACCESS_KEY_ID | Your AWS Access Key | IAM user access key
AWS_SECRET_ACCESS_KEY | Your AWS Secret Key | IAM user secret key
DEPLOY_KEY | SSH Private Key | SSH private key for git operations (version bumping)

Setting up the DEPLOY_KEY

The DEPLOY_KEY is required for the workflow to push version bumps back to the repository.

  1. Generate an SSH Key Pair:

    ssh-keygen -t ed25519 -C "github-actions" -f gh-deploy-key -N ""
  2. Add Public Key to Repository:

    • Go to Settings → Deploy keys
    • Click Add deploy key
    • Title: GitHub Actions Deploy Key
    • Key: Paste content of gh-deploy-key.pub
    • Check "Allow write access" (Crucial for version bumping)
    • Click Add key
  3. Add Private Key to Secrets:

    • Go to Settings → Secrets and variables → Actions
    • Click New repository secret
    • Name: DEPLOY_KEY
    • Value: Paste content of gh-deploy-key (the private key)
    • Click Add secret

Step 2: Create GitHub Environments

  1. Go to Settings → Environments

  2. Create three environments:

    • dev
    • staging (optional)
    • production (optional)
  3. For each environment, configure:

    • Protection rules (optional):
      • Required reviewers for production
      • Wait timer
    • Environment secrets (if different from repo secrets)

Step 3: Verify Workflow Files

The workflows are located at:

  • .github/workflows/deploy-server.yml - Deploys Node.js backend
  • .github/workflows/deploy-web.yml - Deploys Next.js frontend

Both workflows:

  • Trigger manually via workflow_dispatch
  • Accept an environment input (dev/staging/production)
  • Build Docker images
  • Push to ECR
  • Deploy to ECS with zero-downtime rolling updates

Deployment Guide

Pre-Deployment Checklist

Before triggering a deployment, ensure:

  • AWS Account: Active, IAM user created, CLI configured.
  • Local Tools: Terraform (>=1.5.0), Docker, Node.js 24+ installed.
  • App Config:
    • Server listens on port 3000, Web on 3001.
    • DATABASE_URL and REDIS_AUTH handled via Secrets Manager.
    • NODE_ENV=production set in task definitions.
  • Terraform: terraform plan runs without errors.
  • GitHub:
    • Secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DEPLOY_KEY) configured.
    • Environments (dev, staging, production) created.

Manual Deployment via GitHub Actions

  1. Navigate to Actions tab in your GitHub repository

  2. Deploy Server:

    • Select "Deploy Server to ECS" workflow
    • Click "Run workflow"
    • Enter branch to deploy from (e.g., main, feature/ecs-deployment)
    • Choose environment (dev, staging, or production)
    • Click "Run workflow"
    • The workflow will automatically:
      • Bump version (minor for main, patch for other branches)
      • Commit and push the version change
      • Build and deploy
  3. Deploy Web:

    • Select "Deploy Web to ECS" workflow
    • Click "Run workflow"
    • Enter branch and select environment
    • Click "Run workflow"
    • Version will be automatically bumped and committed
  4. Monitor Deployment:

    • Click on the running workflow to see live logs
    • Each step shows progress
    • Final step shows deployment summary with:
      • Service name
      • Cluster name
      • Image tag
      • Commit SHA

Deployment Process

The GitHub Actions workflows perform:

  1. Checkout code - Gets latest code from specified branch
  2. Configure Git - Sets up git credentials for version commits
  3. Bump version - Updates package.json version:
    • Main branch: Minor version bump (1.0.0 → 1.1.0)
    • Other branches: Patch version bump (1.0.0 → 1.0.1)
  4. Commit & push - Commits version change with [skip ci] to avoid loops
  5. Configure AWS - Authenticates with AWS using secrets
  6. Login to ECR - Authenticates Docker with ECR
  7. Build Docker image - Builds your application container
  8. Tag images - Tags with commit SHA and latest
  9. Push to ECR - Uploads images to container registry
  10. Download task definition - Gets current ECS task config
  11. Update task definition - Inserts new image reference
  12. Deploy to ECS - Triggers rolling update
  13. Wait for stability - Ensures deployment succeeds

Rolling Updates

ECS performs zero-downtime deployments:

  1. Launches new tasks with updated image
  2. Waits for new tasks to pass health checks
  3. Drains connections from old tasks
  4. Terminates old tasks
  5. Auto-scaling adjusts to traffic during deployment
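If you want to block until a rolling update settles (for example, in a script), the built-in ECS waiter covers it; service names below come from the dev outputs.

```shell
# Returns once both services reach a steady state; fails after the
# waiter's default polling window (~10 minutes) if they never stabilize.
aws ecs wait services-stable \
  --cluster dev-comm-ng-cluster \
  --services dev-comm-ng-server-service dev-comm-ng-web-service
```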

First Deployment Notes

After Terraform creates the infrastructure:

  1. Initial State: ECS services will fail to start because no images exist
  2. Fix: Run the GitHub Actions workflows OR manually push images (see Initial Deployment, step 4)
  3. Subsequent Deployments: Use GitHub Actions exclusively

Monitoring and Troubleshooting

AWS Console Access

ECS Service Status:

AWS Console → ECS → Clusters → dev-comm-ng-cluster → Services

View Logs:

AWS Console → CloudWatch → Log Groups
- /ecs/dev-comm-ng-server
- /ecs/dev-comm-ng-web

Load Balancer Health:

AWS Console → EC2 → Load Balancers → dev-comm-ng-alb → Target Groups

CLI Monitoring

Check service status:

aws ecs describe-services \
  --cluster dev-comm-ng-cluster \
  --services dev-comm-ng-server-service dev-comm-ng-web-service

View recent logs (server):

aws logs tail /ecs/dev-comm-ng-server --follow --since 10m

View recent logs (web):

aws logs tail /ecs/dev-comm-ng-web --follow --since 10m

Check task status:

aws ecs list-tasks --cluster dev-comm-ng-cluster --service-name dev-comm-ng-server-service

Describe a specific task:

aws ecs describe-tasks \
  --cluster dev-comm-ng-cluster \
  --tasks <task-arn>
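When a task has already exited, the stop reason usually pinpoints the failure; a --query keeps the output readable. Stopped tasks can be found with list-tasks using --desired-status STOPPED.

```shell
# Show why a stopped task exited (image pull failure, health check, OOM, ...).
TASK_ARN="<task-arn>"   # e.g. from: aws ecs list-tasks ... --desired-status STOPPED
aws ecs describe-tasks \
  --cluster dev-comm-ng-cluster \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].[lastStatus, stoppedReason]'
```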

Common Issues

1. ECS Tasks Failing Health Checks

Symptoms: Tasks start but are marked unhealthy and terminated

Solution:

  • Check application logs in CloudWatch
  • Verify health check endpoint exists:
    • Server: GET /health should return 200
    • Web: GET / should return 200
  • Ensure application listens on correct port (3000 for server, 3001 for web)
  • Check environment variables are set correctly
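Before redeploying, it is worth reproducing the health check against a locally running container; this sketch uses the server image and port from the configuration above.

```shell
# Run the server image locally and hit the same endpoint the ALB probes.
docker run -d --rm --name comm-ng-health-check -p 3000:3000 comm-ng-server
sleep 5
curl -fsS -o /dev/null -w '%{http_code}\n' http://localhost:3000/health \
  || echo "health check failed"
docker stop comm-ng-health-check
```

For the web image, substitute port 3001 and probe / instead of /health.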

2. "Unable to pull image" Error

Symptoms: Task fails with ECR authentication error

Solution:

# Verify image exists
aws ecr describe-images --repository-name comm-ng-server

# Ensure ECS task execution role has ECR permissions
# (Already configured in Terraform)

3. Database Connection Failures

Symptoms: Application logs show database connection errors

Solution:

  • Verify RDS is running: terraform state show aws_db_instance.dev_db_comm_ng
  • Check security groups allow traffic from ECS tasks
  • Verify DATABASE_URL secret is correctly configured
  • Test the connection from inside an ECS task (requires ECS Exec to be enabled on the service):
    aws ecs execute-command \
      --cluster dev-comm-ng-cluster \
      --task <task-id> \
      --container server \
      --interactive \
      --command "/bin/sh"

4. Auto-scaling Not Working

Symptoms: Service doesn't scale despite high load

Solution:

  • Check CloudWatch alarms for scaling policies:
    aws application-autoscaling describe-scaling-activities \
      --service-namespace ecs \
      --resource-id service/dev-comm-ng-cluster/dev-comm-ng-server-service
  • Verify metrics are being published
  • Check cooldown periods haven't been triggered recently

5. 502 Bad Gateway from ALB

Symptoms: Load balancer returns 502 errors

Causes:

  • No healthy targets in target group
  • Application not responding on expected port
  • Health check failing

Solution:

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

# Review ALB access logs
# Enable ALB logging in Terraform if needed

Accessing Application

After deployment, access your application:

# Get ALB DNS name
terraform output alb_dns_name

# Test endpoints
curl http://<alb-dns-name>/
curl http://<alb-dns-name>/api/health

Cost Monitoring

View estimated costs:

AWS Console → Billing → Bills

Key cost factors:

  • ECS Fargate: Based on vCPU and memory per second
  • RDS: db.t3.micro instance hours
  • ElastiCache: Valkey storage and compute
  • ALB: Per hour + data processed
  • Data transfer: Outbound data

Cost optimization:

  • Auto-scaling reduces costs during low traffic
  • Consider Reserved Instances for production
  • Review CloudWatch logs retention (currently 7 days)

Infrastructure Updates

Updating ECS Task Configuration

  1. Edit the task definitions in infra/ecs.tf
  2. Run terraform plan to review changes
  3. Run terraform apply
  4. ECS will automatically deploy updated task definitions

Scaling Configuration

To change auto-scaling limits:

# In ecs.tf, modify:
resource "aws_appautoscaling_target" "server" {
  max_capacity       = 20  # Increase max capacity
  min_capacity       = 2   # Set minimum baseline
  # ...
}

Updating Docker Images

Images are updated through GitHub Actions. Manual updates:

# Build new image
docker build -t <ecr-url>:v2.0 .

# Push to ECR
docker push <ecr-url>:v2.0

# Update ECS service
aws ecs update-service \
  --cluster dev-comm-ng-cluster \
  --service dev-comm-ng-server-service \
  --force-new-deployment

Security Best Practices

  1. Secrets Management:

    • Never commit AWS credentials
    • Use AWS Secrets Manager for sensitive data
    • Rotate credentials regularly
  2. Network Security:

    • ECS tasks run in default VPC
    • Security groups restrict traffic
    • Consider moving to private subnets for production
  3. IAM Permissions:

    • Follow principle of least privilege
    • Use separate IAM roles for different environments
    • Enable MFA for AWS console access
  4. Image Security:

    • Enable ECR image scanning (already configured)
    • Review scan results before deployment
    • Keep base images updated
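Scan results can be reviewed from the CLI as well as the console; the repository name matches the one used earlier in this guide.

```shell
# Summarize the latest scan findings for the most recent server image.
aws ecr describe-image-scan-findings \
  --repository-name comm-ng-server \
  --image-id imageTag=latest \
  --query 'imageScanFindings.findingSeverityCounts'
```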

Cheat Sheet / Quick Reference

Common Commands

Terraform:

cd infra
terraform init      # Initialize
terraform plan      # Preview changes
terraform apply     # Apply changes
terraform output    # View outputs

AWS CLI:

# View logs
aws logs tail /ecs/dev-comm-ng-server --follow --since 10m

# Force new deployment
aws ecs update-service --cluster dev-comm-ng-cluster --service dev-comm-ng-server-service --force-new-deployment

# Check service status
aws ecs describe-services --cluster dev-comm-ng-cluster --services dev-comm-ng-server-service

Docker:

# Build locally
docker build -t comm-ng-server ./server

# Manual push (if needed)
aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag comm-ng-server:latest <ecr-url>:latest
docker push <ecr-url>:latest

Support

For infrastructure issues:

  1. Check this documentation
  2. Review CloudWatch logs
  3. Check AWS Service Health Dashboard
  4. Open an issue in the repository

Last Updated: November 2, 2025