Infrastructure Setup Guide

This document provides comprehensive instructions for setting up and managing the CommNG infrastructure on AWS using Terraform and GitHub Actions.

Architecture Overview

The CommNG application is deployed on AWS using:

  • ECS Fargate: Serverless container orchestration for server (Node.js 24) and web (Next.js 15)
  • Application Load Balancer (ALB): Routes traffic to appropriate services
    • /api/* and /trpc/* → Server service
    • All other routes → Web service
  • Auto Scaling: Automatically scales based on CPU, memory, and request count
    • Min: 1 task per service
    • Max: 10 tasks per service
    • Target: 70% CPU, 80% memory, 1000 requests/target
  • ECR: Docker image registry
  • RDS PostgreSQL: Database (db.t3.micro, 20GB)
  • ElastiCache Valkey: Redis-compatible cache
  • S3: File storage
  • CloudWatch: Logs and monitoring

Application Routing & Configuration

ALB Path-Based Routing:

  • /api/* and /trpc/* → Server (Node.js:3000)
  • /* (Default) → Web (Next.js:3001)

tRPC Configuration:

  • Web: Uses NEXT_PUBLIC_API_BASE_URL to construct the tRPC endpoint.
    • Local: http://localhost:3000/api/trpc
    • Prod: http://<alb-dns>/api/trpc (routed by ALB to server)
  • Server: Listens on /api/trpc at port 3000.
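A quick way to sanity-check this wiring is to build the endpoint the same way the web app does and probe the server. This is a sketch; the base URL is the local example above, and the /health path assumes the server's health check endpoint described later in this guide.

```shell
# Build the tRPC endpoint from the same variable the web app reads.
NEXT_PUBLIC_API_BASE_URL="http://localhost:3000"
TRPC_ENDPOINT="${NEXT_PUBLIC_API_BASE_URL}/api/trpc"
echo "tRPC endpoint: $TRPC_ENDPOINT"

# Probe the server directly (non-fatal if it is not running yet).
curl -fsS "${NEXT_PUBLIC_API_BASE_URL}/health" || echo "server not reachable"
```

In production, swap the base URL for `http://<alb-dns>` and the ALB's path rules deliver the same request to the server service.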

Resource Specifications

Fargate Tasks (smallest available configuration):

  • CPU: 0.25 vCPU (256 CPU units)
  • Memory: 512 MB
  • Cost-effective for variable traffic patterns

Auto-scaling Behavior:

  • Scales up quickly (60s cooldown) when load increases
  • Scales down slowly (300s cooldown) to prevent flapping
  • Multiple metrics (CPU, memory, requests) trigger scaling
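The registered limits and policies can be confirmed from the CLI. The cluster and service names below follow the dev naming used elsewhere in this guide; substitute yours from the Terraform outputs.

```shell
# Show the registered min/max capacity for the server service.
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/dev-comm-ng-cluster/dev-comm-ng-server-service

# Show the attached target-tracking policies (CPU, memory, request count).
aws application-autoscaling describe-scaling-policies \
  --service-namespace ecs \
  --resource-id service/dev-comm-ng-cluster/dev-comm-ng-server-service
```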

Prerequisites

Local Development

  1. Terraform (>= 1.5.0)

    brew install terraform
  2. AWS CLI

    brew install awscli
    aws --version
  3. Docker

    brew install --cask docker
  4. AWS Account Setup

    • Active AWS account with appropriate permissions
    • IAM user with programmatic access

Required AWS Permissions

Create an IAM user with these managed policies:

  • AmazonEC2ContainerRegistryFullAccess
  • AmazonECS_FullAccess
  • AmazonRDSFullAccess
  • AmazonElastiCacheFullAccess
  • AmazonS3FullAccess
  • IAMFullAccess (for creating roles)
  • AmazonVPCFullAccess
  • ElasticLoadBalancingFullAccess
  • CloudWatchLogsFullAccess
  • SecretsManagerReadWrite
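Attaching ten policies by hand in the console is tedious; a loop like this does it in one pass. The user name is a placeholder — pick one that fits your naming convention.

```shell
# Create the Terraform/deploy user (name is a placeholder) and attach
# each AWS-managed policy from the list above.
USER_NAME="comm-ng-terraform"
aws iam create-user --user-name "$USER_NAME"

for policy in \
  AmazonEC2ContainerRegistryFullAccess AmazonECS_FullAccess \
  AmazonRDSFullAccess AmazonElastiCacheFullAccess AmazonS3FullAccess \
  IAMFullAccess AmazonVPCFullAccess ElasticLoadBalancingFullAccess \
  CloudWatchLogsFullAccess SecretsManagerReadWrite; do
  aws iam attach-user-policy --user-name "$USER_NAME" \
    --policy-arn "arn:aws:iam::aws:policy/$policy"
done
```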

Terraform Setup

Initial Configuration

  1. Configure AWS Credentials

    aws configure

    Enter:

    • AWS Access Key ID
    • AWS Secret Access Key
    • Default region: us-east-1
    • Default output format: json
  2. Navigate to Infrastructure Directory

    cd infra
  3. Initialize Terraform

    This downloads provider plugins and sets up the backend:

    terraform init

File Structure

infra/
├── provider.tf          # Terraform & AWS provider configuration
├── variables.tf         # All configurable variables with descriptions
├── locals.tf            # Local values and computed variables
├── data.tf              # Data sources (VPC, subnets, etc.)
├── networking.tf        # Security groups, ALB, target groups
├── database.tf          # RDS PostgreSQL and ElastiCache
├── secrets.tf           # Secrets Manager secrets
├── storage.tf           # S3 buckets, ECR repositories
├── ecs.tf               # ECS cluster, services, task definitions
├── iam.tf               # IAM roles and policies
├── monitoring.tf        # CloudWatch logs and EventBridge
├── scheduler.tf         # Infrastructure scheduler Lambda
├── outputs.tf           # Output values
├── terraform.tfvars     # Checked-in dev defaults (can be copied)
├── terraform.tfvars.dev.example   # Dev template for new environments
└── terraform.tfvars.prod.example  # Prod environment template
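To stand up a new environment, start from the matching template and edit the values before running Terraform:

```shell
cd infra

# New dev environment: start from the dev template.
cp terraform.tfvars.dev.example terraform.tfvars

# For production, use the prod template instead:
# cp terraform.tfvars.prod.example terraform.tfvars
```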

Planning Changes

Before applying changes, always review the execution plan:

terraform plan

This shows:

  • Resources to be created (green +)
  • Resources to be modified (yellow ~)
  • Resources to be destroyed (red -)

Save a plan for later application:

terraform plan -out=tfplan

Applying Changes

Option 1: Apply directly

terraform apply

Review the plan and type yes to confirm.

Option 2: Apply a saved plan

terraform apply tfplan

Auto-approve (use with caution):

terraform apply -auto-approve

HTTPS & Domain Setup (ACM)

If you have a custom domain, you can set up SSL/TLS using AWS Certificate Manager (ACM).

  1. Configure Your Domain Edit terraform.tfvars and add your domain name:

    domain_name = "dev.yourdomain.com"
  2. Apply Terraform

    terraform apply
  3. Get DNS Validation CNAME Record Terraform will output the CNAME record needed for DNS validation:

    terraform output acm_certificate_validation_records
  4. Add CNAME to DNS Add the output CNAME record to your DNS provider (e.g., Route53, GoDaddy, Cloudflare).

  5. Wait for Validation AWS will automatically validate the certificate once the DNS record propagates.

Initial Deployment

  1. Apply Infrastructure

    cd infra
    terraform init
    terraform plan
    terraform apply
  2. Note Important Outputs

    After successful apply, Terraform will output:

    • alb_dns_name - Your application URL
    • ecr_server_repository_url - Server ECR URL
    • ecr_web_repository_url - Web ECR URL
    • db_instance_endpoint - Database endpoint
    • cache_endpoint - Valkey/Redis endpoint
    • ecs_cluster_name - ECS cluster name
    • ecs_server_service_name - Server service name
    • ecs_web_service_name - Web service name
    • vapid_keys_secret_arn - VAPID keys secret ARN (needs manual population)

    Save these values for GitHub Actions configuration.

  3. Set Up VAPID Keys for Push Notifications

    Generate and store VAPID keys in AWS Secrets Manager:

    # Generate VAPID keys
    npx web-push generate-vapid-keys
    
    # Store in Secrets Manager
    aws secretsmanager put-secret-value \
      --secret-id dev/comm-ng/vapid-keys \
      --secret-string '{
        "publicKey": "YOUR_VAPID_PUBLIC_KEY",
        "privateKey": "YOUR_VAPID_PRIVATE_KEY",
        "contactEmail": "mailto:admin@yourdomain.com"
      }'

    See SECRETS-SETUP.md for detailed instructions.

  4. Build and Push Initial Docker Images

    Before ECS services can run, you need initial images in ECR:

    # Get ECR login
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-account-id>.dkr.ecr.us-east-1.amazonaws.com
    
    # Build and push server image
    cd ../server
    docker build -t comm-ng-server .
    docker tag comm-ng-server:latest <ecr_server_repository_url>:latest
    docker push <ecr_server_repository_url>:latest
    
    # Build and push web image
    cd ../web
    docker build -t comm-ng-web .
    docker tag comm-ng-web:latest <ecr_web_repository_url>:latest
    docker push <ecr_web_repository_url>:latest
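The login/build/tag/push sequence in step 4 can be wrapped in one short script. This sketch assumes the ECR repositories are named comm-ng-server and comm-ng-web (matching the image names above) and that it runs from the repository root.

```shell
# Build, tag, and push both images in one pass.
REGION="us-east-1"
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

# Authenticate Docker against ECR.
aws ecr get-login-password --region "$REGION" |
  docker login --username AWS --password-stdin "$REGISTRY"

for svc in server web; do
  docker build -t "comm-ng-${svc}" "./${svc}"
  docker tag "comm-ng-${svc}:latest" "${REGISTRY}/comm-ng-${svc}:latest"
  docker push "${REGISTRY}/comm-ng-${svc}:latest"
done
```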

Viewing State

List all resources:

terraform state list

Show specific resource details:

terraform state show aws_ecs_service.server

View outputs:

terraform output
terraform output alb_dns_name

Destroying Infrastructure

⚠️ Warning: This will delete ALL resources

terraform destroy

Review the destruction plan carefully before typing yes.

Common Terraform Commands

# Format Terraform files
terraform fmt

# Validate configuration
terraform validate

# Show current state
terraform show

# Refresh state from AWS
terraform refresh

# Target specific resource
terraform apply -target=aws_ecs_service.server

# View dependency graph
terraform graph | dot -Tpng > graph.png

Versioning Strategy

The deployment workflows automatically manage application versions using semantic versioning:

Version Bumping Rules

  • Main branch deployments: Bump minor version

    • Example: 1.0.5 → 1.1.0
    • Use for: Production releases, feature deployments
  • Non-main branch deployments: Bump patch version

    • Example: 1.0.5 → 1.0.6
    • Use for: Development deployments, bug fixes, testing

How It Works

  1. When you trigger a deployment, the workflow:

    • Checks out your specified branch
    • Runs npm version minor (main) or npm version patch (others)
    • Updates package.json and package-lock.json
    • Commits with message: chore(server|web): bump version to X.Y.Z [skip ci]
    • Pushes the commit to your branch
    • Continues with build and deployment
  2. The [skip ci] tag prevents the commit from triggering another workflow run

  3. Version is displayed in deployment summary
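You can preview both bump rules locally in a scratch directory; this only touches a temp directory, and --no-git-tag-version keeps npm from creating commits or tags.

```shell
# Try the two bump rules the workflow applies.
cd "$(mktemp -d)" && npm init -y >/dev/null
npm version 1.0.5 --no-git-tag-version
npm version minor --no-git-tag-version   # main branch rule:     1.0.5 -> 1.1.0
npm version patch --no-git-tag-version   # non-main branch rule: 1.1.0 -> 1.1.1
```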

Manual Version Management

If you need to bump major version or set a specific version:

# In server/ or web/ directory
npm version major      # 1.0.0 → 2.0.0
npm version 2.5.3      # Set to specific version

git add package.json package-lock.json
git commit -m "chore: bump version to X.Y.Z"
git push

Then deploy normally - the workflow will bump from your new base version.

GitHub Actions Setup

You will need to create an IAM user for GitHub Actions and attach a policy like the one below. (The JSON is shown as the output of aws iam get-policy-version; the policy document itself is the inner Document object.)

{
    "PolicyVersion": {
        "Document": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "ecr:GetAuthorizationToken",
                        "ecr:BatchCheckLayerAvailability",
                        "ecr:GetDownloadUrlForLayer",
                        "ecr:BatchGetImage",
                        "ecr:PutImage",
                        "ecr:InitiateLayerUpload",
                        "ecr:UploadLayerPart",
                        "ecr:CompleteLayerUpload",
                        "ecr:DescribeRepositories",
                        "ecr:ListImages"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "ecs:DescribeTaskDefinition",
                        "ecs:RegisterTaskDefinition",
                        "ecs:UpdateService",
                        "ecs:DescribeServices"
                    ],
                    "Resource": "*"
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        "iam:PassRole"
                    ],
                    "Resource": "*",
                    "Condition": {
                        "StringLike": {
                            "iam:PassedToService": "ecs-tasks.amazonaws.com"
                        }
                    }
                }
            ]
        },
        "VersionId": "v1",
        "IsDefaultVersion": true,
        "CreateDate": "2025-11-04T17:13:03+00:00"
    }
}
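Once the inner Document object above is saved to a file, the policy can be created and attached from the CLI. The policy and user names below are placeholders.

```shell
# Assumes the inner "Document" object above was saved as gh-actions-policy.json.
aws iam create-policy \
  --policy-name comm-ng-github-actions-deploy \
  --policy-document file://gh-actions-policy.json

# Attach it to the CI user (user and policy names are placeholders).
ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
aws iam attach-user-policy \
  --user-name comm-ng-github-actions \
  --policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/comm-ng-github-actions-deploy"
```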

Step 1: Create GitHub Secrets

  1. Go to your repository on GitHub
  2. Navigate to Settings → Secrets and variables → Actions
  3. Click New repository secret

Add the following secrets:

Secret Name | Value | Description
--- | --- | ---
AWS_ACCESS_KEY_ID | Your AWS Access Key | IAM user access key
AWS_SECRET_ACCESS_KEY | Your AWS Secret Key | IAM user secret key
DEPLOY_KEY | SSH Private Key | SSH private key for git operations (version bumping)

Setting up the DEPLOY_KEY

The DEPLOY_KEY is required for the workflow to push version bumps back to the repository.

  1. Generate an SSH Key Pair:

    ssh-keygen -t ed25519 -C "github-actions" -f gh-deploy-key -N ""
  2. Add Public Key to Repository:

    • Go to Settings → Deploy keys
    • Click Add deploy key
    • Title: GitHub Actions Deploy Key
    • Key: Paste content of gh-deploy-key.pub
    • Check "Allow write access" (Crucial for version bumping)
    • Click Add key
  3. Add Private Key to Secrets:

    • Go to Settings → Secrets and variables → Actions
    • Click New repository secret
    • Name: DEPLOY_KEY
    • Value: Paste content of gh-deploy-key (the private key)
    • Click Add secret

Step 2: Create GitHub Environments

  1. Go to Settings → Environments

  2. Create three environments:

    • dev
    • staging (optional)
    • production (optional)
  3. For each environment, configure:

    • Protection rules (optional):
      • Required reviewers for production
      • Wait timer
    • Environment secrets (if different from repo secrets)

Step 3: Verify Workflow Files

The workflows are located at:

  • .github/workflows/deploy-server.yml - Deploys Node.js backend
  • .github/workflows/deploy-web.yml - Deploys Next.js frontend

Both workflows:

  • Trigger manually via workflow_dispatch
  • Accept an environment input (dev/staging/production)
  • Build Docker images
  • Push to ECR
  • Deploy to ECS with zero-downtime rolling updates

Deployment Guide

Pre-Deployment Checklist

Before triggering a deployment, ensure:

  • AWS Account: Active, IAM user created, CLI configured.
  • Local Tools: Terraform (>=1.5.0), Docker, Node.js 24+ installed.
  • App Config:
    • Server listens on port 3000, Web on 3001.
    • DATABASE_URL and REDIS_AUTH handled via Secrets Manager.
    • NODE_ENV=production set in task definitions.
  • Terraform: terraform plan runs without errors.
  • GitHub:
    • Secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DEPLOY_KEY) configured.
    • Environments (dev, staging, production) created.

Manual Deployment via GitHub Actions

  1. Navigate to Actions tab in your GitHub repository

  2. Deploy Server:

    • Select "Deploy Server to ECS" workflow
    • Click "Run workflow"
    • Enter branch to deploy from (e.g., main, feature/ecs-deployment)
    • Choose environment (dev, staging, or production)
    • Click "Run workflow"
    • The workflow will automatically:
      • Bump version (minor for main, patch for other branches)
      • Commit and push the version change
      • Build and deploy
  3. Deploy Web:

    • Select "Deploy Web to ECS" workflow
    • Click "Run workflow"
    • Enter branch and select environment
    • Click "Run workflow"
    • Version will be automatically bumped and committed
  4. Monitor Deployment:

    • Click on the running workflow to see live logs
    • Each step shows progress
    • Final step shows deployment summary with:
      • Service name
      • Cluster name
      • Image tag
      • Commit SHA

Deployment Process

The GitHub Actions workflows perform:

  1. Checkout code - Gets latest code from specified branch
  2. Configure Git - Sets up git credentials for version commits
  3. Bump version - Updates package.json version:
    • Main branch: Minor version bump (1.0.0 → 1.1.0)
    • Other branches: Patch version bump (1.0.0 → 1.0.1)
  4. Commit & push - Commits version change with [skip ci] to avoid loops
  5. Configure AWS - Authenticates with AWS using secrets
  6. Login to ECR - Authenticates Docker with ECR
  7. Build Docker image - Builds your application container
  8. Tag images - Tags with commit SHA and latest
  9. Push to ECR - Uploads images to container registry
  10. Download task definition - Gets current ECS task config
  11. Update task definition - Inserts new image reference
  12. Deploy to ECS - Triggers rolling update
  13. Wait for stability - Ensures deployment succeeds

Rolling Updates

ECS performs zero-downtime deployments:

  1. Launches new tasks with updated image
  2. Waits for new tasks to pass health checks
  3. Drains connections from old tasks
  4. Terminates old tasks
  5. Auto-scaling adjusts to traffic during deployment
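If you want to block until a rolling update settles (for example, in a script), the built-in ECS waiter covers it; service names below come from the dev outputs.

```shell
# Returns once both services reach a steady state; fails after the
# waiter's default polling window (~10 minutes) if they never stabilize.
aws ecs wait services-stable \
  --cluster dev-comm-ng-cluster \
  --services dev-comm-ng-server-service dev-comm-ng-web-service
```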

First Deployment Notes

After Terraform creates the infrastructure:

  1. Initial State: ECS services will fail to start because no images exist
  2. Fix: Run the GitHub Actions workflows OR manually push images (see Initial Deployment, step 4)
  3. Subsequent Deployments: Use GitHub Actions exclusively

Monitoring and Troubleshooting

AWS Console Access

ECS Service Status:

AWS Console → ECS → Clusters → dev-comm-ng-cluster → Services

View Logs:

AWS Console → CloudWatch → Log Groups
- /ecs/dev-comm-ng-server
- /ecs/dev-comm-ng-web

Load Balancer Health:

AWS Console → EC2 → Load Balancers → dev-comm-ng-alb → Target Groups

CLI Monitoring

Check service status:

aws ecs describe-services \
  --cluster dev-comm-ng-cluster \
  --services dev-comm-ng-server-service dev-comm-ng-web-service

View recent logs (server):

aws logs tail /ecs/dev-comm-ng-server --follow --since 10m

View recent logs (web):

aws logs tail /ecs/dev-comm-ng-web --follow --since 10m

Check task status:

aws ecs list-tasks --cluster dev-comm-ng-cluster --service-name dev-comm-ng-server-service

Describe a specific task:

aws ecs describe-tasks \
  --cluster dev-comm-ng-cluster \
  --tasks <task-arn>
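When a task has already exited, the stop reason usually pinpoints the failure; a --query keeps the output readable. Stopped tasks can be found with list-tasks using --desired-status STOPPED.

```shell
# Show why a stopped task exited (image pull failure, health check, OOM, ...).
TASK_ARN="<task-arn>"   # e.g. from: aws ecs list-tasks ... --desired-status STOPPED
aws ecs describe-tasks \
  --cluster dev-comm-ng-cluster \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].[lastStatus, stoppedReason]'
```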

Common Issues

1. ECS Tasks Failing Health Checks

Symptoms: Tasks start but are marked unhealthy and terminated

Solution:

  • Check application logs in CloudWatch
  • Verify health check endpoint exists:
    • Server: GET /health should return 200
    • Web: GET / should return 200
  • Ensure application listens on correct port (3000 for server, 3001 for web)
  • Check environment variables are set correctly
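Before redeploying, it is worth reproducing the health check against a locally running container; this sketch uses the server image and port from the configuration above.

```shell
# Run the server image locally and hit the same endpoint the ALB probes.
docker run -d --rm --name comm-ng-health-check -p 3000:3000 comm-ng-server
sleep 5
curl -fsS -o /dev/null -w '%{http_code}\n' http://localhost:3000/health \
  || echo "health check failed"
docker stop comm-ng-health-check
```

For the web image, substitute port 3001 and probe / instead of /health.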

2. "Unable to pull image" Error

Symptoms: Task fails with ECR authentication error

Solution:

# Verify image exists
aws ecr describe-images --repository-name comm-ng-server

# Ensure ECS task execution role has ECR permissions
# (Already configured in Terraform)

3. Database Connection Failures

Symptoms: Application logs show database connection errors

Solution:

  • Verify RDS is running: terraform state show aws_db_instance.dev_db_comm_ng
  • Check security groups allow traffic from ECS tasks
  • Verify DATABASE_URL secret is correctly configured
  • Test the connection from inside an ECS task (requires ECS Exec to be enabled on the service):
    aws ecs execute-command \
      --cluster dev-comm-ng-cluster \
      --task <task-id> \
      --container server \
      --interactive \
      --command "/bin/sh"

4. Auto-scaling Not Working

Symptoms: Service doesn't scale despite high load

Solution:

  • Check CloudWatch alarms for scaling policies:
    aws application-autoscaling describe-scaling-activities \
      --service-namespace ecs \
      --resource-id service/dev-comm-ng-cluster/dev-comm-ng-server-service
  • Verify metrics are being published
  • Check cooldown periods haven't been triggered recently

5. 502 Bad Gateway from ALB

Symptoms: Load balancer returns 502 errors

Causes:

  • No healthy targets in target group
  • Application not responding on expected port
  • Health check failing

Solution:

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn>

# Review ALB access logs
# Enable ALB logging in Terraform if needed

Accessing Application

After deployment, access your application:

# Get ALB DNS name
terraform output alb_dns_name

# Test endpoints
curl http://<alb-dns-name>/
curl http://<alb-dns-name>/api/health

Cost Monitoring

View estimated costs:

AWS Console → Billing → Bills

Key cost factors:

  • ECS Fargate: Based on vCPU and memory per second
  • RDS: db.t3.micro instance hours
  • ElastiCache: Valkey storage and compute
  • ALB: Per hour + data processed
  • Data transfer: Outbound data

Cost optimization:

  • Auto-scaling reduces costs during low traffic
  • Consider Reserved Instances for production
  • Review CloudWatch logs retention (currently 7 days)

Infrastructure Updates

Updating ECS Task Configuration

  1. Edit the task definitions in infra/ecs.tf
  2. Run terraform plan to review changes
  3. Run terraform apply
  4. ECS will automatically deploy updated task definitions

Scaling Configuration

To change auto-scaling limits:

# In ecs.tf, modify:
resource "aws_appautoscaling_target" "server" {
  max_capacity       = 20  # Increase max capacity
  min_capacity       = 2   # Set minimum baseline
  # ...
}

Updating Docker Images

Images are updated through GitHub Actions. Manual updates:

# Build new image
docker build -t <ecr-url>:v2.0 .

# Push to ECR
docker push <ecr-url>:v2.0

# Update ECS service
aws ecs update-service \
  --cluster dev-comm-ng-cluster \
  --service dev-comm-ng-server-service \
  --force-new-deployment

Security Best Practices

  1. Secrets Management:

    • Never commit AWS credentials
    • Use AWS Secrets Manager for sensitive data
    • Rotate credentials regularly
  2. Network Security:

    • ECS tasks run in default VPC
    • Security groups restrict traffic
    • Consider moving to private subnets for production
  3. IAM Permissions:

    • Follow principle of least privilege
    • Use separate IAM roles for different environments
    • Enable MFA for AWS console access
  4. Image Security:

    • Enable ECR image scanning (already configured)
    • Review scan results before deployment
    • Keep base images updated
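Scan results can be reviewed from the CLI as well as the console; the repository name matches the one used earlier in this guide.

```shell
# Summarize the latest scan findings for the most recent server image.
aws ecr describe-image-scan-findings \
  --repository-name comm-ng-server \
  --image-id imageTag=latest \
  --query 'imageScanFindings.findingSeverityCounts'
```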

Cheat Sheet / Quick Reference

Common Commands

Terraform:

cd infra
terraform init      # Initialize
terraform plan      # Preview changes
terraform apply     # Apply changes
terraform output    # View outputs

AWS CLI:

# View logs
aws logs tail /ecs/dev-comm-ng-server --follow --since 10m

# Force new deployment
aws ecs update-service --cluster dev-comm-ng-cluster --service dev-comm-ng-server-service --force-new-deployment

# Check service status
aws ecs describe-services --cluster dev-comm-ng-cluster --services dev-comm-ng-server-service

Docker:

# Build locally
docker build -t comm-ng-server ./server

# Manual push (if needed)
aws ecr get-login-password | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag comm-ng-server:latest <ecr-url>:latest
docker push <ecr-url>:latest

Support

For infrastructure issues:

  1. Check this documentation
  2. Review CloudWatch logs
  3. Check AWS Service Health Dashboard
  4. Open an issue in the repository

Last Updated: November 2, 2025