This guide provides comprehensive instructions for rolling back deployments in the deplight-platform infrastructure.
- Overview
- Rollback Methods
- Prerequisites
- Quick Start
- Detailed Procedures
- Troubleshooting
- Best Practices
The deplight-platform supports multiple rollback strategies:
- 🤖 Automatic Rollback 🆕 Self-Healing: Deploy fails → Auto-rollback to last successful version
- ECS Task Definition Rollback ⭐ Most Reliable: Rollback to a specific Task Definition revision
- Terraform Rollback: Redeploy a previous image tag by re-running Terraform
- CodeDeploy Auto-Rollback: Automatic rollback on deployment failure
- Manual CodeDeploy Rollback: Stop current deployment to trigger rollback
| Method | Use Case | Speed | Complexity | Automated | Reliability | Manual Intervention |
|---|---|---|---|---|---|---|
| 🤖 Automatic Rollback 🆕 | Deploy fails → instant rollback | ~3-5 min | Low | ✅ Full | ⭐⭐⭐⭐⭐ | None |
| ECS Task Definition ⭐ | Most reliable, captures full task config | ~3-5 min | Low | Partial | ⭐⭐⭐⭐⭐ | Select revision |
| GitHub Actions Workflow | Recommended for manual rollbacks | ~5-10 min | Low | Yes | ⭐⭐⭐⭐ | Trigger workflow |
| Terraform Script | Manual rollback, local execution | ~5-10 min | Medium | Partial | ⭐⭐⭐⭐ | Run script locally |
| CodeDeploy Auto | Failed deployments | ~2-5 min | Low | Yes | ⭐⭐⭐ | Pre-configured |
| CodeDeploy Manual | Stop in-progress deployment | ~2-5 min | Low | Partial | ⭐⭐⭐ | Stop deployment |
- AWS credentials configured (IAM role or access keys)
- Access to the GitHub repository
- Knowledge of the target image tag (commit SHA) to rollback to
- AWS CLI installed and configured
- Terraform CLI (v1.6.6+) installed
- Bash shell environment
To identify the version to roll back to:

```bash
# List recent ECR images
aws ecr describe-images \
  --repository-name delightful-deploy \
  --region ap-northeast-2 \
  --query 'sort_by(imageDetails,&imagePushedAt)[-10:].[imageTags[0],imagePushedAt]' \
  --output table

# List recent commits
git log --oneline -10

# Get current deployed image tag
aws ecs describe-services \
  --cluster delightful-deploy-cluster \
  --services delightful-deploy-service \
  --region ap-northeast-2 \
  --query 'services[0].taskDefinition' \
  --output text | xargs -I {} aws ecs describe-task-definition \
  --task-definition {} \
  --region ap-northeast-2 \
  --query 'taskDefinition.containerDefinitions[0].image' \
  --output text
```

The Best Option: No action needed! Rollback happens automatically on deployment failure.
How it works:
1. Deploy workflow fails (Terraform apply error, ECS update fails, etc.)
↓
2. Auto-rollback workflow triggers automatically (workflow_run event)
↓
3. Fetches last successful deployment's image tag from artifacts
↓
4. Finds matching ECS Task Definition revision
↓
5. Rolls back ECS service to that revision
↓
6. Waits for service to stabilize
↓
7. ✅ Service restored to last known good state
Features:
- ✅ Zero manual intervention - Happens automatically
- ✅ Fast recovery - 3-5 minutes total
- ✅ Infinite loop prevention - Won't rollback a rollback
- ✅ Safety checks - Validates revisions before rolling back
- ✅ Notification - GitHub workflow summary shows what happened
When it runs:
- Terraform apply fails
- ECS service update fails
- Any step in deployment workflow fails
When it doesn't run:
- No previous successful deployment exists (first deploy)
- The failed workflow was already a rollback (prevents loops)
- Deployment succeeds
Monitoring: Check GitHub Actions to see auto-rollback history.
Disabling auto-rollback: If you need to disable it temporarily, disable the "Auto Rollback on Deployment Failure" workflow in GitHub Actions settings.
- Navigate to Actions tab in GitHub
- Select "Rollback Deployment" workflow
- Click "Run workflow"
- Fill in the parameters:
  - environment: `dev` or `prod`
  - image_tag: Previous commit SHA (e.g., `abc123d`)
  - rollback_type: `terraform` (recommended) or `codedeploy`
  - confirm: Type `ROLLBACK` exactly
- Click "Run workflow" and monitor progress
```bash
# Navigate to repository root
cd /path/to/deplight-infra

# Run rollback script
./ops/scripts/rollback/rollback.sh <environment> <image_tag>

# Example
./ops/scripts/rollback/rollback.sh prod abc123d
```

```bash
# Recommended: Rollback to a specific Task Definition revision
./ops/scripts/rollback/ecs-taskdef-rollback.sh <environment> [revision]

# Interactive mode (lists recent revisions)
./ops/scripts/rollback/ecs-taskdef-rollback.sh prod

# Direct rollback to revision 42
./ops/scripts/rollback/ecs-taskdef-rollback.sh prod 42
```

Why use this method?
- Captures complete task configuration (CPU, memory, env vars, etc.)
- More reliable than image tag alone
- Faster rollback (direct ECS API call)
- No Terraform state changes
```bash
# For in-progress deployments
./ops/scripts/rollback/codedeploy-rollback.sh <environment>

# Example
./ops/scripts/rollback/codedeploy-rollback.sh prod
```

When to use: Always enabled - no action needed from you!
How it works technically:
The automatic rollback system consists of three components:
Every successful deployment saves its state:
```
deployment-state/
├── last-successful-image-tag.txt   # e.g., "abc123d"
├── environment.txt                 # "dev" or "prod"
├── commit-sha.txt                  # Full commit SHA
└── timestamp.txt                   # ISO 8601 timestamp
```

These artifacts are stored for 30 days and retrieved during rollback.
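As a rough illustration, the state-saving step in the deploy workflow amounts to writing those files and uploading the directory as an artifact. The variable names below are assumptions; the file names come from the tree above.

```bash
# Sketch of the "save deployment state" step (variable names are assumptions)
mkdir -p deployment-state
echo "${IMAGE_TAG}"   > deployment-state/last-successful-image-tag.txt
echo "${ENVIRONMENT}" > deployment-state/environment.txt
echo "${GITHUB_SHA}"  > deployment-state/commit-sha.txt
date -u +"%Y-%m-%dT%H:%M:%SZ" > deployment-state/timestamp.txt
# The deployment-state/ directory is then uploaded as a workflow artifact
# (e.g., actions/upload-artifact with retention-days: 30)
```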
Triggered by workflow_run event when "Deploy Service" completes:
```yaml
on:
  workflow_run:
    workflows: ["Deploy Service"]
    types: [completed]
```

Checks:
- ✅ Was the deployment workflow conclusion `failure`?
- ✅ Is this NOT already a rollback workflow? (prevents loops)
- ✅ Does a previous successful deployment exist?
If all checks pass → proceed to rollback
Step-by-step process:
1. Fetch last successful deployment
   - Downloads artifact from last successful workflow run
   - Extracts image tag (e.g., `abc123d`)
2. Find matching Task Definition
   - Lists recent Task Definition revisions (last 20)
   - Searches for revision with matching image tag (see the sketch after this list)
   - Falls back to `current_revision - 1` if not found
3. Safety checks
   - Ensures target revision < current revision (prevents rollback to same/newer)
   - Verifies Task Definition exists in ECS
4. Execute rollback
   ```bash
   aws ecs update-service \
     --cluster delightful-deploy-cluster \
     --service delightful-deploy-service \
     --task-definition delightful-deploy:42
   ```
5. Wait for stability
   ```bash
   aws ecs wait services-stable \
     --cluster delightful-deploy-cluster \
     --services delightful-deploy-service
   ```
6. Verify rollback
   - Checks current Task Definition revision
   - Confirms it matches target revision
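A minimal sketch of step 2 (finding the revision that matches an image tag), assuming the `delightful-deploy` family and that the image tag appears as the suffix of the container image reference:

```bash
# Find the most recent Task Definition revision whose container image matches a tag
TARGET_TAG="abc123d"
for ARN in $(aws ecs list-task-definitions \
      --family-prefix delightful-deploy \
      --sort DESC --max-items 20 \
      --region ap-northeast-2 \
      --query 'taskDefinitionArns[]' --output text); do
  IMAGE=$(aws ecs describe-task-definition \
      --task-definition "$ARN" \
      --region ap-northeast-2 \
      --query 'taskDefinition.containerDefinitions[0].image' --output text)
  if [[ "$IMAGE" == *":${TARGET_TAG}" ]]; then
    echo "Matching revision: $ARN"
    break
  fi
done
```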
What happens after auto-rollback:
✅ Success Case:
- GitHub workflow summary shows rollback details
- Service is running previous stable version
- You can investigate the failure, fix it, and re-deploy
❌ Failure Case (no previous deployment):
- Workflow creates notification summary
- Manual intervention required
- This only happens on very first deployment
Infinite Loop Prevention:
The system prevents rollback loops:
```
if workflow_name contains "Rollback":
    skip_auto_rollback()  # Don't rollback a rollback!
```

Monitoring Auto-Rollback:
```bash
# View recent auto-rollback runs
gh run list --workflow=auto-rollback.yml

# View specific auto-rollback details
gh run view <run-id>
```

Disabling Temporarily:
If you need to debug deployment failures without auto-rollback:
- Go to GitHub → Settings → Actions → Workflows
- Find "Auto Rollback on Deployment Failure"
- Click "Disable workflow"
- After debugging, re-enable it
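The same toggle is available from the GitHub CLI, using the workflow file name shown in the monitoring commands above:

```bash
# Temporarily disable the auto-rollback workflow while debugging
gh workflow disable auto-rollback.yml

# ...investigate and fix the failing deployment...

# Re-enable it when done
gh workflow enable auto-rollback.yml
```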
When to use: Fastest and most reliable rollback, recommended for production issues
Why this method?
- Rolls back entire task configuration, not just image
- Includes CPU, memory, environment variables, logging config
- Faster than Terraform (direct API call)
- No risk of Terraform state issues
Steps:
1. Find target Task Definition revision (a sketch for comparing two revisions follows these steps):
   ```bash
   # List recent Task Definition revisions
   aws ecs list-task-definitions \
     --family-prefix delightful-deploy \
     --sort DESC \
     --max-items 10 \
     --region ap-northeast-2

   # Get details of specific revision
   aws ecs describe-task-definition \
     --task-definition delightful-deploy:42 \
     --region ap-northeast-2
   ```
2. Run rollback script (Interactive Mode):
   ```bash
   ./ops/scripts/rollback/ecs-taskdef-rollback.sh prod
   ```
   The script will:
   - Show recent Task Definition revisions with images and timestamps
   - Prompt you to select a revision number
   - Perform database migration safety checks
   - Require `ROLLBACK-PROD` confirmation for production
   - Update ECS service
   - Wait for service to stabilize
   - Verify rollback succeeded
3. Or direct rollback (if you know the revision):
   ```bash
   ./ops/scripts/rollback/ecs-taskdef-rollback.sh prod 42
   ```
4. Verify rollback:
   ```bash
   # Check service status
   aws ecs describe-services \
     --cluster delightful-deploy-cluster \
     --services delightful-deploy-service \
     --region ap-northeast-2

   # Verify Task Definition revision
   aws ecs describe-services \
     --cluster delightful-deploy-cluster \
     --services delightful-deploy-service \
     --region ap-northeast-2 \
     --query 'services[0].taskDefinition'
   ```
5. Monitor:
   - CloudWatch Logs: `/aws/ecs/delightful-deploy`
   - ECS Service Events
   - Application health checks
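Before rolling back, it can help to confirm what actually differs between the current and target revisions. A minimal sketch, assuming `jq` is installed and using the revision numbers from the example above:

```bash
# Compare container definitions of two Task Definition revisions
diff \
  <(aws ecs describe-task-definition --task-definition delightful-deploy:42 \
      --region ap-northeast-2 --output json \
      --query 'taskDefinition.containerDefinitions' | jq -S .) \
  <(aws ecs describe-task-definition --task-definition delightful-deploy:43 \
      --region ap-northeast-2 --output json \
      --query 'taskDefinition.containerDefinitions' | jq -S .)
```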
When to use: Rollback to any previous version, most controlled approach
Steps:
1. Identify the target image tag:
   ```bash
   # Check recent deployments in git history
   git log --oneline -10

   # Verify image exists in ECR
   aws ecr describe-images \
     --repository-name delightful-deploy \
     --image-ids imageTag=abc123d \
     --region ap-northeast-2
   ```
2. Trigger the rollback workflow:
   - Go to GitHub Actions → "Rollback Deployment"
   - Select parameters:
     - Environment: `prod`
     - Image Tag: `abc123d`
     - Rollback Type: `terraform`
     - Confirm: `ROLLBACK`
3. Monitor the rollback:
   - Watch the GitHub Actions logs
   - Check the workflow summary for verification
4. Verify the rollback:
   ```bash
   # Check ECS service status
   aws ecs describe-services \
     --cluster delightful-deploy-cluster \
     --services delightful-deploy-service \
     --region ap-northeast-2

   # Check running tasks
   aws ecs list-tasks \
     --cluster delightful-deploy-cluster \
     --service-name delightful-deploy-service \
     --region ap-northeast-2
   ```
When to use: When GitHub Actions is unavailable or you prefer local control
Steps:
1. Prepare environment:
   ```bash
   cd deplight-infra
   git checkout roll-back
   git pull origin roll-back
   ```
2. Configure AWS credentials:
   ```bash
   # Via environment variables
   export AWS_ACCESS_KEY_ID=your-key
   export AWS_SECRET_ACCESS_KEY=your-secret
   export AWS_REGION=ap-northeast-2

   # Or use AWS CLI profiles
   export AWS_PROFILE=your-profile
   ```
3. Run rollback script (a rough sketch of what it does under the hood follows these steps):
   ```bash
   ./ops/scripts/rollback/rollback.sh prod abc123d
   ```
4. Review and confirm:
   - The script will show the Terraform plan
   - Review changes carefully
   - Type `yes` when prompted
   - Type `yes` again to apply
5. Verify deployment:
   ```bash
   # Check Terraform outputs
   cd infrastructure/environments/prod
   terraform output
   ```
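For orientation, the rollback script presumably reduces to a Terraform plan/apply with the previous image tag. A minimal sketch, assuming the environment directory layout above and a variable named `image_tag` (check the script for the real variable name):

```bash
# Approximate manual equivalent of the rollback script (variable name assumed)
cd infrastructure/environments/prod
terraform init
terraform plan -var="image_tag=abc123d" -out=rollback.tfplan
terraform apply rollback.tfplan
```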
When to use: Automatic rollback on deployment failure (already configured)
How it works:
- CodeDeploy monitors deployment health
- On failure (health checks, alarms), automatically rolls back
- No manual intervention required
Verify auto-rollback is enabled:
```bash
aws deploy get-deployment-group \
  --application-name deplight-prod-app \
  --deployment-group-name prod-deployment-group \
  --region ap-northeast-2 \
  --query 'deploymentGroupInfo.autoRollbackConfiguration'
```

When to use: Stop an in-progress problematic deployment
Steps:
1. Find current deployment:
   ```bash
   aws deploy list-deployments \
     --application-name deplight-prod-app \
     --deployment-group-name prod-deployment-group \
     --region ap-northeast-2
   ```
2. Run rollback script:
   ```bash
   ./ops/scripts/rollback/codedeploy-rollback.sh prod
   ```
3. Or manually via AWS CLI:
   ```bash
   # Stop deployment
   aws deploy stop-deployment \
     --deployment-id d-XXXXXXXXX \
     --auto-rollback-enabled \
     --region ap-northeast-2
   ```
4. Monitor rollback (or use the waiter shown after these steps):
   ```bash
   # Watch deployment status
   aws deploy get-deployment \
     --deployment-id d-XXXXXXXXX \
     --region ap-northeast-2
   ```
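Note that stopping with --auto-rollback-enabled creates a new rollback deployment with its own ID. Once you have that ID (e.g., from list-deployments above), you can block until it finishes instead of polling; the ID below is a placeholder:

```bash
# Wait until the rollback deployment completes successfully
# (exits non-zero if the deployment fails or the waiter times out)
aws deploy wait deployment-successful \
  --deployment-id d-YYYYYYYYY \
  --region ap-northeast-2
```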
Problem: Error message "Image tag not found in ECR"
Solution:
```bash
# List available tags
aws ecr describe-images \
  --repository-name delightful-deploy \
  --region ap-northeast-2 \
  --query 'imageDetails[*].imageTags[0]' \
  --output table

# Verify you're using the correct tag format (commit SHA, 7+ chars)
```

Problem: "Error acquiring the state lock"
Solution:
```bash
# Check lock status
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID":{"S":"deplight-infra/terraform.tfstate"}}' \
  --region ap-northeast-2

# Force unlock (use with caution!)
cd infrastructure/environments/<env>
terraform force-unlock <lock-id>
```

Problem: Terraform completes but ECS shows old image
Solution:
```bash
# Force new deployment
aws ecs update-service \
  --cluster delightful-deploy-cluster \
  --service delightful-deploy-service \
  --force-new-deployment \
  --region ap-northeast-2

# Wait for deployment to stabilize
aws ecs wait services-stable \
  --cluster delightful-deploy-cluster \
  --services delightful-deploy-service \
  --region ap-northeast-2
```

Problem: Deployment shows "InProgress" for extended time
Solution:
```bash
# Check deployment events
aws deploy get-deployment \
  --deployment-id d-XXXXXXXXX \
  --region ap-northeast-2 \
  --query 'deploymentInfo.{status:status,creator:creator,createTime:createTime}'

# If truly stuck (>30 minutes), stop it
aws deploy stop-deployment \
  --deployment-id d-XXXXXXXXX \
  --auto-rollback-enabled \
  --region ap-northeast-2
```

Problem: "AccessDenied" or "UnauthorizedOperation" errors
Solution:
```bash
# Verify AWS credentials
aws sts get-caller-identity

# Check IAM permissions
aws iam get-user
aws iam list-attached-user-policies --user-name <your-username>

# For GitHub Actions, verify OIDC role
```

- Document the issue: Record what went wrong and why rollback is needed
- Notify stakeholders: Alert team members about the rollback
- Identify target version: Determine the last known good version
- Check dependencies: Ensure no database migrations or breaking changes
- Monitor closely: Watch logs, metrics, and health checks
- Use staging first: Test rollback in dev/staging before production
- Keep communication open: Update team on progress
- Document steps: Record all commands and actions taken
- Verify functionality: Run smoke tests and health checks
- Monitor for 30 minutes: Watch for any issues post-rollback
- Post-mortem: Conduct incident review
- Update runbooks: Document lessons learned
- Plan fix: Create plan to address the original issue
- Identified correct previous image tag or Task Definition revision
- Verified image exists in ECR (or Task Definition exists)
- Verified database migrations have NOT been applied after target version ⭐
- Confirmed database schema is compatible with target version
- RDS snapshot available (if needed; a sample check follows this checklist)
- down.sql migration scripts prepared (if applicable)
- Notified team members and stakeholders
- Reviewed Terraform plan carefully (if using Terraform rollback)
- Tested rollback in dev/staging environment first
- Environment-specific confirmation completed:
  - Prod: Typed `ROLLBACK-PROD` confirmation
  - Dev: Typed `yes` confirmation
- Have backup plan if rollback fails
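For the RDS snapshot item, a quick way to confirm that a recent snapshot exists; the DB instance identifier is a placeholder and should be replaced with the real one:

```bash
# List the three most recent snapshots for the database
# (DB instance identifier is a placeholder/assumption)
aws rds describe-db-snapshots \
  --db-instance-identifier <your-db-instance> \
  --region ap-northeast-2 \
  --query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-3:].[DBSnapshotIdentifier,SnapshotCreateTime]' \
  --output table
```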
- Terraform state drift check passed (exit code 0)
- ECS task definition updated correctly
- Container image tag matches expected version
- Running task count matches desired count
- Application health checks passing (a smoke-test sketch follows this checklist)
- Ready to monitor deployment for 15-30 minutes
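For the health-check item, a simple post-rollback smoke test might look like the following; the endpoint URL is a placeholder for the service's real health endpoint:

```bash
# Basic post-rollback smoke test (endpoint URL is a placeholder/assumption)
if curl -fsS --max-time 10 https://<your-service-domain>/health > /dev/null; then
  echo "Health check OK"
else
  echo "Health check FAILED" >&2
fi
```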
In case of critical issues, follow this troubleshooting sequence:
```bash
# CloudWatch Dashboard URL
https://console.aws.amazon.com/cloudwatch/home?region=ap-northeast-2#dashboards:

# Check recent alarms
aws cloudwatch describe-alarms \
  --state-value ALARM \
  --region ap-northeast-2
```

```bash
# ECS Service Events
aws ecs describe-services \
  --cluster delightful-deploy-cluster \
  --services delightful-deploy-service \
  --region ap-northeast-2 \
  --query 'services[0].events[0:10]'

# ECS Console URL
https://console.aws.amazon.com/ecs/home?region=ap-northeast-2#/clusters/delightful-deploy-cluster/services/delightful-deploy-service
```

CloudWatch Log Groups:
- ECS Container Logs: `/aws/ecs/delightful-deploy`
- Lambda Logs: `/aws/lambda/delightful-deploy-ai-analyzer`
```bash
# Tail recent ECS logs
aws logs tail /aws/ecs/delightful-deploy \
  --follow \
  --region ap-northeast-2 \
  --since 10m

# CloudWatch Logs Console
https://console.aws.amazon.com/cloudwatch/home?region=ap-northeast-2#logsV2:log-groups/log-group/$252Faws$252Fecs$252Fdelightful-deploy
```

CodeDeploy Logs:
- Deployment History: CodeDeploy Console
- Agent Logs (if using EC2): `/var/log/aws/codedeploy-agent/codedeploy-agent.log`
```bash
# List recent deployments
aws deploy list-deployments \
  --application-name deplight-prod-app \
  --deployment-group-name prod-deployment-group \
  --region ap-northeast-2

# Get deployment details
aws deploy get-deployment \
  --deployment-id <deployment-id> \
  --region ap-northeast-2

# CodeDeploy Console
https://console.aws.amazon.com/codesuite/codedeploy/applications
```

```bash
# Check for drift
cd infrastructure/environments/<env>
terraform plan -detailed-exitcode

# Exit codes:
# 0 = no drift
# 1 = error
# 2 = drift detected
```

If issues persist after following the above steps, escalate with:
- Current symptoms and error messages
- Steps already taken
- Rollback status (completed/failed)
- CloudWatch logs excerpt
- AWS ECS Deployment Documentation
- AWS CodeDeploy Rollback
- Terraform AWS Provider
- Internal Deployment System Docs
Last Updated: 2025-11-08 | Maintained by: Infrastructure Team | Review Frequency: Quarterly