A comprehensive ETL pipeline built on AWS Glue, S3, and database scripts, with a Terraform-based CI/CD pipeline designed for easy management by junior developers.
aws-etl-pipeline/
├── .github/
│ └── workflows/
│ ├── deploy-dev.yml
│ ├── deploy-staging.yml
│ └── deploy-prod.yml
├── glue_jobs/
│ ├── extract/
│ ├── transform/
│ └── load/
├── database/
│ ├── migrations/
│ ├── schemas/
│ └── scripts/
├── terraform/
│ ├── environments/
│ ├── modules/
│ └── shared/
├── config/
│ ├── dev.yml
│ ├── staging.yml
│ └── prod.yml
├── scripts/
└── docs/
- AWS CLI configured
- Terraform >= 1.0
- Python 3.8+
- Docker (for local testing)
- Clone the repository
    git clone https://github.com/144853/aws-etl-pipeline.git
    cd aws-etl-pipeline
- Configure environment variables
    cp config/dev.yml.example config/dev.yml
    # Edit config/dev.yml with your settings
- Initialize Terraform
    cd terraform/environments/dev
    terraform init
    terraform plan
    terraform apply
Simple Configuration Changes:
- Edit files in the config/ directory to change environment settings
- Modify glue_jobs/ scripts for ETL logic changes
- Update database schemas in database/schemas/
Common Tasks:
- Add new Glue job: Copy template from glue_jobs/templates/
- Database changes: Add migration in database/migrations/
- Config updates: Edit YAML files in config/
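As an illustration of the database migration task above, here is a minimal sketch of how migration files could be ordered before they are applied. It assumes a hypothetical naming convention of a numeric prefix followed by an underscore (for example `001_create_tables.sql`); this repository may use a different scheme, and the `pending_migrations` helper is illustrative, not part of the project.

```python
import re
from pathlib import Path

# Assumed (hypothetical) convention: "<number>_<description>.sql"
MIGRATION_PATTERN = re.compile(r"^(\d+)_.+\.sql$")


def pending_migrations(migrations_dir, last_applied=0):
    """Return migration file names with a version above last_applied, in numeric order."""
    found = []
    for path in Path(migrations_dir).iterdir():
        match = MIGRATION_PATTERN.match(path.name)
        if match is None:
            continue  # Skip files that do not follow the assumed convention
        version = int(match.group(1))
        if version > last_applied:
            found.append((version, path.name))
    # Sort on the integer version so that 10 comes after 2
    return [name for _, name in sorted(found)]
```

Calling `pending_migrations("database/migrations", last_applied=3)` would list only the files numbered above 3. Sorting on the parsed integer rather than the file name avoids the common mistake of applying `10_...` before `2_...`.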
All configurations are centralized in YAML files for easy management:
- config/dev.yml - Development environment
- config/staging.yml - Staging environment
- config/prod.yml - Production environment
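The per-environment files listed above could be selected and loaded along these lines. This is a sketch under assumptions: the helper names `config_path` and `load_config` are hypothetical, and parsing relies on the third-party PyYAML package, which this README does not list as a dependency.

```python
from pathlib import Path

# Environment names taken from the config files listed in this README
VALID_ENVIRONMENTS = ("dev", "staging", "prod")


def config_path(environment, config_dir="config"):
    """Return the path of the YAML file for the given environment."""
    if environment not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"Unknown environment {environment!r}; expected one of {VALID_ENVIRONMENTS}"
        )
    return Path(config_dir) / f"{environment}.yml"


def load_config(environment, config_dir="config"):
    """Parse the environment's YAML file into a dict (requires PyYAML)."""
    import yaml  # Third-party, assumed installed: pip install pyyaml

    with open(config_path(environment, config_dir), encoding="utf-8") as handle:
        # safe_load returns None for an empty file, so fall back to an empty dict
        return yaml.safe_load(handle) or {}
```

Rejecting unknown environment names up front means a typo such as `load_config("prd")` fails with a clear message instead of a missing-file error.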
graph TD
A[Data Sources] --> B[S3 Raw Bucket]
B --> C[AWS Glue ETL Jobs]
C --> D[S3 Processed Bucket]
D --> E[Data Warehouse/RDS]
F[GitHub] --> G[GitHub Actions]
G --> H[Terraform]
H --> I[AWS Infrastructure]
The pipeline automatically:
- Validates Terraform configurations
- Tests Glue job syntax
- Deploys infrastructure changes
- Updates Glue jobs and database schemas
- Runs integration tests
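The README does not say how the pipeline tests Glue job syntax. One possible approach, sketched here with only the Python standard library, is to parse every script without running it. The `find_syntax_errors` helper and the default `glue_jobs` path are assumptions for illustration.

```python
import ast
from pathlib import Path


def find_syntax_errors(jobs_dir="glue_jobs"):
    """Parse every .py file under jobs_dir and return (file, message) pairs for failures."""
    errors = []
    for path in sorted(Path(jobs_dir).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        try:
            # ast.parse checks syntax only; it does not import or execute the script
            ast.parse(source, filename=str(path))
        except SyntaxError as exc:
            errors.append((str(path), f"line {exc.lineno}: {exc.msg}"))
    return errors
```

A CI step could call `find_syntax_errors()` and fail the build when the returned list is non-empty. Because the scripts are only parsed, the check runs without the AWS Glue libraries installed.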
- Create feature branch
- Make changes
- Test locally
- Submit PR
- Pipeline automatically deploys to dev environment
For questions or issues:
- Check docs/troubleshooting.md
- Create an issue in this repository
- Contact the data engineering team
MIT License - see LICENSE file for details.