A comprehensive ETL (Extract, Transform, Load) pipeline built with Apache Airflow for automated data processing, encryption/decryption, validation, and database operations. This project demonstrates modern data engineering practices with containerized deployment and real-time data processing capabilities.
This Airflow-based ETL system:
- Processes encrypted data from multiple API sources
- Performs decryption using Java, ASP.NET, and PHP algorithms
- Validates data integrity
- Stores results in PostgreSQL
Key features:
- Automated DAG scheduling and task orchestration
- Bash operators for modular task execution
- Comprehensive logging and monitoring
- Robust error handling
- Containerized with Docker Compose
- RESTful endpoints for data reception
- CORS-enabled
- Real-time data processing
- Token-based authentication
- Dynamic table creation (date-based naming)
- Structured logs with timestamps
- Connection pooling
- Data persistence and retrieval
- Decryption support for Java, ASP.NET, PHP
- Token-based encryption key management
- Integrity and validation checks
- Secure encrypted data pipeline
- Data extraction from multiple APIs
- Multi-language decryption
- Data validation and integrity checks
- Dynamic PostgreSQL table creation
- Logging and error tracking
- Token-based encryption key management
- External API validation (level codes)
- CSV export & temporary file handling
- Timezone-aware timestamp processing
- Docker-based multi-service setup
- Airflow UI + Scheduler
- PostgreSQL + pgAdmin
- Flask API server
- Health checks and monitoring
- Python 3.8+
- Docker + Docker Compose
- PostgreSQL
- Clone the Repository

```bash
git clone <repository-url>
cd airflow_tutorial
```

- Start Services

```bash
docker-compose up -d
```

- Access Interfaces
  - Airflow UI: http://localhost:8080 (user: airflow, pass: airflow)
  - Flask API: http://localhost:5001
  - pgAdmin: http://localhost:15432 (email: [email protected], pass: postgres)
  - PostgreSQL: localhost:5434
- Initialize DAGs

DAGs are automatically loaded from the dags/ folder.
```
airflow_tutorial/
├── dags/
│   ├── final_etl.py          # Main ETL DAG
│   ├── data_migrations.py    # Migration DAG
│   └── tomtom/               # Additional DAGs
├── data/
│   ├── config.py             # Config & endpoints
│   ├── database.py           # DB setup & models
│   ├── etl_m12.py            # ETL core logic
│   ├── api_utils.py          # API fetching utils
│   ├── decryption_utils.py   # Decryption functions
│   ├── validation_utils.py   # Validation logic
│   └── logging_config.py     # Logging setup
├── docker-compose.yaml       # Multi-service orchestration
├── Dockerfile                # Airflow Docker image
├── requirements.txt          # Python packages
└── flask_app.py              # Flask API server
```
- fetch_the_api → Fetch external API data
- validate_the_api → Validate API data
- log_the_data → Configure logging
- decrypt_the_data → Decrypt based on algorithm
- validate_the_data → Final data validation
- final_etl_processing → Store processed data
- table_created → Create PostgreSQL tables dynamically
- fetch_the_api → Perform local DB ETL operations
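As a rough sketch of how these tasks could be wired together in dags/final_etl.py (the BashOperator commands and script paths below are illustrative assumptions, not the project's exact implementation):

```python
# Illustrative DAG skeleton only -- task ids match the list above, but the
# bash commands and paths are assumptions, not the project's real wiring.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="final_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # automated scheduling
    catchup=False,
) as dag:
    fetch_the_api = BashOperator(
        task_id="fetch_the_api",
        bash_command="python /opt/airflow/data/api_utils.py",
    )
    validate_the_api = BashOperator(
        task_id="validate_the_api",
        bash_command="python /opt/airflow/data/validation_utils.py --stage api",
    )
    log_the_data = BashOperator(
        task_id="log_the_data",
        bash_command="python /opt/airflow/data/logging_config.py",
    )
    decrypt_the_data = BashOperator(
        task_id="decrypt_the_data",
        bash_command="python /opt/airflow/data/decryption_utils.py",
    )
    validate_the_data = BashOperator(
        task_id="validate_the_data",
        bash_command="python /opt/airflow/data/validation_utils.py --stage data",
    )
    final_etl_processing = BashOperator(
        task_id="final_etl_processing",
        bash_command="python /opt/airflow/data/etl_m12.py",
    )

    # Linear ordering: fetch -> validate -> log -> decrypt -> validate -> load
    (fetch_the_api >> validate_the_api >> log_the_data
     >> decrypt_the_data >> validate_the_data >> final_etl_processing)
```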
Supports automatic decryption based on portLanguageId:
- Java → decrypt_java_api()
- ASP.NET → decrypt_asp_dot_net_api()
- PHP → decrypt_php_data()
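A minimal sketch of what this dispatch could look like in decryption_utils.py; the portLanguageId values, helper signatures, and key handling are assumptions for illustration only:

```python
# decryption_utils.py -- illustrative dispatch sketch; the portLanguageId
# values, helper signatures, and key handling are assumptions.
from typing import Callable, Dict

def decrypt_java_api(payload: str, key: str) -> str:
    """Placeholder for the Java-compatible decryption routine."""
    raise NotImplementedError

def decrypt_asp_dot_net_api(payload: str, key: str) -> str:
    """Placeholder for the ASP.NET-compatible decryption routine."""
    raise NotImplementedError

def decrypt_php_data(payload: str, key: str) -> str:
    """Placeholder for the PHP-compatible decryption routine."""
    raise NotImplementedError

# Hypothetical mapping from portLanguageId to the matching routine.
DECRYPTORS: Dict[str, Callable[[str, str], str]] = {
    "java": decrypt_java_api,
    "asp.net": decrypt_asp_dot_net_api,
    "php": decrypt_php_data,
}

def decrypt_record(record: dict, encryption_key: str) -> str:
    """Select the decryption routine based on the record's portLanguageId."""
    language = str(record.get("portLanguageId", "")).lower()
    decryptor = DECRYPTORS.get(language)
    if decryptor is None:
        raise ValueError(f"Unsupported portLanguageId: {language!r}")
    return decryptor(record["payload"], encryption_key)
```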
Includes:
- Level code verification from external APIs
- Field integrity checks (ministryCode, projectCode, etc.)
- Null/empty value detection
- Token validation
- Granularity & hierarchy validation
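For example, the null/empty field checks might look roughly like the following sketch (the required-field list and error wording are assumptions, not the exact validation_utils.py code):

```python
# validation_utils.py -- illustrative null/empty checks; the required-field
# list and error wording are assumptions, not the project's exact logic.
from typing import Dict, List

REQUIRED_FIELDS = ["ministryCode", "projectCode", "token"]

def validate_record(record: Dict[str, object]) -> List[str]:
    """Return validation errors; an empty list means the record passes."""
    errors: List[str] = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":  # null/empty value detection
            errors.append(f"Missing or empty field: {field}")
    return errors

# Example: a record with an empty projectCode fails the integrity check.
print(validate_record({"ministryCode": "M01", "projectCode": "", "token": "abc123"}))
# -> ['Missing or empty field: projectCode']
```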
Example:
```sql
CREATE TABLE logs_YYYY_MM_DD (
    id SERIAL PRIMARY KEY,
    error VARCHAR,
    token VARCHAR,
    error_details TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
```

- POST /recvApiData → Receive API data
- GET /getData → Fetch stored/encrypted data
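A minimal sketch of these two Flask routes, assuming flask-cors for the CORS support and a placeholder in-memory store instead of the real PostgreSQL-backed logic in flask_app.py:

```python
# flask_app.py -- illustrative sketch only; the real token check and storage
# logic live in the project's database layer, not in this in-memory list.
from flask import Flask, jsonify, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # CORS-enabled so external clients can post data

RECEIVED = []  # stand-in for the PostgreSQL-backed store

@app.route("/recvApiData", methods=["POST"])
def recv_api_data():
    token = request.headers.get("Authorization")
    if not token:                        # token-based authentication
        return jsonify({"error": "missing token"}), 401
    RECEIVED.append(request.get_json(force=True))
    return jsonify({"status": "received"}), 201

@app.route("/getData", methods=["GET"])
def get_data():
    return jsonify(RECEIVED)             # stored/encrypted data

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
```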
- Data API: http://localhost:3000/getData
- Encryption Key API: http://localhost:3000/getEncryptionKey?token=
- Validation API: http://10.23.124.59:2222/validateDataLevelCode
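For instance, the token-based key lookup could be called like this (the JSON field name in the response is an assumption about the Encryption Key API's shape):

```python
# api_utils.py -- illustrative call to the encryption key endpoint; the
# "encryptionKey" response field is an assumption, not a documented contract.
import requests

ENCRYPTION_KEY_API = "http://localhost:3000/getEncryptionKey"

def fetch_encryption_key(token: str, timeout: int = 10) -> str:
    """Exchange a record's token for its decryption key."""
    response = requests.get(ENCRYPTION_KEY_API, params={"token": token}, timeout=timeout)
    response.raise_for_status()
    return response.json()["encryptionKey"]
```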
```bash
DATABASE_URL="postgresql://postgres:[email protected]:5432/demo"
AIRFLOW_UID=50000
AIRFLOW_WWW_USER_USERNAME=mehul
AIRFLOW_WWW_USER_PASSWORD=mehul
```
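database.py could then build its connection-pooled engine from DATABASE_URL; the sketch below assumes SQLAlchemy and uses illustrative pool settings rather than the project's actual values:

```python
# database.py -- illustrative engine setup from DATABASE_URL; SQLAlchemy and
# the pool sizes are assumptions chosen to show the connection-pooling feature.
import os

from sqlalchemy import create_engine

DATABASE_URL = os.environ["DATABASE_URL"]

engine = create_engine(
    DATABASE_URL,
    pool_size=5,          # connection pooling
    max_overflow=10,
    pool_pre_ping=True,   # drop dead connections before reuse
)
```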
- Extract: Gather data from sources
- Transform: Clean, validate, reshape data
- Load: Store into target DB or warehouse
- Plain, script-based ETL lacks error tracking and conditional logic
- It is hard to monitor, scale, and debug

Airflow addresses this with:
- Visual DAGs and workflow management
- Built-in logging, retries, alerts
- Parallel execution and scheduling
- Web UI for monitoring
- Modular, scalable, and production-grade
Contributions are welcome: submit a PR or open an issue!
Created by Mehul Gupta to demonstrate a modern, production-ready ETL pipeline with Airflow, Docker, Flask, and PostgreSQL. Includes secure multi-language decryption, robust validation, and real-time processing.
This project is licensed under the MIT License. See the LICENSE file for more details.
Happy data processing!