This project implements a production-style data engineering pipeline using:
- Apache Spark (Databricks) for distributed processing
- Delta Lake for ACID-compliant storage
- Medallion Architecture (Bronze → Silver → Gold)
- Apache Airflow (Dockerized) for orchestration
- Incremental Processing with Watermarking
- SLA Monitoring & Email Alerts
The goal is to simulate a real-world lakehouse architecture with proper layering, incremental ingestion, orchestration, and production-grade setup.
```
NYC Taxi Parquet (Raw - ADLS)
        ↓
Bronze Layer (Delta)
        ↓
Silver Layer (Cleaned + Deduped + Standardized)
        ↓
Gold Layer (Aggregated Business Metrics)
        ↓
Airflow (Orchestration & SLA Monitoring)
```
- Databricks Jobs
  - Bronze Ingestion
  - Silver Transformation
  - Gold Aggregation
- Airflow
  - Dockerized setup
  - Scheduler + Webserver
  - Postgres metadata
  - SLA monitoring
  - Databricks API integration
Bronze Layer responsibilities:
- Raw ingestion from Parquet (ADLS)
- Add ingestion metadata: `ingestion_date`, `ingestion_ts`, `source_file`
- Incremental load using watermark table
- Append-only Delta table
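The Bronze logic above can be sketched in plain Python (a simplified stand-in for the actual PySpark job; record shapes and field names here are illustrative assumptions, not the project's schema):

```python
from datetime import datetime, timezone

# Hypothetical batch standing in for Parquet rows read from ADLS.
raw_records = [
    {"trip_id": 1, "event_ts": "2024-01-01T10:00:00"},
    {"trip_id": 2, "event_ts": "2024-01-02T09:30:00"},
]

def incremental_bronze_load(records, last_watermark, source_file):
    """Keep only records newer than the stored watermark and stamp
    ingestion metadata, mirroring the append-only Bronze write."""
    now = datetime.now(timezone.utc)
    out = []
    for rec in records:
        # ISO-8601 timestamps compare correctly as strings.
        if rec["event_ts"] > last_watermark:
            out.append({
                **rec,
                "ingestion_date": now.date().isoformat(),
                "ingestion_ts": now.isoformat(),
                "source_file": source_file,
            })
    return out

batch = incremental_bronze_load(
    raw_records, "2024-01-01T12:00:00", "trips_2024_01.parquet"
)
# Only trip_id 2 is newer than the watermark, so only it is appended.
```

In the real job the same filter runs as a Spark predicate against the watermark table, and the result is appended to the Bronze Delta table.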
Silver Layer responsibilities:
- Data cleaning
- Null handling
- Deduplication
- Type casting
- Column standardization
- Business rule validation
- Incremental merge logic
Deduplication Strategy:
- Based on business keys (e.g., trip identifiers)
- Uses Delta merge for idempotent writes
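The effect of the key-based MERGE can be illustrated with a small pure-Python sketch (column names are assumptions; the actual job runs this as a Delta `MERGE` on the cluster):

```python
def dedupe_latest(rows, key="trip_id", order_col="updated_ts"):
    """Keep the most recent row per business key — the same outcome the
    Silver MERGE produces. Re-running on already-deduped data changes
    nothing, which is what makes the write idempotent."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_col] > best[k][order_col]:
            best[k] = row
    return list(best.values())

rows = [
    {"trip_id": "A", "fare": 10.0, "updated_ts": "2024-01-01T00:00:00"},
    {"trip_id": "A", "fare": 12.5, "updated_ts": "2024-01-02T00:00:00"},  # later duplicate wins
    {"trip_id": "B", "fare": 7.0,  "updated_ts": "2024-01-01T00:00:00"},
]
deduped = dedupe_latest(rows)
# Two rows survive; trip A keeps the later fare of 12.5.
```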
Gold Layer responsibilities:
- Aggregations for analytics
- Business KPIs
- Date-based incremental refresh
- Partition-aware updates
- Optimized for BI consumption
Examples:
- Daily revenue
- Average fare per vendor
- Trip distribution metrics
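Two of the example KPIs, sketched in plain Python for clarity (the real job expresses these as Spark aggregations; the trip records here are invented sample data):

```python
from collections import defaultdict

trips = [
    {"vendor": "V1", "date": "2024-01-01", "fare": 10.0},
    {"vendor": "V1", "date": "2024-01-01", "fare": 20.0},
    {"vendor": "V2", "date": "2024-01-01", "fare": 8.0},
]

def daily_revenue(rows):
    """Total fare per day — the 'daily revenue' Gold metric."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["date"]] += r["fare"]
    return dict(totals)

def avg_fare_per_vendor(rows):
    """Mean fare per vendor — the 'average fare per vendor' Gold metric."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r["vendor"]] += r["fare"]
        counts[r["vendor"]] += 1
    return {v: sums[v] / counts[v] for v in sums}
```

Because the Gold refresh is date-partitioned, only partitions touched by new Silver data are recomputed.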
Incremental processing is implemented via:
- Metadata watermark table
- Last processed timestamp tracking
- Delta MERGE operations
- Idempotent design
This design ensures:
- No full reload required
- Scalable to large datasets
- Production-friendly reprocessing
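The watermark-table mechanics can be sketched as follows (an in-memory dict stands in for the persisted Delta metadata table; table and field names are illustrative):

```python
# In-memory stand-in for the Delta watermark table.
watermarks = {"bronze_trips": "2024-01-01T00:00:00"}

def run_incremental(table, records, watermarks):
    """Process only rows past the stored watermark, then advance it.
    Re-running with the same input returns nothing — idempotent by design."""
    wm = watermarks.get(table, "")
    new = [r for r in records if r["event_ts"] > wm]
    if new:
        watermarks[table] = max(r["event_ts"] for r in new)
    return new

events = [
    {"event_ts": "2024-01-02T08:00:00"},
    {"event_ts": "2023-12-31T23:00:00"},  # already behind the watermark
]
first = run_incremental("bronze_trips", events, watermarks)
second = run_incremental("bronze_trips", events, watermarks)
# first picks up one new row; second is empty, so reruns are safe.
```

This is why no full reload is ever required: each run touches only data newer than the last processed timestamp.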
Dockerized Airflow stack:
- Postgres (metadata)
- Webserver (UI)
- Scheduler (task execution)
- Environment-based secrets
- Databricks connection via env variable
DAG task flow: Bronze → Silver → Gold
Features:
- Retries
- SLA monitoring
- Email alerts
- Persistent metadata storage
No credentials are stored in code.
Secrets are managed via:
- `.env` file (ignored via `.gitignore`)
- Environment variables such as `AIRFLOW_CONN_DATABRICKS_DEFAULT`
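Airflow can build the Databricks connection entirely from the environment; since Airflow 2.3 the value may be a JSON document rather than a URI. A sketch with placeholder values (the workspace URL and token below are fake, and putting the PAT in the `password` field is one common pattern, not the only one):

```shell
# Placeholder values — substitute your workspace URL and a real PAT.
export AIRFLOW_CONN_DATABRICKS_DEFAULT='{
  "conn_type": "databricks",
  "host": "https://adb-1234567890.0.azuredatabricks.net",
  "password": "dapiXXXXXXXXXXXX"
}'
```

Because the connection lives only in the environment (sourced from `.env`), nothing secret ever reaches the repository.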
1. `git clone` the repository and `cd` into the project directory
2. `cp .env.example .env`, then fill in the required variables:
   - Databricks PAT
   - SMTP credentials
   - Admin password
3. `docker compose up -d`
4. Open the UI at http://localhost:8090
- PySpark
- Delta Lake
- Databricks Jobs API
- Apache Airflow 2.8
- Docker & Docker Compose
- Postgres
- Gmail SMTP (for SLA alerts)
- Layered lakehouse design
- Incremental data processing
- Metadata-driven orchestration
- Idempotent transformations
- Production-like Airflow deployment
- Secure credential management
- Reproducible local infrastructure
- CI/CD for DAG validation
- Slack alerts instead of email
- Data quality checks (Great Expectations)
- Kubernetes deployment
- Cost-based partition optimization
Author: Adarsh (Data Engineer)