This project implements a production-style data engineering pipeline using:
- Apache Spark (Databricks) for distributed processing
- Delta Lake for ACID-compliant storage
- Medallion Architecture (Bronze → Silver → Gold)
- Apache Airflow (Dockerized) for orchestration
- Incremental Processing with Watermarking
- SLA Monitoring & Email Alerts
The goal is to simulate a real-world lakehouse architecture with proper layering, incremental ingestion, orchestration, and production-grade setup.
```
NYC Taxi Parquet (Raw - ADLS)
        ↓
Bronze Layer (Delta)
        ↓
Silver Layer (Cleaned + Deduped + Standardized)
        ↓
Gold Layer (Aggregated Business Metrics)
        ↓
Airflow (Orchestration & SLA Monitoring)
```
- Databricks Jobs
  - Bronze Ingestion
  - Silver Transformation
  - Gold Aggregation
- Airflow
  - Dockerized setup
  - Scheduler + Webserver
  - Postgres metadata
  - SLA monitoring
  - Databricks API integration
Bronze Layer responsibilities:
- Raw ingestion from Parquet (ADLS)
- Add ingestion metadata: `ingestion_date`, `ingestion_ts`, `source_file`
- Incremental load using watermark table
- Append-only Delta table
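The Bronze logic above can be sketched in plain Python (a simplified stand-in for the actual PySpark job; record shapes and field names here are illustrative assumptions, not the project's schema):

```python
from datetime import datetime, timezone

# Hypothetical batch standing in for Parquet rows read from ADLS.
raw_records = [
    {"trip_id": 1, "event_ts": "2024-01-01T10:00:00"},
    {"trip_id": 2, "event_ts": "2024-01-02T09:30:00"},
]

def incremental_bronze_load(records, last_watermark, source_file):
    """Keep only records newer than the stored watermark and stamp
    ingestion metadata, mirroring the append-only Bronze write."""
    now = datetime.now(timezone.utc)
    out = []
    for rec in records:
        # ISO-8601 timestamps compare correctly as strings.
        if rec["event_ts"] > last_watermark:
            out.append({
                **rec,
                "ingestion_date": now.date().isoformat(),
                "ingestion_ts": now.isoformat(),
                "source_file": source_file,
            })
    return out

batch = incremental_bronze_load(
    raw_records, "2024-01-01T12:00:00", "trips_2024_01.parquet"
)
# Only trip_id 2 is newer than the watermark, so only it is appended.
```

In the real job the same filter runs as a Spark predicate against the watermark table, and the result is appended to the Bronze Delta table.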
Silver Layer responsibilities:
- Data cleaning
- Null handling
- Deduplication
- Type casting
- Column standardization
- Business rule validation
- Incremental merge logic
Deduplication Strategy:
- Based on business keys (e.g., trip identifiers)
- Uses Delta merge for idempotent writes
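The effect of the key-based MERGE can be illustrated with a small pure-Python sketch (column names are assumptions; the actual job runs this as a Delta `MERGE` on the cluster):

```python
def dedupe_latest(rows, key="trip_id", order_col="updated_ts"):
    """Keep the most recent row per business key — the same outcome the
    Silver MERGE produces. Re-running on already-deduped data changes
    nothing, which is what makes the write idempotent."""
    best = {}
    for row in rows:
        k = row[key]
        if k not in best or row[order_col] > best[k][order_col]:
            best[k] = row
    return list(best.values())

rows = [
    {"trip_id": "A", "fare": 10.0, "updated_ts": "2024-01-01T00:00:00"},
    {"trip_id": "A", "fare": 12.5, "updated_ts": "2024-01-02T00:00:00"},  # later duplicate wins
    {"trip_id": "B", "fare": 7.0,  "updated_ts": "2024-01-01T00:00:00"},
]
deduped = dedupe_latest(rows)
# Two rows survive; trip A keeps the later fare of 12.5.
```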
Gold Layer responsibilities:
- Aggregations for analytics
- Business KPIs
- Date-based incremental refresh
- Partition-aware updates
- Optimized for BI consumption
Examples:
- Daily revenue
- Average fare per vendor
- Trip distribution metrics
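Two of the example KPIs, sketched in plain Python for clarity (the real job expresses these as Spark aggregations; the trip records here are invented sample data):

```python
from collections import defaultdict

trips = [
    {"vendor": "V1", "date": "2024-01-01", "fare": 10.0},
    {"vendor": "V1", "date": "2024-01-01", "fare": 20.0},
    {"vendor": "V2", "date": "2024-01-01", "fare": 8.0},
]

def daily_revenue(rows):
    """Total fare per day — the 'daily revenue' Gold metric."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["date"]] += r["fare"]
    return dict(totals)

def avg_fare_per_vendor(rows):
    """Mean fare per vendor — the 'average fare per vendor' Gold metric."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r["vendor"]] += r["fare"]
        counts[r["vendor"]] += 1
    return {v: sums[v] / counts[v] for v in sums}
```

Because the Gold refresh is date-partitioned, only partitions touched by new Silver data are recomputed.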
Incremental processing is implemented via:
- Metadata watermark table
- Last processed timestamp tracking
- Delta MERGE operations
- Idempotent design
This design ensures:
- No full reload required
- Scalable to large datasets
- Production-friendly reprocessing
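The watermark-table mechanics can be sketched as follows (an in-memory dict stands in for the persisted Delta metadata table; table and field names are illustrative):

```python
# In-memory stand-in for the Delta watermark table.
watermarks = {"bronze_trips": "2024-01-01T00:00:00"}

def run_incremental(table, records, watermarks):
    """Process only rows past the stored watermark, then advance it.
    Re-running with the same input returns nothing — idempotent by design."""
    wm = watermarks.get(table, "")
    new = [r for r in records if r["event_ts"] > wm]
    if new:
        watermarks[table] = max(r["event_ts"] for r in new)
    return new

events = [
    {"event_ts": "2024-01-02T08:00:00"},
    {"event_ts": "2023-12-31T23:00:00"},  # already behind the watermark
]
first = run_incremental("bronze_trips", events, watermarks)
second = run_incremental("bronze_trips", events, watermarks)
# first picks up one new row; second is empty, so reruns are safe.
```

This is why no full reload is ever required: each run touches only data newer than the last processed timestamp.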
Dockerized Airflow stack:
- Postgres (metadata)
- Webserver (UI)
- Scheduler (task execution)
- Environment-based secrets
- Databricks connection via env variable
DAG task flow: Bronze → Silver → Gold
Features:
- Retries
- SLA monitoring
- Email alerts
- Persistent metadata storage
No credentials are stored in code.
Secrets are managed via:
- `.env` file (ignored via `.gitignore`)
- Environment variables such as `AIRFLOW_CONN_DATABRICKS_DEFAULT`
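Airflow can build the Databricks connection entirely from the environment; since Airflow 2.3 the value may be a JSON document rather than a URI. A sketch with placeholder values (the workspace URL and token below are fake, and putting the PAT in the `password` field is one common pattern, not the only one):

```shell
# Placeholder values — substitute your workspace URL and a real PAT.
export AIRFLOW_CONN_DATABRICKS_DEFAULT='{
  "conn_type": "databricks",
  "host": "https://adb-1234567890.0.azuredatabricks.net",
  "password": "dapiXXXXXXXXXXXX"
}'
```

Because the connection lives only in the environment (sourced from `.env`), nothing secret ever reaches the repository.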
1. `git clone` the repository and `cd` into the project directory
2. `cp .env.example .env`, then fill in the required variables:
   - Databricks PAT
   - SMTP credentials
   - Admin password
3. `docker compose up -d`
4. Open the UI at http://localhost:8090
- PySpark
- Delta Lake
- Databricks Jobs API
- Apache Airflow 2.8
- Docker & Docker Compose
- Postgres
- Gmail SMTP (for SLA alerts)
- Layered lakehouse design
- Incremental data processing
- Metadata-driven orchestration
- Idempotent transformations
- Production-like Airflow deployment
- Secure credential management
- Reproducible local infrastructure
- CI/CD for DAG validation
- Slack alerts instead of email
- Data quality checks (Great Expectations)
- Kubernetes deployment
- Cost-based partition optimization
Author: Adarsh (Data Engineer)