
NYC Taxi Lakehouse Pipeline (Bronze → Silver → Gold)

Overview

This project implements a production-style data engineering pipeline using:

  • Apache Spark (Databricks) for distributed processing
  • Delta Lake for ACID-compliant storage
  • Medallion Architecture (Bronze → Silver → Gold)
  • Apache Airflow (Dockerized) for orchestration
  • Incremental Processing with Watermarking
  • SLA Monitoring & Email Alerts

The goal is to simulate a real-world lakehouse architecture with proper layering, incremental ingestion, orchestration, and production-grade setup.


🏗 Architecture

```
NYC Taxi Parquet (Raw - ADLS)
        ↓
Bronze Layer (Delta)
        ↓
Silver Layer (Cleaned + Deduped + Standardized)
        ↓
Gold Layer (Aggregated Business Metrics)
        ↓
Airflow (Orchestration & SLA Monitoring)
```

Components

  • Databricks Jobs

    • Bronze Ingestion
    • Silver Transformation
    • Gold Aggregation
  • Airflow

    • Dockerized setup
    • Scheduler + Webserver
    • Postgres metadata
    • SLA monitoring
    • Databricks API integration

🥉 Bronze Layer

Responsibilities:

  • Raw ingestion from Parquet (ADLS)
  • Add ingestion metadata:
    • ingestion_date
    • ingestion_ts
    • source_file
  • Incremental load using watermark table
  • Append-only Delta table
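A minimal sketch of what this Bronze step can look like on Databricks. All paths, table names, and the watermark lookup below are illustrative assumptions, not the repo's actual configuration:

```python
# Illustrative Bronze ingestion sketch (paths/table names are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze_ingest").getOrCreate()

RAW_PATH = "abfss://raw@<storage-account>.dfs.core.windows.net/nyc_taxi/"
BRONZE_TABLE = "bronze.nyc_taxi_trips"

# Only pick up records newer than the last recorded watermark.
last_watermark = (
    spark.table("meta.watermarks")
    .filter(F.col("table_name") == BRONZE_TABLE)
    .agg(F.max("last_processed_ts"))
    .collect()[0][0]
)

df = (
    spark.read.parquet(RAW_PATH)
    .filter(F.col("tpep_pickup_datetime") > F.lit(last_watermark))
    # Ingestion metadata columns from the list above.
    .withColumn("ingestion_date", F.current_date())
    .withColumn("ingestion_ts", F.current_timestamp())
    .withColumn("source_file", F.input_file_name())
)

# Append-only write into the Bronze Delta table.
df.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
```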

🥈 Silver Layer

Responsibilities:

  • Data cleaning
  • Null handling
  • Deduplication
  • Type casting
  • Column standardization
  • Business rule validation
  • Incremental merge logic

Deduplication Strategy:

  • Based on business keys (e.g., trip identifiers)
  • Uses Delta merge for idempotent writes
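The dedup-then-merge pattern can be sketched as below. The business keys, column names, and table names are assumptions for illustration; the actual keys live in the Silver job:

```python
# Illustrative Silver dedup + MERGE sketch (keys and names are assumptions).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("silver_transform").getOrCreate()

bronze = spark.table("bronze.nyc_taxi_trips")

# Deduplicate within the batch: keep the latest record per business key.
key_cols = ["vendor_id", "tpep_pickup_datetime", "tpep_dropoff_datetime"]
w = Window.partitionBy(*key_cols).orderBy(F.col("ingestion_ts").desc())
cleaned = (
    bronze
    .dropna(subset=key_cols)                 # null handling on the keys
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Idempotent upsert into Silver via Delta MERGE: re-running the same
# batch updates rows in place instead of duplicating them.
silver = DeltaTable.forName(spark, "silver.nyc_taxi_trips")
(
    silver.alias("t")
    .merge(
        cleaned.alias("s"),
        " AND ".join(f"t.{c} = s.{c}" for c in key_cols),
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```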

🥇 Gold Layer

Responsibilities:

  • Aggregations for analytics
  • Business KPIs
  • Date-based incremental refresh
  • Partition-aware updates
  • Optimized for BI consumption

Examples:

  • Daily revenue
  • Average fare per vendor
  • Trip distribution metrics
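The example KPIs above could be produced with an aggregation along these lines (column names, table names, and the refresh window are illustrative assumptions):

```python
# Illustrative Gold aggregation sketch (names and dates are assumptions).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold_aggregate").getOrCreate()

silver = spark.table("silver.nyc_taxi_trips")

daily_metrics = (
    silver
    .withColumn("trip_date", F.to_date("tpep_pickup_datetime"))
    .groupBy("trip_date", "vendor_id")
    .agg(
        F.sum("total_amount").alias("daily_revenue"),
        F.avg("fare_amount").alias("avg_fare_per_vendor"),
        F.count("*").alias("trip_count"),
    )
)

# Partition-aware incremental refresh: overwrite only the affected
# date partitions instead of rewriting the whole table.
(
    daily_metrics.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "trip_date >= '2024-01-01'")  # example window
    .partitionBy("trip_date")
    .saveAsTable("gold.daily_trip_metrics")
)
```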

🔁 Incremental Processing

Implemented via:

  • Metadata watermark table
  • Last processed timestamp tracking
  • Delta MERGE operations
  • Idempotent design

Ensures:

  • No full reload required
  • Scalable to large datasets
  • Production-friendly reprocessing
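The watermark bookkeeping itself is simple. A minimal pure-Python sketch of the idea (field names here are illustrative, and in the pipeline this state lives in a Delta metadata table rather than in memory):

```python
from datetime import datetime

def filter_new_records(records, last_watermark):
    """Keep only records strictly newer than the stored watermark."""
    return [r for r in records if r["event_ts"] > last_watermark]

def advance_watermark(records, last_watermark):
    """New watermark = max event timestamp seen; unchanged if the batch is empty."""
    if not records:
        return last_watermark
    return max(r["event_ts"] for r in records)

# Example: processing the same data twice is idempotent -- the second
# pass finds nothing newer than the advanced watermark.
watermark = datetime(2024, 1, 1)
batch = [
    {"event_ts": datetime(2024, 1, 2), "fare": 12.5},
    {"event_ts": datetime(2024, 1, 3), "fare": 8.0},
]
new_rows = filter_new_records(batch, watermark)    # both rows are new
watermark = advance_watermark(new_rows, watermark) # advances to 2024-01-03
rerun = filter_new_records(batch, watermark)       # [] -- nothing reprocessed
```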

⏱ Orchestration (Airflow)

Dockerized Airflow stack:

  • Postgres (metadata)
  • Webserver (UI)
  • Scheduler (task execution)
  • Environment-based secrets
  • Databricks connection via env variable

DAG Structure

Bronze → Silver → Gold

Features:

  • Retries
  • SLA monitoring
  • Email alerts
  • Persistent metadata storage
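A sketch of what the Bronze → Silver → Gold DAG can look like with the Databricks provider. The job IDs, schedule, alert email, and SLA value below are placeholders, not the repo's actual settings:

```python
# Illustrative DAG sketch (job IDs, schedule, emails, SLA are assumptions).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
)

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],
    "email_on_failure": True,
    "sla": timedelta(hours=1),  # breach triggers the SLA email alert
}

with DAG(
    dag_id="nyc_taxi_lakehouse",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Each task triggers an existing Databricks Job via the Jobs API,
    # using the connection from AIRFLOW_CONN_DATABRICKS_DEFAULT.
    bronze = DatabricksRunNowOperator(task_id="bronze_ingest", job_id=111)
    silver = DatabricksRunNowOperator(task_id="silver_transform", job_id=222)
    gold = DatabricksRunNowOperator(task_id="gold_aggregate", job_id=333)

    bronze >> silver >> gold
```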

🔐 Secrets Management

No credentials are stored in code.

Secrets are managed via:

  • .env file (ignored via .gitignore)
  • Environment variables
  • AIRFLOW_CONN_DATABRICKS_DEFAULT
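For example, the Databricks connection can be supplied entirely through the environment. Everything below is a placeholder value, and the exact variable names in this repo's `.env.example` may differ:

```shell
# .env -- never committed (covered by .gitignore). Placeholder values only.
# Airflow builds the Databricks connection from this AIRFLOW_CONN_* URI;
# the token-as-password URI form is one common convention.
AIRFLOW_CONN_DATABRICKS_DEFAULT='databricks://:<databricks-pat>@<workspace-host>'

# SMTP credentials for SLA / failure email alerts
SMTP_USER=<gmail-address>
SMTP_PASSWORD=<gmail-app-password>

# Airflow admin login
AIRFLOW_ADMIN_PASSWORD=<admin-password>
```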

🐳 Running Locally

1️⃣ Clone repo

  • git clone
  • cd project

2️⃣ Create environment file

  • cp .env.example .env
  • Fill required variables:
    • Databricks PAT
    • SMTP credentials
    • Admin password

3️⃣ Start Airflow
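Assuming a standard Compose file at the repo root (the filename and port are assumptions), the stack comes up with:

```shell
# Build images and start Postgres, the webserver, and the scheduler.
docker compose up -d --build

# Verify all services are healthy; the Airflow UI is typically at
# http://localhost:8080 once the webserver is up.
docker compose ps

# Follow scheduler logs if a task isn't being picked up.
docker compose logs -f scheduler
```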

📦 Tech Stack

  • PySpark
  • Delta Lake
  • Databricks Jobs API
  • Apache Airflow 2.8
  • Docker & Docker Compose
  • Postgres
  • Gmail SMTP (for SLA alerts)

🧠 Engineering Highlights

  • Layered lakehouse design
  • Incremental data processing
  • Metadata-driven orchestration
  • Idempotent transformations
  • Production-like Airflow deployment
  • Secure credential management
  • Reproducible local infrastructure

🚀 Future Improvements

  • CI/CD for DAG validation
  • Slack alerts instead of email
  • Data quality checks (Great Expectations)
  • Kubernetes deployment
  • Cost-based partition optimization

Author

Adarsh (Data Engineer)

About

End-to-end data engineering pipeline on Databricks with Delta Lake, incremental watermarking, and Dockerized Airflow orchestration (Postgres-backed, SLA-enabled).
