🚕 nyc-taxi-airflow-dbt-gcp

Project Overview

This repository hosts a robust and scalable Data Engineering ELT (Extract, Load, Transform) pipeline designed to ingest, stage, and transform publicly available trip record data from the NYC Taxi & Limousine Commission (TLC).

The architecture is built on open-source tools: Apache Airflow for orchestration and dbt (data build tool) for analytical transformations. Both run in Docker containers and use Google Cloud Platform (GCP) services for storage and warehousing.

The primary goal is to build a modern, incremental data warehouse in Google BigQuery suitable for downstream analytics and business intelligence (BI).

Architecture

The project follows a modular and event-driven design, ensuring high reliability and maintainability.

[Insert Architecture Diagram Here]
Conceptual diagram of the Data Pipeline flow.

Technology Stack

| Category | Tool | Function |
|---|---|---|
| Orchestration | Apache Airflow | Schedules and manages the entire pipeline (DAGs). |
| Transformation | dbt (Data Build Tool) | Handles complex, incremental SQL transformations and modeling. |
| Cloud Platform | Google Cloud Platform (GCP) | Provides the core infrastructure (Storage & Data Warehouse). |
| Data Warehouse | Google BigQuery | Scalable, serverless, columnar data warehouse. |
| Storage/Staging | Google Cloud Storage (GCS) | Data Lake for staging raw Parquet files. |
| Extraction | Pandas | Python library used in Airflow for initial data reading/cleaning. |
| Containerization | Docker | Used to containerize Airflow and dbt environments. |
| Data Format | Parquet | Optimized columnar storage format for efficiency. |
| Alerting | Discord | Notification system for DAG failure alerts. |

Data Pipeline Stages

Airflow orchestrates the pipeline in three logical steps:

1. Extract & Stage (Airflow)

  • Data Source: NYC TLC Parquet files (publicly available on the web).
  • Process: An Airflow DAG is triggered to fetch the raw data.
    • Pandas is used to read the Parquet files into memory, perform minimal schema validation, and ensure data quality.
    • The cleaned raw data is written to Google Cloud Storage (GCS), serving as the raw data lake/staging area.
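
The extract-and-stage step above can be sketched as follows. This is a minimal, standard-library-only illustration and not the repository's actual DAG code: the required column list, the staging path layout, and both function names are assumptions made for the example. In the real task, the column names would come from the DataFrame that Pandas reads (for example `list(df.columns)`).

```python
from typing import Iterable, List

# Assumed subset of yellow-taxi columns; the real DAG may check a different set.
REQUIRED_COLUMNS = (
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "trip_distance",
    "total_amount",
)


def missing_columns(
    columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required column names that are absent, in required order."""
    present = set(columns)
    return [name for name in required if name not in present]


def staging_object_name(taxi_type: str, year: int, month: int) -> str:
    """Build a GCS object name for one monthly file (hypothetical layout)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    file_name = f"{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet"
    return f"raw/{taxi_type}/{year:04d}/{file_name}"
```

A task could refuse to upload a file when `missing_columns(...)` returns a non-empty list, so that schema drift in the source is caught before anything reaches the staging bucket.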

2. Load (Airflow & BigQuery)

  • Target: The raw table in the Raw dataset in Google BigQuery.
  • Mechanism: Airflow triggers a BigQuery load job to move data from GCS into the BigQuery Raw table.
  • Incremental Logic: The loading process uses a MERGE statement, which makes reruns idempotent and keeps updates efficient.
    • It compares a row_hash for each record so that only new records are inserted and only changed records are updated, which avoids duplicates and reduces processing cost.
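
To make the row_hash idea concrete, here is a small in-memory sketch of MERGE-like semantics. It is only an illustration under stated assumptions: the key column (`trip_id` in the usage below), the choice of SHA-256, and the function names are hypothetical, and the project's real logic runs as SQL inside BigQuery rather than in Python.

```python
import hashlib
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def row_hash(row: Row, columns: Iterable[str]) -> str:
    """Hash the chosen columns in a fixed order so equal rows hash equally.

    Note: None and the empty string hash the same in this simplified version.
    """
    parts = ("" if row.get(c) is None else str(row.get(c)) for c in columns)
    canonical = "\x1f".join(parts)  # unit separator avoids accidental joins
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_rows(
    target: Dict[object, Row],
    incoming: Iterable[Row],
    key: str,
    columns: Tuple[str, ...],
) -> Dict[str, int]:
    """Insert rows with new keys, update rows whose hash changed, skip the rest."""
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row in incoming:
        new_hash = row_hash(row, columns)
        existing = target.get(row[key])
        if existing is None:
            counts["inserted"] += 1
        elif existing["row_hash"] != new_hash:
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
            continue  # identical row: nothing to write
        target[row[key]] = {**row, "row_hash": new_hash}
    return counts
```

Running the same batch through `merge_rows` twice leaves the target unchanged the second time, which is the idempotency property the MERGE step relies on.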

3. Transform (dbt)

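  • Tool: dbt is executed via an Airflow operator (dbt run). Models materialized as incremental process only new or changed rows on each run; passing --full-refresh rebuilds them from scratch.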
  • Transformation Logic: dbt runs a series of SQL models to transform the data:
    1. Prep Models: Clean, standardize, and filter the raw data.
    2. Data Warehouse (DW) Models: Apply business logic and join fact/dimension tables.
    3. Mart Models: Create highly aggregated, optimized tables (e.g., dim_dates, fact_trips_hourly) specifically for BI tools.
  • Output: The transformed and modeled data is stored in the Prep, DW, and Mart tables in BigQuery, ready for consumption.
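
One way an Airflow task might assemble the dbt invocations for the three layers is sketched below. The folder names under `models/` and the function names are assumptions for illustration; the repository's actual layout and operator may differ. The sketch only builds argument lists (suitable for `subprocess.run` or a templated shell command) and does not call dbt itself. The `--select` and `--full-refresh` flags and the `path:` selector are standard dbt CLI features.

```python
from typing import List

# Assumed model folders, mirroring the Prep -> DW -> Mart layering.
LAYERS = ("prep", "dw", "mart")


def dbt_run_command(layer: str, full_refresh: bool = False) -> List[str]:
    """Build the dbt CLI arguments that run every model in one layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    command = ["dbt", "run", "--select", f"path:models/{layer}"]
    if full_refresh:
        # Rebuild incremental models from scratch instead of appending/merging.
        command.append("--full-refresh")
    return command


def pipeline_commands(full_refresh: bool = False) -> List[List[str]]:
    """Return one command per layer, in dependency order."""
    return [dbt_run_command(layer, full_refresh) for layer in LAYERS]
```

Splitting the run by layer gives one Airflow task per layer, so a failure is easier to locate. A single `dbt run` would also work, because dbt resolves model order from `ref()` dependencies on its own.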

Deployment and Setup

Prerequisites

  1. Docker & Docker Compose: Must be installed to run the containerized services.
  2. Google Cloud Platform (GCP) Account: Required for BigQuery and GCS.
  3. GCP Service Account: A JSON key file for a service account with the necessary IAM roles (BigQuery Data Editor, Storage Object Admin).

Local Deployment

  1. Clone the repository:
    git clone https://github.com/ramadiansyah/nyc-taxi-airflow-dbt-gcp.git
    cd nyc-taxi-airflow-dbt-gcp
  2. Configure Environment:
    • Place your GCP Service Account JSON key file in the appropriate directory (e.g., dags/keys/gcp_key.json).
    • Update the .env file with your GCP Project ID, GCS Bucket Name, and Discord Webhook URL.
  3. Initialize & Run: Use Docker Compose to build the images and start the Airflow services:
    docker-compose up --build -d
  4. Airflow Access: Once services are running, access the Airflow UI at http://localhost:8080 (Default credentials: airflow/airflow).
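
Because a missing value in .env tends to surface only when a task fails at runtime, a small start-up check can help. The sketch below is an assumption-laden example: the variable names `GCP_PROJECT_ID`, `GCS_BUCKET_NAME`, and `DISCORD_WEBHOOK_URL` are hypothetical stand-ins for whatever keys the project's .env file actually defines.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; match them to the keys in your .env file.
REQUIRED_SETTINGS = ("GCP_PROJECT_ID", "GCS_BUCKET_NAME", "DISCORD_WEBHOOK_URL")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the required settings, failing early if any is missing or blank."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not source.get(name, "").strip()]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))
    return {name: source[name].strip() for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` with no arguments reads from the process environment, which is where Docker Compose places the values from .env when the file is referenced in the service definition.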

Data Warehouse Configuration

Before running the DAG, ensure the following are configured in your Airflow Connections and GCP:

  1. Airflow Connection: Create a Google Cloud connection named google_cloud_default using your Service Account JSON key.
  2. BigQuery Datasets: Manually create the following datasets in BigQuery (or configure the DAG to create them):
    • nyc_taxi_raw
    • nyc_taxi_dw
  3. dbt Profile: The dbt profile (profiles.yml) must be configured to connect to your BigQuery project, typically by inheriting credentials from the Docker container's environment variables.

DAG Failure Notification

When a DAG fails, the pipeline sends a notification through a Discord webhook so the data engineering team is alerted immediately and can address the issue promptly.
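
A failure alert of this kind could look roughly like the sketch below. It is not the repository's implementation: the function names and message format are assumptions. It relies on two things that are standard behavior: Airflow passes a context dictionary (including `task_instance` and `exception`) to failure callbacks, and a Discord webhook accepts a JSON body with a `content` field of at most 2000 characters.

```python
import json
import urllib.request
from typing import Any, Dict, Mapping

DISCORD_CONTENT_LIMIT = 2000  # Discord rejects message content longer than this.


def build_failure_payload(context: Mapping[str, Any]) -> Dict[str, str]:
    """Turn an Airflow failure context into a Discord webhook payload."""
    task_instance = context["task_instance"]
    message = (
        f"DAG failed: {task_instance.dag_id}\n"
        f"Task: {task_instance.task_id}\n"
        f"Error: {context.get('exception')}"
    )
    return {"content": message[:DISCORD_CONTENT_LIMIT]}


def send_to_discord(
    webhook_url: str, payload: Mapping[str, str], timeout: float = 10.0
) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # An explicit User-Agent; the default urllib one may be blocked.
            "User-Agent": "airflow-failure-alert",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a DAG, a small wrapper that calls `send_to_discord(url, build_failure_payload(context))` can be assigned to `on_failure_callback` in `default_args`, so every task in the DAG reports failures the same way.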
