🚕 nyc-taxi-airflow-dbt-gcp

Project Overview

This repository hosts a robust and scalable Data Engineering ELT (Extract, Load, Transform) pipeline designed to ingest, stage, and transform publicly available trip record data from the NYC Taxi & Limousine Commission (TLC).

The architecture leverages best-in-class open-source tools—Apache Airflow for orchestration and dbt (Data Build Tool) for analytical transformations—all deployed using Docker and utilizing Google Cloud Platform (GCP) services for storage and warehousing.

The primary goal is to build a modern, incremental data warehouse in Google BigQuery suitable for downstream analytics and business intelligence (BI).

Architecture

The project follows a modular and event-driven design, ensuring high reliability and maintainability.

[Insert Architecture Diagram Here]
Conceptual diagram of the Data Pipeline flow.

Technology Stack

Category	Tool	Function
Orchestration	Apache Airflow	Schedules and manages the entire pipeline (DAGs).
Transformation	dbt (Data Build Tool)	Handles complex, incremental SQL transformations and modeling.
Cloud Platform	Google Cloud Platform (GCP)	Provides the core infrastructure (Storage & Data Warehouse).
Data Warehouse	Google BigQuery	Scalable, serverless, columnar data warehouse.
Storage/Staging	Google Cloud Storage (GCS)	Data Lake for staging raw Parquet files.
Extraction	Pandas	Python library used in Airflow for initial data reading/cleaning.
Containerization	Docker	Used to containerize Airflow and dbt environments.
Data Format	Parquet	Optimized columnar storage format for efficiency.
Alerting	Discord	Notification system for DAG failure alerts.

Data Pipeline Stages

The pipeline is executed in three main logical steps, managed by Airflow, which orchestrates the entire flow:

1. Extract & Stage (Airflow)

Data Source: NYC TLC Parquet files (publicly available on the web).
Process: An Airflow DAG is triggered to fetch the raw data.
- Pandas is used to read the Parquet files into memory, perform minimal schema validation, and ensure data quality.
- The cleaned raw data is written to Google Cloud Storage (GCS), serving as the raw data lake/staging area.

2. Load (Airflow & BigQuery)

Target: Dataset: Raw table in Google BigQuery.
Mechanism: Airflow triggers a BigQuery load job to move data from GCS into the BigQuery Raw table.
Incremental Logic: The loading process uses a MERGE statement, which is crucial for idempotent and efficient updates.
- It checks for data using a row_hash to only insert new records or update changed ones, ensuring data quality and reducing processing cost.

3. Transform (dbt)

Tool: dbt is executed incrementally via an Airflow operator (dbt run --incremental).
Transformation Logic: dbt runs a series of SQL models to transform the data:
1. Prep Models: Clean, standardize, and filter the raw data.
2. Data Warehouse (DW) Models: Apply business logic and join fact/dimension tables.
3. Mart Models: Create highly aggregated, optimized tables (e.g., dim_dates, fact_trips_hourly) specifically for BI tools.
Output: The transformed and modeled data is stored in the Dataset: Prep, DW, Mart tables in BigQuery, ready for consumption.

Deployment and Setup

Prerequisites

Docker & Docker Compose: Must be installed to run the containerized services.
Google Cloud Platform (GCP) Account: Required for BigQuery and GCS.
GCP Service Account: A JSON key file with the necessary permissions (BigQuery Data Editor, Storage Object Admin).

Local Deployment

Clone the repository:

git clone https://github.com/ramadiansyah/nyc-taxi-airflow-dbt-gcp.git
cd nyc-taxi-airflow-dbt-gcp

Configure Environment:
- Place your GCP Service Account JSON key file in the appropriate directory (e.g., dags/keys/gcp_key.json).
- Update the .env file with your GCP Project ID, GCS Bucket Name, and Discord Webhook URL.
Initialize & Run: Use Docker Compose to build the images and start the Airflow services:
```
docker-compose up --build -d
```
Airflow Access: Once services are running, access the Airflow UI at http://localhost:8080 (Default credentials: airflow/airflow).

Data Warehouse Configuration

Before running the DAG, ensure the following are configured in your Airflow Connections and GCP:

Airflow Connection: Create a Google Cloud connection named google_cloud_default using your Service Account JSON key.
BigQuery Datasets: Manually create the following datasets in BigQuery (or configure the DAG to create them):
- nyc_taxi_raw
- nyc_taxi_dw
dbt Profile: The dbt profile (profiles.yml) must be configured to connect to your BigQuery project, typically by inheriting credentials from the Docker container's environment variables.

DAG Failure Notification

In the event of a critical pipeline failure, a Discord webhook is utilized to immediately send a notification, ensuring the data engineering team is alerted and can address the issue promptly.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.dbt		.dbt
airflow		airflow
dbt/nyc_taxi		dbt/nyc_taxi
docs		docs
gcs		gcs
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚕 nyc-taxi-airflow-dbt-gcp

Project Overview

Architecture

Technology Stack

Data Pipeline Stages

1. Extract & Stage (Airflow)

2. Load (Airflow & BigQuery)

3. Transform (dbt)

Deployment and Setup

Prerequisites

Local Deployment

Data Warehouse Configuration

DAG Failure Notification

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚕 nyc-taxi-airflow-dbt-gcp

Project Overview

Architecture

Technology Stack

Data Pipeline Stages

1. Extract & Stage (Airflow)

2. Load (Airflow & BigQuery)

3. Transform (dbt)

Deployment and Setup

Prerequisites

Local Deployment

Data Warehouse Configuration

DAG Failure Notification

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages