This repository hosts a robust and scalable Data Engineering ELT (Extract, Load, Transform) pipeline designed to ingest, stage, and transform publicly available trip record data from the NYC Taxi & Limousine Commission (TLC).
The architecture leverages best-in-class open-source tools—Apache Airflow for orchestration and dbt (Data Build Tool) for analytical transformations—all deployed using Docker and utilizing Google Cloud Platform (GCP) services for storage and warehousing.
The primary goal is to build a modern, incremental data warehouse in Google BigQuery suitable for downstream analytics and business intelligence (BI).
The project follows a modular and event-driven design, ensuring high reliability and maintainability.
| [Insert Architecture Diagram Here] |
|---|
| Conceptual diagram of the Data Pipeline flow. |
| Category | Tool | Function |
|---|---|---|
| Orchestration | Apache Airflow | Schedules and manages the entire pipeline (DAGs). |
| Transformation | dbt (Data Build Tool) | Handles complex, incremental SQL transformations and modeling. |
| Cloud Platform | Google Cloud Platform (GCP) | Provides the core infrastructure (Storage & Data Warehouse). |
| Data Warehouse | Google BigQuery | Scalable, serverless, columnar data warehouse. |
| Storage/Staging | Google Cloud Storage (GCS) | Data Lake for staging raw Parquet files. |
| Extraction | Pandas | Python library used in Airflow for initial data reading/cleaning. |
| Containerization | Docker | Used to containerize Airflow and dbt environments. |
| Data Format | Parquet | Optimized columnar storage format for efficiency. |
| Alerting | Discord | Notification system for DAG failure alerts. |
The pipeline is executed in three main logical steps, managed by Airflow, which orchestrates the entire flow:
- Data Source: NYC TLC Parquet files (publicly available on the web).
- Process: An Airflow DAG is triggered to fetch the raw data.
- Pandas is used to read the Parquet files into memory, perform minimal schema validation, and ensure data quality.
- The cleaned raw data is written to Google Cloud Storage (GCS), serving as the raw data lake/staging area.
- Target:
Dataset: Rawtable in Google BigQuery. - Mechanism: Airflow triggers a BigQuery load job to move data from GCS into the BigQuery
Rawtable. - Incremental Logic: The loading process uses a
MERGEstatement, which is crucial for idempotent and efficient updates.- It checks for data using a
row_hashto only insert new records or update changed ones, ensuring data quality and reducing processing cost.
- It checks for data using a
- Tool:
dbtis executed incrementally via an Airflow operator (dbt run --incremental). - Transformation Logic: dbt runs a series of SQL models to transform the data:
- Prep Models: Clean, standardize, and filter the raw data.
- Data Warehouse (DW) Models: Apply business logic and join fact/dimension tables.
- Mart Models: Create highly aggregated, optimized tables (e.g.,
dim_dates,fact_trips_hourly) specifically for BI tools.
- Output: The transformed and modeled data is stored in the
Dataset: Prep, DW, Marttables in BigQuery, ready for consumption.
- Docker & Docker Compose: Must be installed to run the containerized services.
- Google Cloud Platform (GCP) Account: Required for BigQuery and GCS.
- GCP Service Account: A JSON key file with the necessary permissions (
BigQuery Data Editor,Storage Object Admin).
- Clone the repository:
git clone https://github.com/ramadiansyah/nyc-taxi-airflow-dbt-gcp.git cd nyc-taxi-airflow-dbt-gcp - Configure Environment:
- Place your GCP Service Account JSON key file in the appropriate directory (e.g.,
dags/keys/gcp_key.json). - Update the
.envfile with your GCP Project ID, GCS Bucket Name, and Discord Webhook URL.
- Place your GCP Service Account JSON key file in the appropriate directory (e.g.,
- Initialize & Run: Use Docker Compose to build the images and start the Airflow services:
docker-compose up --build -d
- Airflow Access: Once services are running, access the Airflow UI at
http://localhost:8080(Default credentials:airflow/airflow).
Before running the DAG, ensure the following are configured in your Airflow Connections and GCP:
- Airflow Connection: Create a Google Cloud connection named
google_cloud_defaultusing your Service Account JSON key. - BigQuery Datasets: Manually create the following datasets in BigQuery (or configure the DAG to create them):
nyc_taxi_rawnyc_taxi_dw
- dbt Profile: The dbt profile (
profiles.yml) must be configured to connect to your BigQuery project, typically by inheriting credentials from the Docker container's environment variables.
In the event of a critical pipeline failure, a Discord webhook is utilized to immediately send a notification, ensuring the data engineering team is alerted and can address the issue promptly.
