This project implements a fully containerized data engineering workflow for downloading, ingesting, transforming, validating, and preparing NYC Yellow Taxi Trip Data. It uses Apache Spark for computation, Apache Airflow for orchestration, MinIO as Amazon S3-compatible object storage, and DuckDB as an analytics warehouse. The entire environment is reproducible with Docker Compose and runs entirely locally.
The workflow simulates a modern data engineering architecture with dedicated layers:
- Local Layer – Raw TLC data downloaded directly to the local filesystem.
- Landing Layer (MinIO/S3) – Central storage for raw but structured data.
- Prepared Layer (MinIO/S3) – Cleaned, filtered and transformed datasets generated via Spark Pipeline.
- Warehouse Layer (DuckDB) – Analytical storage loaded from the Prepared layer.
- EDA Layer (JupyterLab) – Data exploration and downstream modeling using notebooks.
```
├── airflow
│   ├── dags
│   │   └── taxi_pipeline_dag.py
│   ├── Dockerfile
│   └── requirements.txt
├── config
│   └── storage_config.yaml
├── data
│   └── yellow_tripdata_2024-01.parquet
├── infra
│   └── docker
│       ├── notebook.Dockerfile
│       └── spark.Dockerfile
├── notebooks
│   ├── 01_raw_data_exploration.ipynb
│   └── 02_prepared_data_exploration.ipynb
├── spark_jobs
│   ├── ingest_landing.py
│   └── transform_prepared.py
├── warehouse
│   └── taxi.duckdb
├── README.md
├── docker-compose.yml
├── LICENSE
└── requirements.txt
```

- Trip data: NYC TLC Yellow Taxi Trip Records (monthly Parquet files).
- Zone reference data: NYC TLC Taxi Zone Lookup Table (CSV).
This project is configured and tested for years up to and including 2024 (e.g., 2024-01).
The TLC continues to publish newer months/years, but schemas may change over time (for example, additional columns may appear), so treat 2025+ as “use at your own risk” unless you update your transformations accordingly.
```bash
mkdir -p data

# Example month (2024-01)
curl -L "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet" -o data/yellow_tripdata_2024-01.parquet

# Taxi zones
curl -L "https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv" -o data/taxi_zone_lookup.csv
```

| Column | Type | Description |
|---|---|---|
| VendorID | int | TPEP provider code (vendor). |
| tpep_pickup_datetime | datetime | Trip meter start timestamp. |
| tpep_dropoff_datetime | datetime | Trip meter stop timestamp. |
| passenger_count | float/int | Passenger count (driver-entered). |
| trip_distance | float | Trip distance in miles (taximeter). |
| RatecodeID | float/int | Final rate code for the trip. |
| store_and_fwd_flag | string | “Store and forward” indicator. |
| PULocationID | int | Pickup TLC Taxi Zone ID. |
| DOLocationID | int | Drop-off TLC Taxi Zone ID. |
| payment_type | int | Payment type code. |
| fare_amount | float | Fare amount. |
| extra | float | Extras/surcharges (varies). |
| mta_tax | float | MTA tax. |
| tip_amount | float | Tip amount. |
| tolls_amount | float | Tolls amount. |
| improvement_surcharge | float | Improvement surcharge. |
| total_amount | float | Total amount charged. |
| congestion_surcharge | float | Congestion surcharge. |
| Airport_fee | float | Airport fee (may be null / not present in older years). |
Note: the TLC schema can vary slightly across years.
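If you pull newer months, a quick schema check can surface added or renamed columns before the Spark jobs run. A minimal sketch using `pyarrow` (the expected column set below is an illustrative subset, not the project's authoritative list):

```python
import pyarrow.parquet as pq

# Columns the transformations in this project rely on (illustrative subset).
EXPECTED = {
    "tpep_pickup_datetime", "tpep_dropoff_datetime", "passenger_count",
    "trip_distance", "PULocationID", "DOLocationID", "fare_amount", "total_amount",
}

def check_schema(path: str) -> None:
    # Reads only the Parquet footer metadata; no row data is loaded.
    actual = {field.name for field in pq.read_schema(path)}
    print(f"missing columns: {sorted(EXPECTED - actual) or 'none'}")
    print(f"unexpected/new columns: {sorted(actual - EXPECTED) or 'none'}")

check_schema("data/yellow_tripdata_2024-01.parquet")
```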
| Column | Type | Description |
|---|---|---|
| LocationID | int | Taxi Zone ID used by TLC trip files. |
| Borough | string | Borough name. |
| Zone | string | Zone name. |
| service_zone | string | Service zone label (e.g., “Yellow Zone”, “Boro Zone”). |
| Component | Responsibility |
|---|---|
| Airflow | Orchestrates the ETL pipeline end-to-end. |
| Spark | Performs ingestion and transformation into Landing and Prepared layers. |
| MinIO | S3-compatible storage for Landing and Prepared zones. |
| DuckDB | Analytical warehouse loaded from the Prepared layer. |
| JupyterLab | Exploration and EDA environment. |
| Docker Compose | Orchestrates and isolates the entire system. |
- Monthly trips:

  ```
  s3://<bucket>/landing/taxi/year=YYYY/month=MM/
      part-*.parquet
  ```

- Taxi zones:

  ```
  s3://<bucket>/landing/reference/taxi_zones/
      part-*.parquet
  ```

Prepared trips (enriched with zone attributes; partitioned by day):

```
s3://<bucket>/prepared/taxi/year=YYYY/month=MM/
    pickup_date=YYYY-MM-DD/
        part-*.parquet
```
Both paths are dynamically parameterized with year and month, which are provided by Airflow at runtime.
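For illustration, a minimal sketch of how these prefixes can be built from the year/month parameters (the helper functions and the `taxi-data` bucket name are hypothetical):

```python
def landing_prefix(bucket: str, year: int, month: int) -> str:
    # Hive-style partition folders; month is zero-padded to match month=MM.
    return f"s3a://{bucket}/landing/taxi/year={year}/month={month:02d}/"

def prepared_prefix(bucket: str, year: int, month: int) -> str:
    return f"s3a://{bucket}/prepared/taxi/year={year}/month={month:02d}/"

print(landing_prefix("taxi-data", 2024, 1))   # s3a://taxi-data/landing/taxi/year=2024/month=01/
print(prepared_prefix("taxi-data", 2024, 1))  # s3a://taxi-data/prepared/taxi/year=2024/month=01/
```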
- Airflow
  - Web UI: `http://localhost:8080`
  - Executes the `nyc_taxi_pipeline` DAG.
- Spark
  - Runs ingestion and transformation: `spark_jobs/ingest_landing.py` and `spark_jobs/transform_prepared.py`
- MinIO
  - S3 endpoint: `http://localhost:9000`
  - Console UI: `http://localhost:9001`
- JupyterLab
  - Available at: `http://localhost:8888`
  - Includes DuckDB and a local workspace.
The DAG is defined in `airflow/dags/taxi_pipeline_dag.py`.
- `download_taxi_data`
  - Downloads the TLC Parquet file for the given year/month into `data/`.
  - URL pattern: `https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_YYYY-MM.parquet`
  - Downloads the Taxi Zone Lookup CSV into `data/` (only once, not partitioned): `https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv`
- `ingest_to_landing`
  - Runs `ingest_landing.py` on Spark with `--year` and `--month`.
  - Reads local Parquet from `data/`.
  - Writes raw structured data into MinIO Landing:
    - Taxi zones (only once, not partitioned): `s3a://<bucket>/landing/reference/taxi_zones/`
    - Monthly trips: `s3a://<bucket>/landing/taxi/year=YYYY/month=MM/`
- `transform_to_prepared`
  - Runs `transform_prepared.py` on Spark with `--year` and `--month`.
  - Reads Landing data for the selected month.
  - Cleans and enriches the data.
  - Joins the Taxi Zone Lookup Table twice:
    - `PULocationID` → pickup zone attributes (`PU_*`)
    - `DOLocationID` → drop-off zone attributes (`DO_*`)
  - Writes the output to the Prepared layer (partitioned by `pickup_date`): `s3a://<bucket>/prepared/taxi/year=YYYY/month=MM/`
- `load_to_duckdb`
  - Uses DuckDB to read the Prepared Parquet for the selected month.
  - Creates the schema and table if needed:

    ```sql
    CREATE SCHEMA IF NOT EXISTS taxi.taxi;
    CREATE OR REPLACE TABLE taxi.taxi.trips_prepared AS
    SELECT * FROM read_parquet('<prepared_glob>');
    ```

  - The `taxi.taxi.trips_prepared` table is overwritten on each pipeline run; after a run, it contains only the data for the triggered year/month.
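For reference, a minimal sketch of what the `load_to_duckdb` step above could look like in Python (the `taxi-data` bucket, httpfs/S3 settings, and glob pattern are assumptions; the actual task code may differ):

```python
import duckdb

# Hypothetical values; in the real task they come from the Airflow run config and .env.
year, month = 2024, 1
bucket = "taxi-data"
prepared_glob = f"s3://{bucket}/prepared/taxi/year={year}/month={month:02d}/*/*.parquet"

con = duckdb.connect("warehouse/taxi.duckdb")

# httpfs lets DuckDB read Parquet straight from the MinIO S3 endpoint.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint='minio:9000'")
con.execute("SET s3_url_style='path'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_access_key_id='...'")      # from AWS_ACCESS_KEY_ID
con.execute("SET s3_secret_access_key='...'")  # from AWS_SECRET_ACCESS_KEY

con.execute("CREATE SCHEMA IF NOT EXISTS taxi.taxi")
# hive_partitioning=true recovers the pickup_date partition column from the paths.
con.execute(f"""
    CREATE OR REPLACE TABLE taxi.taxi.trips_prepared AS
    SELECT * FROM read_parquet('{prepared_glob}', hive_partitioning=true)
""")
con.close()
```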
The DAG is manual only (`schedule=None`).

In the Airflow UI:

- Open the `nyc_taxi_pipeline` DAG.
- Click Trigger DAG.
- Provide the run configuration as JSON:

  ```json
  {
    "year": 2024,
    "month": 1
  }
  ```

If year/month are omitted, the DAG falls back to its default params (`year=2024`, `month=1`).
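As a rough sketch of how this fallback can work (not the project's actual DAG code; the task body and start date are illustrative):

```python
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator

def download_taxi_data(**context):
    # The run configuration (dag_run.conf) overrides these params when provided.
    year = context["params"]["year"]
    month = context["params"]["month"]
    print(f"Downloading yellow_tripdata_{year}-{month:02d}.parquet")

with DAG(
    dag_id="nyc_taxi_pipeline",
    schedule=None,                      # manual trigger only
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
    params={"year": 2024, "month": 1},  # defaults when no JSON config is given
) as dag:
    PythonOperator(task_id="download_taxi_data", python_callable=download_taxi_data)
```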
`spark_jobs/ingest_landing.py` responsibilities:

- Ingest monthly trips (local Parquet) → `landing/taxi/year=YYYY/month=MM/`
- Ingest taxi zones (local CSV) → `landing/reference/taxi_zones/`
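A condensed PySpark sketch of such an ingestion step (the MinIO endpoint settings, `taxi-data` bucket, and argument handling are assumptions rather than the project's exact code):

```python
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument("--year", type=int, required=True)
parser.add_argument("--month", type=int, required=True)
args = parser.parse_args()

spark = (
    SparkSession.builder.appName("ingest_landing")
    # s3a:// access to MinIO; credentials are normally picked up from the environment.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

bucket = "taxi-data"  # hypothetical; the real value comes from S3_BUCKET_NAME

# Monthly trips: local Parquet -> Landing (partition folders by year/month).
trips = spark.read.parquet(f"data/yellow_tripdata_{args.year}-{args.month:02d}.parquet")
trips.write.mode("overwrite").parquet(
    f"s3a://{bucket}/landing/taxi/year={args.year}/month={args.month:02d}/"
)

# Taxi zones: local CSV -> Landing (written once, not partitioned).
zones = spark.read.option("header", "true").csv("data/taxi_zone_lookup.csv")
zones.write.mode("overwrite").parquet(f"s3a://{bucket}/landing/reference/taxi_zones/")
```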
`spark_jobs/transform_prepared.py` responsibilities:

- Read monthly trips from Landing.
- Apply basic filters (e.g., negative distances, invalid timestamps).
- Derive: `pickup_date`, `pickup_hour`, `dropoff_date`, `trip_duration_min`, `avg_mph`
- Join taxi zones twice to add:
  - `PU_Borough`, `PU_Zone`, `PU_service_zone`
  - `DO_Borough`, `DO_Zone`, `DO_service_zone`
- Write Prepared data partitioned by `pickup_date`.
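Roughly, the derivations and the double zone join could look like this in PySpark (a sketch under the same assumptions as the ingestion example above, not the exact job code):

```python
import argparse
from pyspark.sql import SparkSession, functions as F

parser = argparse.ArgumentParser()
parser.add_argument("--year", type=int, required=True)
parser.add_argument("--month", type=int, required=True)
args = parser.parse_args()

spark = SparkSession.builder.appName("transform_prepared").getOrCreate()
bucket = "taxi-data"  # hypothetical; the real value comes from S3_BUCKET_NAME

trips = spark.read.parquet(f"s3a://{bucket}/landing/taxi/year={args.year}/month={args.month:02d}/")
zones = spark.read.parquet(f"s3a://{bucket}/landing/reference/taxi_zones/")

trips = (
    trips
    # Basic sanity filters: positive distance, drop-off after pickup.
    .filter((F.col("trip_distance") > 0) &
            (F.col("tpep_dropoff_datetime") > F.col("tpep_pickup_datetime")))
    .withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
    .withColumn("pickup_hour", F.hour("tpep_pickup_datetime"))
    .withColumn("dropoff_date", F.to_date("tpep_dropoff_datetime"))
    .withColumn("trip_duration_min",
                (F.unix_timestamp("tpep_dropoff_datetime")
                 - F.unix_timestamp("tpep_pickup_datetime")) / 60.0)
    .withColumn("avg_mph", F.col("trip_distance") / (F.col("trip_duration_min") / 60.0))
)

# Join the zone lookup twice: once for pickup, once for drop-off.
pu = zones.select(F.col("LocationID").cast("int").alias("PULocationID"),
                  F.col("Borough").alias("PU_Borough"),
                  F.col("Zone").alias("PU_Zone"),
                  F.col("service_zone").alias("PU_service_zone"))
do = zones.select(F.col("LocationID").cast("int").alias("DOLocationID"),
                  F.col("Borough").alias("DO_Borough"),
                  F.col("Zone").alias("DO_Zone"),
                  F.col("service_zone").alias("DO_service_zone"))
prepared = trips.join(pu, "PULocationID", "left").join(do, "DOLocationID", "left")

prepared.write.mode("overwrite").partitionBy("pickup_date").parquet(
    f"s3a://{bucket}/prepared/taxi/year={args.year}/month={args.month:02d}/"
)
```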
The Prepared layer (and thus `taxi.taxi.trips_prepared`) contains all raw trip fields (see the Dataset section) plus the following derived and enriched columns:
| Column | Description |
|---|---|
| pickup_date | Date derived from tpep_pickup_datetime. |
| pickup_hour | Hour-of-day derived from tpep_pickup_datetime. |
| dropoff_date | Date derived from tpep_dropoff_datetime. |
| trip_duration_min | Trip duration in minutes. |
| avg_mph | Average trip speed in mph (distance / duration). |
| PU_Borough | Borough for pickup zone. |
| PU_Zone | Zone name for pickup zone. |
| PU_service_zone | Service zone label for pickup zone. |
| DO_Borough | Borough for drop-off zone. |
| DO_Zone | Zone name for drop-off zone. |
| DO_service_zone | Service zone label for drop-off zone. |
This project includes two Jupyter notebooks that support exploratory analysis of both the raw TLC data and the prepared data.
File: `notebooks/01_raw_data_exploration.ipynb`

- Inspects and validates the raw TLC Parquet file in `data/`.
- Uses pandas, including pandas' built-in plotting (`DataFrame.plot` / `Series.plot`).
- Typical checks:
  - schema and dtypes
  - missing values
  - negative or unexpected values
  - outliers and distribution profiling
  - temporal patterns (e.g., by day/hour)
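A few of these checks in pandas might look like the following (an illustrative sketch, assuming `pyarrow` is available for `read_parquet`; these are not the notebook's exact cells):

```python
import pandas as pd

df = pd.read_parquet("data/yellow_tripdata_2024-01.parquet")

# Schema and missing values
print(df.dtypes)
print(df.isna().sum().sort_values(ascending=False).head(10))

# Negative or suspicious values
print((df["trip_distance"] <= 0).mean(), "share of non-positive distances")
print((df["total_amount"] < 0).mean(), "share of negative totals")

# Temporal pattern: trips per pickup hour, using pandas built-in plotting
df["tpep_pickup_datetime"].dt.hour.value_counts().sort_index().plot(kind="bar")
```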
File: `notebooks/02_prepared_data_exploration.ipynb`

- Analyzes curated data after the Spark transformation and DuckDB load.
- Connects to `warehouse/taxi.duckdb` and queries `taxi.taxi.trips_prepared`.
- Includes SQL-based aggregations.
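For instance, a SQL aggregation against the warehouse from a notebook could look like this (an illustrative query, not one taken from the notebook):

```python
import duckdb

con = duckdb.connect("warehouse/taxi.duckdb", read_only=True)

# Trip count and average fare per pickup borough, returned as a pandas DataFrame.
df = con.execute("""
    SELECT PU_Borough,
           COUNT(*)         AS trips,
           AVG(fare_amount) AS avg_fare
    FROM taxi.taxi.trips_prepared
    GROUP BY PU_Borough
    ORDER BY trips DESC
""").df()
print(df)
con.close()
```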
All secrets and configuration are driven through a `.env` file in the project root.

The repository ships an `.env.example` file as a starting point. You must create your own `.env` and fill in values, including `AIRFLOW_FERNET_KEY` (which is not committed).
```
MINIO_ROOT_USER=...
MINIO_ROOT_PASSWORD=...

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
S3_BUCKET_NAME=...
S3_ENDPOINT_URL=http://minio:9000

AIRFLOW_ADMIN_USER=...
AIRFLOW_ADMIN_PASSWORD=...
AIRFLOW_ADMIN_EMAIL=...
AIRFLOW_FERNET_KEY=...  # must be generated locally (see below)

TZ=UTC
```
The Fernet key is critical for Airflow to securely store connections and variables. Please follow the next section to generate it and set it in your `.env`.

Airflow uses a Fernet key to encrypt sensitive values (e.g., connections, variables). You must generate a key and set `AIRFLOW_FERNET_KEY` in your `.env`.
From the project root:

```bash
docker compose run --rm airflow \
  python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

Copy the printed value and set it in `.env`:

```
AIRFLOW_FERNET_KEY=<paste-generated-key-here>
```
Alternatively, if your host Python has `cryptography` installed:

```bash
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```

Clone the repository and create your `.env`:

```bash
git clone https://github.com/Yarroudh/dataengineering
cd dataengineering
cp .env.example .env
```

Edit `.env` and fill in:

- `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD`
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`
- `S3_BUCKET_NAME`, `S3_ENDPOINT_URL`
- `AIRFLOW_ADMIN_USER`, `AIRFLOW_ADMIN_PASSWORD`, `AIRFLOW_ADMIN_EMAIL`
- Generate and set `AIRFLOW_FERNET_KEY` (see section 12)

Build and start the services:

```bash
docker compose build
docker compose up -d
```

| Service | URL |
|---|---|
| Airflow | http://localhost:8080 |
| MinIO Console | http://localhost:9001 |
| JupyterLab | http://localhost:8888 |
- Open Airflow at `http://localhost:8080`.
- Log in using the admin credentials from `.env`.
- Open the `nyc_taxi_pipeline` DAG.
- Click Trigger DAG.
- Provide year/month, for example:

  ```json
  { "year": 2024, "month": 1 }
  ```

- Monitor the tasks:
  - `download_taxi_data`
  - `ingest_to_landing`
  - `transform_to_prepared`
  - `load_to_duckdb`
After a successful run, DuckDB (`warehouse/taxi.duckdb`) contains `taxi.taxi.trips_prepared` for the selected month.
- Open `http://localhost:8888`.
- Run:
  - `notebooks/01_raw_data_exploration.ipynb` (raw profiling)
  - `notebooks/02_prepared_data_exploration.ipynb` (prepared data)
```bash
docker compose logs <service>
```

- Ensure Airflow dependencies include `duckdb` and `requests` in `airflow/requirements.txt`.
- Rebuild Airflow:

  ```bash
  docker compose build airflow
  docker compose up -d airflow
  ```
- Verify `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in `.env`.
- Confirm the correct bucket name in `S3_BUCKET_NAME`.
- Ensure `S3_ENDPOINT_URL` is `http://minio:9000` (inside the Docker network).
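To isolate credential or endpoint problems from Spark itself, a small `boto3` check run inside the Docker network can help (a hedged sketch; `boto3` may need to be installed in the container you run it from):

```python
import os
import boto3

# Uses the same variables the pipeline reads from .env.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],  # e.g. http://minio:9000
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

bucket = os.environ["S3_BUCKET_NAME"]
print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# List a few keys from the Landing prefix to confirm read access.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="landing/", MaxKeys=5)
print([obj["Key"] for obj in resp.get("Contents", [])])
```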
- Add data quality checks (e.g., Great Expectations / Deequ).
- Implement incremental and multi-month ingestion.
- Add CI workflows.
- Introduce dataset versioning and data catalog integration.
This project is distributed under the terms defined in the LICENSE file.
- NYC Taxi & Limousine Commission for the dataset.
- Apache Spark, Apache Airflow, MinIO, and DuckDB communities for their open-source tools.


