This project demonstrates a simple ETL pipeline using Azure Databricks and PySpark. It reads NYC taxi data, performs basic cleaning and transformations, writes the result in Delta Lake format, and queries it for insights (sketched below).
## What it does

- Load CSV data
- Clean and transform using PySpark
- Write to and read from Delta Lake
- Visualize average trip distances
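A minimal sketch of that flow, assuming the CSV has already been uploaded to DBFS and carries the standard yellow-taxi columns (`trip_distance`, `passenger_count`); the notebook in `notebooks/` is the authoritative version:

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession that Databricks notebooks predefine.
# The input path and column names below are assumptions; adjust to your upload.
df = spark.read.csv("dbfs:/FileStore/nyc_taxi_sample.csv",
                    header=True, inferSchema=True)

# Basic cleaning: drop rows with nulls and keep only plausible trips.
clean = df.dropna().filter(F.col("trip_distance") > 0)

# Write to Delta Lake, then read it back.
clean.write.format("delta").mode("overwrite").save("/tmp/delta/nyc_taxi")
taxi = spark.read.format("delta").load("/tmp/delta/nyc_taxi")

# Query for an insight: average trip distance per passenger count.
(taxi.groupBy("passenger_count")
     .agg(F.avg("trip_distance").alias("avg_trip_distance"))
     .orderBy("passenger_count")
     .show())
```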
## Requirements

- Databricks (Community Edition or Azure)
- PySpark
- Delta Lake
## Project structure

```
databricks-demo-nyc-taxi/
├── notebooks/
│   └── nyc_taxi_etl_demo_simple.py
├── data/
│   ├── nyc_taxi_sample.csv
│   └── get_sample_data.sh
├── README.md
├── requirements.txt
└── LICENSE
```
## How to run

- Open the notebook in Databricks
- Attach it to a running cluster
- Upload `nyc_taxi_sample.csv` to DBFS (see the sanity check after this list)
- Run all cells
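If the file is uploaded through the Databricks UI, it typically lands under `/FileStore`; a quick sanity check before running the pipeline (the exact path depends on how you upload, so treat it as an assumption):

```python
# dbutils and display are predefined in Databricks notebooks.
display(dbutils.fs.ls("dbfs:/FileStore/"))

# Preview the first rows to confirm the upload and the header row.
spark.read.csv("dbfs:/FileStore/nyc_taxi_sample.csv", header=True).show(5)
```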
## Data source

NYC TLC Trip Record Data: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Download 10,000 lines of sample data:

```bash
./data/get_sample_data.sh
```
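For reference, a hypothetical Python equivalent of what `data/get_sample_data.sh` does: stream the first 10,000 lines of a TLC trip-record CSV into `data/nyc_taxi_sample.csv`. The URL below is a placeholder, not the script's actual source; take the real CSV link from the TLC page above.

```python
import itertools
import urllib.request

SOURCE_URL = "https://example.com/yellow_tripdata.csv"  # placeholder, not a real link

# Stream the response and keep only the first 10,000 lines (header included).
with urllib.request.urlopen(SOURCE_URL) as resp, \
        open("data/nyc_taxi_sample.csv", "wb") as out:
    for line in itertools.islice(resp, 10_000):
        out.write(line)
```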
## License

MIT License