A comprehensive data engineering project analyzing passenger train performance across the Dutch railway network. This pipeline processes historical stop and service data to generate actionable insights into delays, cancellations, and platform changes.
- data/: Documentation on data sources and a detailed data dictionary.
- pipeline_code/: The core Databricks logic, organized into a Medallion architecture.
- visuals/: Screenshots and a demo video of the final performance dashboard.
We use a multi-layered approach to transform raw data into insights:
- Bronze (Raw): Raw CSV ingestion with schema evolution and basic sanitization.
- Silver (Cleaned): Data typing, cleaning, and enrichment. Includes derived on-time flags (a stop counts as on time when its delay is <= 5 min) and performance classification.
- Gold (Business): Optimized dimensional models (fact_stops, dim_station) and daily performance aggregations for reporting.
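As a rough illustration of the Silver-layer enrichment, the on-time flag and performance classification could look like the plain-Python sketch below. Only the 5-minute threshold comes from this README; the function names, the class labels (`cancelled`, `unknown`, `on_time`, `delayed`), and the handling of missing delays are assumptions made for the example, and the actual Databricks code may differ.

```python
from typing import Optional

# Threshold stated for the Silver layer: on time means delay <= 5 minutes.
ON_TIME_THRESHOLD_MIN = 5


def is_on_time(delay_minutes: Optional[float]) -> Optional[bool]:
    """Return True if the delay is within the threshold, None if the delay is unknown."""
    if delay_minutes is None:
        return None  # assumption: a missing delay yields no flag rather than False
    return delay_minutes <= ON_TIME_THRESHOLD_MIN


def classify_performance(delay_minutes: Optional[float], cancelled: bool = False) -> str:
    """Map a stop to a performance class. The labels here are hypothetical."""
    if cancelled:
        return "cancelled"
    if delay_minutes is None:
        return "unknown"
    if delay_minutes <= ON_TIME_THRESHOLD_MIN:
        return "on_time"
    return "delayed"
```

Keeping the threshold in a single constant means the flag and the classification cannot drift apart if the definition of "on time" changes.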
The pipeline feeds a dashboard that tracks KPIs like:
- Arrival/Departure On-Time %
- Cancellation Rates
- Platform Change Severity
- Peak Hour Performance (Morning vs. Evening Rush)
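To make the KPI definitions concrete, here is a minimal plain-Python sketch of a daily aggregation. The record fields (`service_date`, `arrival_delay_min`, `cancelled`), the rush-hour windows (07:00-09:00 and 16:00-18:00), and the choice to exclude cancelled stops from the on-time percentage are all assumptions for illustration; only the 5-minute on-time threshold is taken from this README.

```python
from collections import defaultdict

ON_TIME_THRESHOLD_MIN = 5  # threshold stated for the Silver layer

# Assumed rush windows (hour of day, end exclusive); the project may define them differently.
MORNING_RUSH = range(7, 9)
EVENING_RUSH = range(16, 18)


def rush_period(hour: int) -> str:
    """Bucket an hour of day into morning rush, evening rush, or off-peak."""
    if hour in MORNING_RUSH:
        return "morning_rush"
    if hour in EVENING_RUSH:
        return "evening_rush"
    return "off_peak"


def daily_kpis(stops):
    """Aggregate stop records (dicts with hypothetical field names) into per-day KPIs."""
    counts = defaultdict(lambda: {"total": 0, "cancelled": 0, "measured": 0, "on_time": 0})
    for stop in stops:
        day = counts[stop["service_date"]]
        day["total"] += 1
        if stop["cancelled"]:
            day["cancelled"] += 1
            continue  # assumption: cancelled stops do not count toward on-time %
        delay = stop["arrival_delay_min"]
        if delay is None:
            continue  # stops without a recorded delay are left out of on-time %
        day["measured"] += 1
        if delay <= ON_TIME_THRESHOLD_MIN:
            day["on_time"] += 1

    result = {}
    for date, c in counts.items():
        on_time_pct = 100.0 * c["on_time"] / c["measured"] if c["measured"] else None
        result[date] = {
            "arrival_on_time_pct": on_time_pct,
            "cancellation_rate_pct": 100.0 * c["cancelled"] / c["total"],
        }
    return result
```

Grouping the same records by `rush_period` instead of by date would give a morning-versus-evening comparison along the same lines.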
Check out the visuals folder for more breakdowns.
Data is curated from the NS API by Rijden de Treinen. You can find more details in how_to_get_data.md.