dsai4-m2-t2-citycycle
End-to-end ELT pipeline for the London Bicycle Sharing dataset, built for the CityCycle operations team to solve the bike rebalancing problem using data engineering, ML forecasting, and interactive dashboards.
- Business Problem
- Solution Overview
- Architecture
- Tech Stack
- Repository Structure
- Getting Started
- Mock Data Strategy (Free Tier Protection)
- Pipeline Walkthrough
- Key Findings (Live Data)
- Exogenous Factors Affecting Ridership
- Risks & Mitigations
- Future Improvements & Next Steps
- Contributing
London's CityCycle bike-sharing network operates 795 active docking stations across the city, processing millions of rides annually. The core operational challenge is bike rebalancing: stations run empty (stranded demand) or overflow (no docks to return), leading to:
- Lost revenue from unfulfilled rentals
- Increased operational costs for manual rebalancing crews
- Poor customer experience and negative NPS
- Inefficient fleet utilisation across the network
Goal: Build an intelligent, data-driven pipeline that ingests ride history, detects imbalance patterns, forecasts demand per station, and visualises actionable rebalancing recommendations in near real-time.
BigQuery Public Data → Meltano Ingest → BQ Raw → dbt Transform
→ Great Expectations Quality Gate → ML Demand Forecast
→ Streamlit Dashboard + Looker Studio Report
(CI/CD orchestrated by GitHub Actions · 5 jobs · push-triggered)
The pipeline follows a medallion-style architecture:
- Bronze (`raw.*`): Raw tables ingested from the BigQuery public dataset via Meltano
- Silver (`staging.*`): Cleaned, typed, validated tables via dbt staging models
- Gold (`marts.*`): Star schema fact/dimension tables for analytics and ML
| Layer | Tool | Purpose |
|---|---|---|
| Ingestion | Meltano (tap-bigquery → target-bigquery) | Singer-protocol EL from source to raw |
| Warehouse | Google BigQuery | Cloud data warehouse, star schema |
| Transform | dbt Core | SQL-based ELT, lineage, testing |
| Quality | Great Expectations | Expectation suites, checkpoints, data docs |
| Analysis | Python / pandas / scikit-learn | EDA, feature engineering, ML |
| Dashboard | Streamlit | Interactive ops dashboard + geospatial map |
| BI Reporting | Looker Studio | Executive KPI report (BQ connector) |
dsai4-m2-t2-citycycle/
├── .github/
│ └── workflows/
│ └── ci.yml # GitHub Actions: lint, mock-data, dbt-compile, train-model, notebook
├── ingestion/
│ ├── meltano.yml # Meltano project config (tap-bigquery → target-bigquery)
│ ├── load_mock.py # Python loader: mock CSV → BigQuery (dry-run + live)
│ ├── load_live_stations.py # One-time loader: stations from BQ public dataset
│ └── bq_cost_guard.py # Query cost guard: dry-run estimates + monthly budget tracking
├── transform/
│ ├── dbt_project.yml # dbt project config
│ ├── profiles_template.yml # profiles.yml template (DO NOT commit real profiles.yml)
│ ├── models/
│ │ ├── staging/
│ │ │ ├── stg_cycle_hire.sql # Clean + type raw ride data
│ │ │ ├── stg_cycle_stations.sql # Clean stations, add zone + capacity_tier
│ │ │ └── _staging.yml # 25 schema tests
│ │ ├── intermediate/
│ │ │ ├── int_rides_enriched.sql # Join rides + stations, add flags
│ │ │ └── int_station_daily_stats.sql # Daily imbalance per station
│ │ └── marts/
│ │ ├── dim_stations.sql # Station dimension with rebalancing priority
│ │ ├── dim_date.sql # Date spine 2015–2025
│ │ ├── fact_rides.sql # 32.3M rows, partitioned by hire_date
│ │ └── _marts.yml # 31 schema tests (56 PASS · 0 ERROR · 3 WARN in last run)
│ ├── macros/
│ │ └── generate_surrogate_key.sql
│ └── tests/
│ └── assert_ride_duration_positive.sql
├── quality/
│ ├── checkpoints/
│ │ └── post_ingest.yml # GE checkpoint config
│ ├── expectations/
│ │ └── suites/
│ │ ├── raw_cycle_hire.json
│ │ └── fact_rides.json
│ ├── run_ge_checks.py # 34 custom SQL checks: 30 PASS · 4 WARN · 0 FAIL
│ └── ge_results.json # Last run results (evidence)
├── orchestration/
│ ├── workspace.yaml # Dagster scaffold (reference only — CI uses GitHub Actions)
│ ├── assets/
│ │ ├── ingestion_assets.py
│ │ ├── transform_assets.py
│ │ └── quality_assets.py
│ └── jobs/
│ └── citycycle_pipeline_job.py
├── analysis/
│ └── notebooks/
│ ├── 01_eda_mock_data.ipynb # Initial EDA on mock data
│ └── 03_bq_eda_live_data.ipynb # Live BQ EDA via SQLAlchemy (32M rows)
├── ml/
│ └── models/
│ └── train_demand_model.py # 3-model comparison: Linear Reg · Random Forest · XGBoost
├── dashboard/
│ ├── app.py # Streamlit entry point
│ ├── pages/
│ │ ├── 01_overview.py # KPIs + daily trend + hourly demand
│ │ ├── 02_station_map.py # pydeck 3D + folium detailed map
│ │ ├── 03_rebalancing.py # Intervention list + crew runs estimate
│ │ ├── 04_forecast.py # 24h XGBoost demand forecast
│ │ └── 05_scenario.py # Guided scenario planner — corridor + dispatch
│ └── utils/
│ ├── bq_client.py # BQ connection via cost guard
│ └── mock_data_generator.py # Synthetic data generator (CI-safe)
├── data/
│ └── mock/
│ ├── cycle_hire_mock.csv # 10K synthetic rides (CI + dev)
│ └── cycle_stations_mock.csv # 795 station records
├── docs/
│ └── diagrams/
│ └── dataflow_diagram.png # Architecture diagram
├── .env.example # Template for env vars (no secrets)
├── .gitignore
├── requirements.txt
└── README.md
- Python 3.10+
- Google Cloud account with BigQuery access
- `gcloud` CLI authenticated
- Node.js 18+ (for pptxgenjs, optional)
```bash
git clone https://github.com/YOUR_ORG/dsai4-m2-t2-citycycle.git
cd dsai4-m2-t2-citycycle
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env — add your GCP project ID, BQ dataset names, etc.
# NEVER commit .env to Git
```

Before touching BigQuery's live data, validate the full pipeline with local mock data:
```bash
# Generate mock data
python dashboard/utils/mock_data_generator.py

# Load mock CSV into BigQuery (raw schema)
python ingestion/load_mock.py --mode=mock

# Run dbt transformations
cd transform && dbt run --target dev

# Run quality checks
python quality/run_ge_checks.py

# Launch dashboard
streamlit run dashboard/app.py
```

Once validated on mock data, switch to live ingestion:
```bash
# Meltano ingest from BQ public dataset
cd ingestion && meltano run tap-bigquery target-bigquery

# Then continue with dbt + GE as above
# See .github/workflows/ci.yml for the full CI pipeline
```

BigQuery's free tier provides 1 TB of query processing per month. The `cycle_hire` table has 83 million rows. A single unguarded `SELECT *` could consume the entire monthly quota instantly.
| Risk | Mitigation |
|---|---|
| Full-table scan on `cycle_hire` | `LIMIT` clauses on all dev queries; partitioned by `hire_date` |
| Accidental `SELECT *` | dbt `+limit` macro in dev profile; BQ slot quota set |
| Exceeding 1 TB free tier | Dry-run cost estimates before every query; budget alert at 80% |
| Development iteration cost | All development runs against data/mock/ CSV files |
| CI/CD test cost | GitHub Actions uses mock data only; no live BQ calls in CI |
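The budget-tracking half of `ingestion/bq_cost_guard.py` can be sketched in plain Python. The class name, thresholds, and return values below are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch of a query budget guard (the real logic lives in
# ingestion/bq_cost_guard.py; names and thresholds here are assumptions).
FREE_TIER_BYTES = 1024**4  # 1 TiB monthly free tier
ALERT_THRESHOLD = 0.80     # raise an alert at 80% of budget

class BudgetGuard:
    def __init__(self, free_tier_bytes: int = FREE_TIER_BYTES):
        self.free_tier_bytes = free_tier_bytes
        self.bytes_used = 0

    def check(self, estimated_bytes: int) -> str:
        """Decide whether a query may run, given its dry-run byte estimate."""
        projected = self.bytes_used + estimated_bytes
        if projected > self.free_tier_bytes:
            return "BLOCK"  # would exceed the monthly free tier
        if projected > self.free_tier_bytes * ALERT_THRESHOLD:
            return "WARN"   # over 80% of budget
        return "OK"

    def record(self, actual_bytes: int) -> None:
        """Accumulate actual bytes processed after a query runs."""
        self.bytes_used += actual_bytes

guard = BudgetGuard()
guard.record(700 * 1024**3)        # 700 GiB already used this month
print(guard.check(200 * 1024**3))  # 900 GiB projected -> WARN
```

In practice the estimate would come from a BigQuery dry run; this sketch only shows the accounting side.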
The mock data mirrors the exact schema of the public BigQuery tables:
```
cycle_hire_mock.csv     → bike_id, rental_id, duration, start_date,
                          start_station_id, start_station_name,
                          end_date, end_station_id, end_station_name

cycle_stations_mock.csv → id, install_date, installed, latitude,
                          locked, longitude, name, nbdocks,
                          temporary, terminal_name
```
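A generator for this schema can be sketched as follows. The real generator lives in `dashboard/utils/mock_data_generator.py`; the random ranges and base date below are illustrative assumptions (station name columns are omitted for brevity):

```python
# Illustrative mock ride generator matching the cycle_hire schema above
# (not the actual dashboard/utils/mock_data_generator.py logic).
import csv
import random
from datetime import datetime, timedelta

def generate_mock_rides(n_rides: int, n_stations: int = 795, seed: int = 42) -> list[dict]:
    random.seed(seed)
    base = datetime(2022, 7, 1)
    rows = []
    for rental_id in range(1, n_rides + 1):
        start = base + timedelta(minutes=random.randint(0, 60 * 24 * 30))
        duration = random.randint(60, 86400)  # matches the staging filter bounds
        rows.append({
            "rental_id": rental_id,
            "bike_id": random.randint(1, 12000),
            "duration": duration,
            "start_date": start.isoformat(sep=" "),
            "end_date": (start + timedelta(seconds=duration)).isoformat(sep=" "),
            "start_station_id": random.randint(1, n_stations),
            "end_station_id": random.randint(1, n_stations),
        })
    return rows

rides = generate_mock_rides(10_000)
with open("cycle_hire_mock.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rides[0].keys())
    writer.writeheader()
    writer.writerows(rides)
```

Keeping the mock durations inside the 60–86400 second bounds means the synthetic rows survive the staging filters, so dbt tests behave the same on mock and live data.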
Meltano uses the Singer protocol (tap → target) to extract data from BigQuery and load it into the raw dataset.
- tap-bigquery: Reads from `bigquery-public-data.london_bicycles`
- target-bigquery: Writes to your project's `raw` dataset
- Supports full refresh and incremental loads (state-based on `start_date`)
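A sketch of what `ingestion/meltano.yml` might contain, following the standard Meltano project layout; the variant and stream settings shown are assumptions:

```yaml
# Sketch of a Meltano project config (the real file is ingestion/meltano.yml;
# plugin variant and stream settings here are illustrative assumptions).
version: 1
default_environment: dev
plugins:
  extractors:
    - name: tap-bigquery
      config:
        streams:
          - name: cycle_hire
            table: "bigquery-public-data.london_bicycles.cycle_hire"
            datetime_key: start_date   # drives incremental state
  loaders:
    - name: target-bigquery
      config:
        project: "${GCP_PROJECT_ID}"
        dataset: raw
```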
```bash
meltano run tap-bigquery target-bigquery
```

Star schema optimised for ride analytics and rebalancing queries:
Fact table:
- `fact_rides` — one row per ride: duration, start/end station FK, date FK, hour, day-of-week, imbalance signals

Dimension tables:
- `dim_stations` — station metadata: name, location (lat/lon), dock capacity, zone, rebalancing priority tier
- `dim_date` — date spine 2015–2025: year, month, week, is_weekend, is_holiday (UK bank holidays 2023–2024)
raw.cycle_hire
└── stg_cycle_hire (cast types, rename columns, parse timestamps, filter bad rows)
└── int_rides_enriched (join stations, add peak_hour_flag, duration_band, time_period)
└── fact_rides (final fact table, join imbalance signals, rolling_7d_avg)
raw.cycle_stations
└── stg_cycle_stations (clean nulls, add zone via lat/lon bounding boxes, capacity_tier)
└── dim_stations (final dimension, all-time imbalance stats, rebalancing_priority)
Materialisation strategy:
- `staging/` and `intermediate/` → views (zero storage cost, always fresh)
- `marts/` → tables (materialised for fast dashboard queries)
- `fact_rides` → additionally partitioned by `hire_date` and clustered by `start_station_id`, `end_station_id` for cost-efficient rebalancing queries
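In dbt, these choices map to model configs. A sketch of the config block that `fact_rides.sql` might carry, using dbt-bigquery partitioning syntax (the exact settings in the repo may differ):

```sql
-- Sketch of the model config for transform/models/marts/fact_rides.sql
-- (dbt-bigquery partition/cluster syntax; exact values are assumptions).
{{
  config(
    materialized = 'table',
    partition_by = {'field': 'hire_date', 'data_type': 'date'},
    cluster_by   = ['start_station_id', 'end_station_id'],
  )
}}

select * from {{ ref('int_rides_enriched') }}
```

Partitioning by `hire_date` means dashboard queries filtered to a date range only scan the matching partitions, which is what keeps rebalancing queries inside the free tier.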
Derived columns generated in dbt (selected key fields):
| Field | Layer | Formula |
|---|---|---|
| `duration_minutes` | intermediate | `duration_seconds / 60.0` |
| `peak_hour_flag` | intermediate | 1 if `start_hour IN (7, 8, 17, 18)` else 0 |
| `duration_band` | intermediate | short <10 min · medium 10–30 · long 30–60 · extended >60 |
| `time_period` | intermediate | am_peak · pm_peak · midday · evening · night |
| `is_round_trip` | intermediate | TRUE if `start_station_id = end_station_id` |
| `net_flow` | intermediate | `total_departures - total_arrivals` per station per day |
| `imbalance_score` | intermediate | `ABS(net_flow) / MAX(departures + arrivals, 1)` — range 0–1 |
| `is_imbalanced` | intermediate | TRUE if `imbalance_score > 0.20` |
| `rebalancing_priority` | marts | CRITICAL ≥0.25 · HIGH ≥0.18 · MEDIUM ≥0.10 · LOW <0.10 |
| `rolling_7d_avg` | marts | 7-day rolling average departures per station — ML feature |
| `ride_sk` / `station_sk` | marts | Surrogate keys via `dbt_utils.generate_surrogate_key` |
dbt tests: 56 PASS · 0 ERROR · 3 WARN (intentional severity: warn on nullable FK fields)
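The banding formulas above can be mirrored in plain Python for quick sanity checks. This is an illustrative translation, not the dbt SQL itself; the exact boundary handling (inclusive vs exclusive) is an assumption:

```python
# Python mirror of the duration_band / time_period formulas (illustrative;
# the authoritative logic is the dbt SQL in int_rides_enriched.sql).
def duration_band(minutes: float) -> str:
    if minutes < 10:
        return "short"
    if minutes <= 30:
        return "medium"
    if minutes <= 60:
        return "long"
    return "extended"

def time_period(hour: int) -> str:
    # am_peak (07-09) and pm_peak (17-19) per the formula table;
    # boundary choices here are assumptions.
    if 7 <= hour < 9:
        return "am_peak"
    if 17 <= hour < 19:
        return "pm_peak"
    if 9 <= hour < 17:
        return "midday"
    if 19 <= hour < 23:
        return "evening"
    return "night"

print(duration_band(21.8), time_period(8))  # medium am_peak
```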
Two checkpoint stages:
Post-ingest checkpoint (raw.*):
- `rental_id` not null, unique
- `start_date` > '2010-01-01'
- `duration` between 60 and 86400 seconds
- `start_station_id` in valid station list
Post-transform checkpoint (fact_rides, dim_stations):
- No orphan station FK references
- `duration_minutes` between 1 and 1440
- `start_station_is_imbalanced` only TRUE/FALSE
- Null rate < 5% on all key columns
Results: 30 PASS · 4 WARN · 0 FAIL. Results published as HTML data docs.
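One of the check types above, the null-rate gate, can be sketched in pure Python. The real checks in `quality/run_ge_checks.py` run as SQL against BigQuery; this function is only illustrative:

```python
# Illustrative null-rate check in the spirit of quality/run_ge_checks.py
# (the real checks run as SQL against BigQuery).
def null_rate_check(rows: list[dict], column: str, max_rate: float = 0.05) -> dict:
    """PASS if the share of nulls in `column` stays below `max_rate`."""
    nulls = sum(1 for r in rows if r.get(column) is None)
    rate = nulls / len(rows) if rows else 0.0
    return {
        "column": column,
        "null_rate": rate,
        "status": "PASS" if rate < max_rate else "FAIL",
    }

sample = [
    {"end_station_id": 101},
    {"end_station_id": None},   # lost/unreturned bike
    {"end_station_id": 102},
    {"end_station_id": 103},
]
print(null_rate_check(sample, "end_station_id"))
# 1 null out of 4 rows -> 25% null rate -> FAIL against the 5% threshold
```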
- `01_eda_mock_data.ipynb` — initial pipeline validation on synthetic data
- `03_bq_eda_live_data.ipynb` — full EDA on 32M rides via SQLAlchemy + BigQuery
Key findings from EDA:
- Monthly and hourly ride trends with COVID-19 signal
- Weekday double-peak confirmed at 08:00 and 17:00–18:00
- K-Means k=3 customer segmentation: Leisure 53% · Casual 32% · Commuter 15%
- Station-level imbalance ranking: 3 CRITICAL, 27 HIGH priority stations identified
- Features: `hour`, `day_of_week`, `is_weekend`, `is_holiday`, `season`, `start_station_id`, `rolling_7d_avg`
- Target: hourly departures per station
- Models tested: Linear Regression (baseline) · Random Forest · XGBoost
- Split: 80/20 train/test · ~10.4M feature rows · no cross-validation
- Best model: XGBoost — RMSE 2.422 · MAE 1.508 · R² 0.488
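The comparison metrics can be computed without any ML library. A minimal sketch of RMSE, MAE, and R² over predicted hourly departures (the real evaluation lives in `ml/models/train_demand_model.py`; the sample numbers here are made up):

```python
# Minimal evaluation-metric sketch for the 3-model comparison
# (illustrative; the real code is ml/models/train_demand_model.py).
import math

def rmse(y_true: list[float], y_pred: list[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true: list[float], y_pred: list[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true: list[float], y_pred: list[float]) -> float:
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up hourly departures: actual vs predicted
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.5, 2.0, 8.0]
print(round(rmse(y_true, y_pred), 3), round(mae(y_true, y_pred), 3))
```

An RMSE of 2.4 rides/hour means a crew planner can trust the forecast to within roughly two or three bikes per station per hour, which is the sense in which it is "within operational planning threshold".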
CityCycle uses two complementary orchestration layers:
Runs 5 jobs on every push to main:
push to main
│
├── lint (ruff + black — code style enforcement)
├── mock-data (generate + validate mock CSV files)
├── dbt-compile (dbt compile + dbt test against mock data)
├── train-model (train XGBoost on mock data, validate RMSE)
└── notebook (validate notebook structure)
Dagster manages the data pipeline as software-defined assets with explicit dependency tracking, metadata logging, and a visual asset graph. The pipeline runs end-to-end against mock data as a proof of concept. In production, mock_bq_load_asset would be replaced by meltano_ingest_asset to trigger live BigQuery ingestion.
Asset dependency graph:
Pipeline execution order:
mock_data_asset (generate 10K synthetic rides + stations)
└── mock_bq_load_asset (validate CSV schema — zero BQ cost)
├── post_ingest_ge_asset ← quality gate 1 (14 checks)
└── dbt_compile_asset (validate 7 dbt models compile)
└── dbt_test_asset (run 57 dbt schema tests)
└── post_transform_ge_asset ← quality gate 2
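The same dependency graph can be expressed as plain data and resolved with a topological sort, which is essentially what Dagster does internally. Asset names match the tree above; everything else is illustrative:

```python
# Pure-Python sketch of the asset dependency graph above, resolved into an
# execution order (Dagster performs this resolution internally).
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

asset_deps = {
    "mock_data_asset": set(),
    "mock_bq_load_asset": {"mock_data_asset"},
    "post_ingest_ge_asset": {"mock_bq_load_asset"},
    "dbt_compile_asset": {"mock_bq_load_asset"},
    "dbt_test_asset": {"dbt_compile_asset"},
    "post_transform_ge_asset": {"dbt_test_asset"},
}

order = list(TopologicalSorter(asset_deps).static_order())
print(order)  # mock_data_asset first, post_transform_ge_asset last
```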
Run the pipeline locally:
```bash
pip install dagster dagster-webserver
dagster dev -f orchestration/jobs/citycycle_pipeline_job.py
# Open http://localhost:3000 → click "Materialize all"
```

Production architecture (future): Replace `mock_bq_load_asset` with `meltano_ingest_asset` to run the full pipeline against live BigQuery on the daily 02:00 UTC schedule defined in `citycycle_daily_02utc`.
| Page | File | Description |
|---|---|---|
| Overview | `01_overview.py` | Daily ride KPIs, imbalance score, fleet utilisation |
| Station Map | `02_station_map.py` | Pydeck geospatial map of all 795 stations, colour-coded by priority (CRITICAL · HIGH · MEDIUM · LOW) |
| Rebalancing | `03_rebalancing.py` | Ranked list of stations needing intervention, with predicted demand delta |
| Forecast | `04_forecast.py` | 24h XGBoost demand forecast per station |
| Scenario Planner | `05_scenario.py` | Guided 6-step operational crew dispatch workflow (see below) |
Scenario Planner — 05_scenario.py
The scenario planner is the operational heart of the dashboard — a guided 6-step workflow designed for a rebalancing operations manager planning daily crew dispatch:
1. Select date range — Map and table filter dynamically to show imbalance patterns for that specific window, not all-time averages
2. Identify danger zone stations — 795 stations colour-coded by priority; red/orange concentration in inner London immediately visible
3. Filter to CRITICAL and HIGH — Narrows to the 30 stations needing daily attention; table shows imbalance score, net flow direction, and bikes needed
4. Understand draining vs filling — Draining = deliver bikes; Filling = collect bikes; quantity derived directly from `abs(net_flow)` — no manual calculation
5. Review the dispatch list — Filtered table becomes the crew morning briefing; production-ready output
6. Forecast hourly demand — XGBoost predicts departures per station per hour; RMSE 2.422 — sufficient for crew scheduling decisions
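The draining/filling rule in step 4 can be sketched directly from `net_flow`. New North Road Hoxton's +7 figure comes from the findings table; the second station name and its value are made up for illustration:

```python
# Illustrative sketch of the step-4 dispatch rule (the dashboard logic
# lives in dashboard/pages/05_scenario.py).
def dispatch_action(net_flow: int) -> tuple[str, int]:
    """Positive net_flow = draining (deliver bikes); negative = filling (collect)."""
    if net_flow > 0:
        return ("deliver", abs(net_flow))
    if net_flow < 0:
        return ("collect", abs(net_flow))
    return ("none", 0)

# Net flow per station per day; the second station is hypothetical.
stations = {"New North Road Hoxton": 7, "Example Station": -12}
for name, net_flow in stations.items():
    action, bikes = dispatch_action(net_flow)
    print(f"{name}: {action} {bikes} bikes")
```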
```bash
# Run the full dashboard
streamlit run dashboard/app.py

# Or open the static scenario planner directly (no BQ connection needed)
open docs/dashboard/05_scenario_static.html
```

Connected directly to BigQuery `citycycle_dev_marts` via live connector. Includes:
- KPI scorecard: total rides · avg ride duration · avg imbalance score
- Rider segments: Casual 46% · Leisure 29% · Commuter 24% (approximated from ride behaviour — weekend flag + peak hour flag)
- Daily ride trend: Monthly time series 2020–2023 with COVID signal visible
- Station rebalancing tables: Draining and filling stations ranked by imbalance score
- Date range filter: Fixed to `2020-01-01` → `2023-01-15` — dataset does not extend to present
📄 Looker Studio PDF export: `docs/presentation/lookerstudio_CityCycle_London_Ops.pdf`
These findings are based on 32,342,086 real rides from `bigquery-public-data.london_bicycles`, ingested into the CityCycle data warehouse and analysed via the full ELT pipeline.
| Metric | Value | Insight & Business Implication |
|---|---|---|
| Total rides (2020–2023) | 32,342,086 | 3 years of real operational data — sufficient for trend detection and ML model training |
| Average ride duration | 21.8 minutes | Mixed use confirmed — commuter and leisure trips coexist across the network |
| Rider segments | Leisure 53% · Casual 32% · Commuter 15% | K-Means k=3 — each segment drains stations in different directions at different times |
| Weekend rides | 9.47M (29.3%) | Strong leisure demand — different rebalancing strategy required Sat/Sun vs weekdays |
| Avg network imbalance score | 0.084 | Network-wide baseline — scores above 0.18 trigger scheduled crew intervention |
| Critical stations | 3 stations | Score ≥ 0.25: New North Road Hoxton · Ladbroke Grove Central · Cloudesley Road Angel |
| High priority stations | 27 stations | Score ≥ 0.18 — require scheduled daily intervention by rebalancing crews |
| XGBoost RMSE | 2.422 rides/hr | Best of 3 models tested — within operational planning threshold for crew dispatch |
| Top draining station | New North Road Hoxton | Score 0.324 · Net flow +7.0 bikes/day · Draining 90% of days — needs daily pre-AM crew |
| COVID-19 signal | Visible in 2020 | Clear ridership collapse March–May 2020, full recovery to pre-pandemic levels by mid-2022 |
Ride demand across 2020–2023 is shaped by two compounding forces outside the pipeline's control: COVID-19 lockdown waves and London's natural winter cycling seasonality. Neither appears as a feature in the ML model, which is why XGBoost explains 49% of variance — the remaining 51% is largely attributable to these exogenous signals.
| Period | Type | Impact |
|---|---|---|
| Mar–Jun 2020 | 🔴 Lockdown 1 | ~500K rides/month — first national lockdown |
| Nov 2020–Jan 2021 | 🔴 Lockdown 2 + 3 | ~400K rides/month — deepest trough in dataset |
| Dec 2021–Jan 2022 | 🟡 Omicron / Plan B | ~650K — work from home reintroduced |
| Jan, Dec each year | 🔵 Winter seasonality | Natural trough — cold weather, short daylight hours |
Despite pandemic disruption, each subsequent summer peak exceeded the last — ~1.1M in summer 2020, ~1.2M in 2021, reaching 1,302,994 in July 2022 — confirming underlying demand recovered and grew year-on-year.
📊 The chart is also available as a standalone file: `docs/charts/chart_ride_trends_annotated.html`

📈 View Ride Trends 2020–2023 (annotated)

Open `docs/charts/chart_ride_trends_annotated.html` in a browser for the full interactive chart with lockdown bands, winter seasonality overlays, and hover tooltips per month.
Key data points directly from citycycle_dev_marts.fact_rides:
| Month | Rides | Event |
|---|---|---|
| Jan 2020 | ~700K | Pre-pandemic winter baseline |
| Apr 2020 | ~500K | Lockdown 1 trough (−30%) |
| Jul 2020 | ~1.1M | Post-lockdown summer recovery |
| Jan 2021 | ~400K | Deepest trough — Lockdown 3 + winter seasonality compounding |
| Jul 2022 | 1,302,994 | Dataset peak — full recovery confirmed |
| Jan 2023 | ~240K | End of dataset winter trough (partial month) |
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| BigQuery free tier exceeded | Medium | High | Mock data dev; LIMIT guards; dry-run estimates; budget alerts |
| Meltano tap-bigquery schema drift | Low | Medium | dbt schema tests; GE not-null/type checks catch regressions |
| Long BQ query runtime in CI | Medium | Medium | CI uses mock CSV only; no live BQ in GitHub Actions |
| ML model staleness | Medium | Medium | Retrain script in train_demand_model.py; model versioned in ml/models/ |
| Dashboard downtime | Low | Low | Streamlit caches last-good result; graceful error states |
| Credentials leaked to Git | Low | Critical | .gitignore covers all credential patterns; .env.example only |
| Field | Type | Description |
|---|---|---|
| `rental_id` | INT64 | Unique identifier for each ride |
| `bike_id` | INT64 | Identifier of the bike used |
| `duration` | INT64 | Ride duration in seconds |
| `start_date` | TIMESTAMP | Date and time the ride began |
| `end_date` | TIMESTAMP | Date and time the ride ended |
| `start_station_id` | INT64 | ID of the station where the bike was hired |
| `start_station_name` | STRING | Name of the hire station |
| `end_station_id` | INT64 | ID of the station where the bike was returned (nullable — ~312K lost/unreturned bikes) |
| `end_station_name` | STRING | Name of the return station |
| Field | Type | Description |
|---|---|---|
| `id` | INT64 | Unique station identifier |
| `name` | STRING | Station name and location description |
| `terminal_name` | STRING | Physical terminal code on the docking unit |
| `latitude` | FLOAT64 | Station latitude (WGS84) |
| `longitude` | FLOAT64 | Station longitude (WGS84) |
| `docks_count` | INT64 | Number of physical docking points at the station |
| `installed` | BOOL | Whether the station is currently installed |
| `locked` | BOOL | Whether the station is locked/out of service |
| `temporary` | BOOL | Whether the station is a temporary installation |
| `install_date` | DATE | Date the station was installed |
Approximately half of all fields in the star schema are engineered features derived from raw source data — not simply loaded from BigQuery. The table below summarises every calculated field and its formula across all three layers.
| Field | Formula |
|---|---|
| `hire_date` | `DATE(start_datetime)` |
| `start_hour` | `EXTRACT(HOUR FROM start_datetime)` |
| `day_of_week` | `EXTRACT(DAYOFWEEK FROM start_datetime)` — 1=Sunday, 7=Saturday |
| `is_weekend` | `day_of_week IN (1, 7)` |
| `duration_minutes` | `duration_seconds / 60.0` |
| `duration_band` | short <10 min · medium 10–30 · long 30–60 · extended >60 |
| `peak_hour_flag` | 1 if `start_hour IN (7, 8, 17, 18)` else 0 |
| `time_period` | am_peak (07–09) · pm_peak (17–19) · midday · evening · night |
| `season` | spring (Mar–May) · summer (Jun–Aug) · autumn (Sep–Nov) · winter (Dec–Feb) |
| `is_round_trip` | TRUE if `start_station_id = end_station_id` |
| `ride_sk` | Surrogate key — `dbt_utils.generate_surrogate_key(['rental_id'])` |
| `net_flow` | `total_departures - total_arrivals` per station per day (joined from `int_station_daily_stats`) |
| `imbalance_score` | `ABS(departures - arrivals) / MAX(departures + arrivals, 1)` — range 0 to 1 |
| `imbalance_direction` | draining (`net_flow > 0`) · filling (`net_flow < 0`) · balanced (`net_flow = 0`) |
| `rebalancing_priority` | CRITICAL (≥0.25) · HIGH (≥0.18) · MEDIUM (≥0.10) · LOW (<0.10) |
| `rolling_7d_avg` | 7-day rolling average of departures per station — used as ML feature |
| Field | Formula |
|---|---|
| `zone` | London area classification derived from lat/lon bounding boxes |
| `capacity_tier` | small (≤15 docks) · medium (≤24 docks) · large (>24 docks) |
| `avg_imbalance_score_7d` | All-time average imbalance score across full dataset |
| `rebalancing_priority` | Same formula as `fact_rides` — based on all-time avg imbalance score |
| `total_departures_all_time` | `COUNT(*)` of rides departing from station across full dataset |
| `total_arrivals_all_time` | `COUNT(*)` of rides arriving at station across full dataset |
| `station_sk` | Surrogate key — `dbt_utils.generate_surrogate_key(['station_id'])` |
| Field | Formula |
|---|---|
| `year` | `EXTRACT(YEAR FROM full_date)` |
| `month` | `EXTRACT(MONTH FROM full_date)` |
| `day` | `EXTRACT(DAY FROM full_date)` |
| `week_num` | `EXTRACT(WEEK FROM full_date)` |
| `day_of_week` | `EXTRACT(DAYOFWEEK FROM full_date)` |
| `is_weekend` | `day_of_week IN (1, 7)` |
| `season` | Same CASE logic as `fact_rides` |
| `is_uk_bank_holiday` | Hardcoded for 2023–2024 only — FALSE for all 2020–2022 rides |
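The spine itself is straightforward to generate. A pure-Python sketch reproducing the `date_id`, `day_of_week` (BigQuery convention, 1=Sunday), `is_weekend`, and `season` columns; bank holidays are omitted and the real model is `dim_date.sql`:

```python
# Illustrative Python version of the dim_date spine (the real model is
# transform/models/marts/dim_date.sql; bank holidays omitted here).
from datetime import date, timedelta

SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def date_spine(start: date, end: date) -> list[dict]:
    rows = []
    d = start
    while d <= end:
        # Convert ISO weekday (Mon=1..Sun=7) to BigQuery DAYOFWEEK (Sun=1..Sat=7)
        day_of_week = d.isoweekday() % 7 + 1
        rows.append({
            "date_id": int(d.strftime("%Y%m%d")),
            "full_date": d,
            "year": d.year,
            "month": d.month,
            "day_of_week": day_of_week,
            "is_weekend": day_of_week in (1, 7),
            "season": SEASONS[d.month],
        })
        d += timedelta(days=1)
    return rows

spine = date_spine(date(2015, 1, 1), date(2025, 12, 31))
print(len(spine))  # 4018 days across the 2015-2025 spine
```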
Fields sourced directly from raw data (not calculated):
`rental_id`, `bike_id`, `start_datetime`, `end_datetime`, `start_station_id`, `end_station_id`, `station_name`, `latitude`, `longitude`, `nb_docks`, `is_installed`, `is_locked`, `zone`, `terminal_name`
All raw fields are retained and the following are added or renamed:
| Field | Type | Source | Description |
|---|---|---|---|
| `rental_id` | INT64 | raw | Cast to INT64, null rows removed |
| `bike_id` | INT64 | raw | Cast to INT64 |
| `start_datetime` | TIMESTAMP | `start_date` | Renamed and cast to TIMESTAMP |
| `end_datetime` | TIMESTAMP | `end_date` | Renamed and cast to TIMESTAMP |
| `duration_seconds` | INT64 | `duration` | Renamed to make unit explicit |
| `hire_date` | DATE | Calculated | `DATE(start_datetime)` — extracts the calendar date for partitioning and daily aggregations |
| `start_hour` | INT64 | Calculated | `EXTRACT(HOUR FROM start_datetime)` — hour of day (0–23) used for temporal analysis and ML features |
| `day_of_week` | INT64 | Calculated | `EXTRACT(DAYOFWEEK FROM start_datetime)` — 1=Sunday … 7=Saturday |
| `is_weekend` | BOOL | Calculated | TRUE if `day_of_week IN (1, 7)` — used to split commuter vs leisure demand patterns |
Row filters applied in staging:
- `rental_id IS NOT NULL`
- `duration_seconds BETWEEN 60 AND 86400`
- `end_datetime > start_datetime`
These filters reduce 32,369,326 raw rows to 32,342,086 in the final fact table.
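The three filters translate directly into a row predicate. An illustrative pure-Python version (the real logic is SQL in `stg_cycle_hire.sql`):

```python
# Python mirror of the staging row filters (illustrative; the authoritative
# logic is SQL in transform/models/staging/stg_cycle_hire.sql).
def is_valid_ride(row: dict) -> bool:
    return (
        row.get("rental_id") is not None
        and 60 <= row["duration_seconds"] <= 86400
        and row["end_datetime"] > row["start_datetime"]
    )

rides = [
    {"rental_id": 1, "duration_seconds": 900,
     "start_datetime": "2022-07-01 08:00:00", "end_datetime": "2022-07-01 08:15:00"},
    {"rental_id": 2, "duration_seconds": 10,  # below 60 s -> dropped
     "start_datetime": "2022-07-01 09:00:00", "end_datetime": "2022-07-01 09:00:10"},
    {"rental_id": None, "duration_seconds": 600,  # null id -> dropped
     "start_datetime": "2022-07-01 10:00:00", "end_datetime": "2022-07-01 10:10:00"},
]
clean = [r for r in rides if is_valid_ride(r)]
print(len(clean))  # 1
```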
| Field | Type | Source | Description |
|---|---|---|---|
| `station_id` | INT64 | `id` | Renamed for consistency |
| `station_name` | STRING | `name` | Renamed for clarity |
| `terminal_name` | STRING | raw | Physical terminal code |
| `latitude` | FLOAT64 | raw | Cast to FLOAT64 |
| `longitude` | FLOAT64 | raw | Cast to FLOAT64 |
| `nb_docks` | INT64 | `docks_count` | Number of docking points |
| `is_installed` | BOOL | `installed` | Renamed for consistency |
| `is_locked` | BOOL | `locked` | Renamed for consistency |
| `is_temporary` | BOOL | `temporary` | Renamed for consistency |
| `install_date` | DATE | raw | Cast to DATE |
| `zone` | STRING | Calculated | London area classification based on lat/lon bounding boxes: City & Shoreditch, Westminster & Victoria, Waterloo & Southbank, Camden & Islington, East End & Canary Wharf, Kensington & Chelsea, Other |
| `capacity_tier` | STRING | Calculated | Station size classification: small (≤15 docks), medium (≤24 docks), large (>24 docks) |
Joins `stg_cycle_hire` with `stg_cycle_stations` (twice — once for the start station, once for the end station) and adds business logic flags:
| Field | Type | Description |
|---|---|---|
| `duration_minutes` | FLOAT64 | Calculated — `duration_seconds / 60.0` |
| `duration_band` | STRING | Calculated — short (<10 min), medium (10–30 min), long (30–60 min), extended (>60 min) |
| `peak_hour_flag` | INT64 | Calculated — 1 if `start_hour IN (7, 8, 17, 18)`, else 0. Marks London commuter peak hours. Core ML feature. |
| `time_period` | STRING | Calculated — am_peak, pm_peak, midday, evening, night |
| `is_round_trip` | BOOL | Calculated — TRUE if `start_station_id = end_station_id` |
| `start_zone` | STRING | Joined from `stg_cycle_stations` — zone of the departure station |
| `start_lat` / `start_lon` | FLOAT64 | Joined — coordinates for geospatial mapping |
| `start_nb_docks` | INT64 | Joined — dock capacity of the departure station |
| `start_capacity_tier` | STRING | Joined — size tier of the departure station |
| `end_zone` | STRING | Joined from `stg_cycle_stations` — zone of the return station |
| `end_lat` / `end_lon` | FLOAT64 | Joined — coordinates of the return station |
Aggregates ride data to one row per station per day, computing the core rebalancing metrics:
| Field | Type | Description |
|---|---|---|
| `hire_date` | DATE | Calendar date |
| `station_id` | INT64 | Station identifier |
| `total_departures` | INT64 | Calculated — count of rides starting at this station on this date |
| `total_arrivals` | INT64 | Calculated — count of rides ending at this station on this date |
| `net_flow` | INT64 | Calculated — `total_departures - total_arrivals`. Positive = draining, negative = filling |
| `imbalance_score` | FLOAT64 | Calculated — `ABS(net_flow) / MAX(departures + arrivals, 1)`. Normalised 0–1 score |
| `is_imbalanced` | BOOL | Calculated — TRUE if `imbalance_score > 0.20` |
| `imbalance_direction` | STRING | Calculated — draining, filling, or balanced |
| `utilisation_rate` | FLOAT64 | Calculated — `(departures + arrivals) / (nb_docks × 2)` |
| `peak_departures` | INT64 | Calculated — departures during peak hours only |
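The core aggregation can be sketched in pure Python to show how the metrics relate. Illustrative only; the real model is dbt SQL in `int_station_daily_stats.sql`:

```python
# Illustrative Python version of the per-station per-day aggregation
# (the authoritative logic is dbt SQL in int_station_daily_stats.sql).
from collections import defaultdict

def station_daily_stats(rides: list[dict]) -> dict:
    dep = defaultdict(int)
    arr = defaultdict(int)
    for r in rides:
        dep[(r["hire_date"], r["start_station_id"])] += 1
        arr[(r["hire_date"], r["end_station_id"])] += 1
    stats = {}
    for key in dep.keys() | arr.keys():
        net_flow = dep[key] - arr[key]
        score = abs(net_flow) / max(dep[key] + arr[key], 1)
        stats[key] = {
            "net_flow": net_flow,
            "imbalance_score": score,
            "is_imbalanced": score > 0.20,
            "direction": ("draining" if net_flow > 0
                          else "filling" if net_flow < 0 else "balanced"),
        }
    return stats

rides = [
    {"hire_date": "2022-07-01", "start_station_id": 1, "end_station_id": 2},
    {"hire_date": "2022-07-01", "start_station_id": 1, "end_station_id": 2},
    {"hire_date": "2022-07-01", "start_station_id": 1, "end_station_id": 3},
]
stats = station_daily_stats(rides)
print(stats[("2022-07-01", 1)])  # net_flow 3 -> draining, score 1.0
```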
One row per ride, partitioned by hire_date (day granularity), clustered by start_station_id, end_station_id. Combines all enriched ride fields with station-level imbalance signals joined from int_station_daily_stats:
| Field | Type | Description |
|---|---|---|
| `ride_sk` | STRING | Calculated — surrogate key via `dbt_utils.generate_surrogate_key(['rental_id'])` |
| `start_station_imbalance_score` | FLOAT64 | Joined — imbalance score of the departure station on the ride date |
| `start_station_is_imbalanced` | BOOL | Joined — whether the departure station was flagged as imbalanced |
| `start_station_imbalance_direction` | STRING | Joined — draining, filling, or balanced |
| `start_station_net_flow` | INT64 | Joined — net bike flow at the departure station on the ride date |
| `start_station_utilisation_rate` | FLOAT64 | Joined — utilisation rate of the departure station |
| `start_station_rolling_7d_avg` | FLOAT64 | Joined — 7-day rolling average demand, used as ML feature |
One row per station with all-time average imbalance metrics:
| Field | Type | Description |
|---|---|---|
| `station_sk` | STRING | Calculated — surrogate key |
| `avg_imbalance_score_7d` | FLOAT64 | Calculated — all-time average imbalance score (uses full dataset average; named `_7d` for historical reasons) |
| `rebalancing_priority` | STRING | Calculated — CRITICAL (≥0.25), HIGH (≥0.18), MEDIUM (≥0.10), LOW (<0.10) |
| `total_departures_all_time` | INT64 | Calculated — cumulative departures across the full dataset |
| `total_arrivals_all_time` | INT64 | Calculated — cumulative arrivals across the full dataset |
| `last_activity_date` | DATE | Most recent date with recorded activity at this station |
Date spine from 2015 to 2025:
| Field | Type | Description |
|---|---|---|
| `date_id` | INT64 | Primary key — YYYYMMDD integer |
| `full_date` | DATE | Full date value |
| `year` | INT64 | Calendar year |
| `month` | INT64 | Month number (1–12) |
| `week_number` | INT64 | ISO week number |
| `day_of_week` | INT64 | Day number (1=Sunday … 7=Saturday) |
| `is_weekend` | BOOL | TRUE for Saturday and Sunday |
| `season` | STRING | spring, summer, autumn, winter based on month |
| `is_uk_bank_holiday` | BOOL | TRUE for UK bank holidays — hardcoded for 2023–2024 only |
The current pipeline establishes the core ELT infrastructure, imbalance detection, and demand forecasting. The following extensions would move it toward a full production rebalancing system:
| # | Improvement | Description |
|---|---|---|
| 01 | Enrich the feature set | Integrate TfL's live BikePoint API for real-time dock occupancy, Met Office weather data, and UK event calendars. Target: push R² from 0.488 toward 0.70+ by capturing exogenous demand signals. |
| 02 | Segment-aware rebalancing schedules | Operationalise the K-Means segmentation — automatically tag stations by dominant rider type and generate differentiated crew schedules: pre-AM stocking for commuter stations on weekdays, mid-morning restocking for leisure stations on weekends. |
| 03 | Predictive dock capacity alert | Surface same-day shortfall alerts when the XGBoost forecast predicts more departures than current bike count — before the imbalance accumulates over multiple days. |
| 04 | Expand to other cities | The pipeline is city-agnostic. Pointing the Meltano tap at Dublin Bikes, New York Citi Bike, or Paris Vélib public datasets replicates the full intelligence pipeline with minimal rework. |
| 05 | Automate the full production pipeline | Activate the Dagster citycycle_daily_02utc schedule — replace mock_bq_load_asset with meltano_ingest_asset to run nightly ingestion, transformation, and quality checks automatically. |
- Fork and create a feature branch: `git checkout -b feat/your-feature`
- Develop against mock data only (`--target dev` in dbt)
- Run `dbt test` before committing
- Open a PR against `main` — CI will run linting and mock-data tests
- Never commit `.env`, `profiles.yml`, or any `*keyfile*.json`
DSAI4 Module 2 · Team 2 · March 2026


