🚲 CityCycle London — Bike Rebalancing Intelligence Pipeline

dsai4-m2-t2-citycycle
End-to-end ELT pipeline for the London Bicycle Sharing dataset, built for the CityCycle operations team to solve the bike rebalancing problem using data engineering, ML forecasting, and interactive dashboards.

Business Problem

London's CityCycle bike-sharing network operates 795 active docking stations across the city, processing millions of rides annually. The core operational challenge is bike rebalancing: stations run empty (stranded demand) or overflow (no docks to return), leading to:

Lost revenue from unfulfilled rentals
Increased operational costs for manual rebalancing crews
Poor customer experience and negative NPS
Inefficient fleet utilisation across the network

Goal: Build an intelligent, data-driven pipeline that ingests ride history, detects imbalance patterns, forecasts demand per station, and visualises actionable rebalancing recommendations in near real-time.

Solution Overview

BigQuery Public Data → Meltano Ingest → BQ Raw → dbt Transform
→ Great Expectations Quality Gate → ML Demand Forecast
→ Streamlit Dashboard + Looker Studio Report
(CI/CD orchestrated by GitHub Actions · 5 jobs · push-triggered)

Architecture

The pipeline follows a medallion-style architecture:

Bronze (raw.*): Raw tables ingested from BigQuery public dataset via Meltano
Silver (staging.*): Cleaned, typed, validated tables via dbt staging models
Gold (marts.*): Star schema fact/dimension tables for analytics and ML

Tech Stack

Layer	Tool	Purpose
Ingestion	Meltano (tap-bigquery → target-bigquery)	Singer-protocol EL from source to raw
Warehouse	Google BigQuery	Cloud data warehouse, star schema
Transform	dbt Core	SQL-based ELT, lineage, testing
Quality	Great Expectations	Expectation suites, checkpoints, data docs
Analysis	Python / pandas / scikit-learn	EDA, feature engineering, ML
Dashboard	Streamlit	Interactive ops dashboard + geospatial map
BI Reporting	Looker Studio	Executive KPI report (BQ connector)

Repository Structure

dsai4-m2-t2-citycycle/
├── .github/
│   └── workflows/
│       └── ci.yml                    # GitHub Actions: lint, mock-data, dbt-compile, train-model, notebook
├── ingestion/
│   ├── meltano.yml                   # Meltano project config (tap-bigquery → target-bigquery)
│   ├── load_mock.py                  # Python loader: mock CSV → BigQuery (dry-run + live)
│   ├── load_live_stations.py         # One-time loader: stations from BQ public dataset
│   └── bq_cost_guard.py              # Query cost guard: dry-run estimates + monthly budget tracking
├── transform/
│   ├── dbt_project.yml               # dbt project config
│   ├── profiles_template.yml         # profiles.yml template (DO NOT commit real profiles.yml)
│   ├── models/
│   │   ├── staging/
│   │   │   ├── stg_cycle_hire.sql    # Clean + type raw ride data
│   │   │   ├── stg_cycle_stations.sql # Clean stations, add zone + capacity_tier
│   │   │   └── _staging.yml          # 25 schema tests
│   │   ├── intermediate/
│   │   │   ├── int_rides_enriched.sql        # Join rides + stations, add flags
│   │   │   └── int_station_daily_stats.sql   # Daily imbalance per station
│   │   └── marts/
│   │       ├── dim_stations.sql      # Station dimension with rebalancing priority
│   │       ├── dim_date.sql          # Date spine 2015–2025
│   │       ├── fact_rides.sql        # 32.3M rows, partitioned by hire_date
│   │       └── _marts.yml            # 31 schema tests (56 PASS · 0 ERROR · 3 WARN in last run)
│   ├── macros/
│   │   └── generate_surrogate_key.sql
│   └── tests/
│       └── assert_ride_duration_positive.sql
├── quality/
│   ├── checkpoints/
│   │   └── post_ingest.yml           # GE checkpoint config
│   ├── expectations/
│   │   └── suites/
│   │       ├── raw_cycle_hire.json
│   │       └── fact_rides.json
│   ├── run_ge_checks.py              # 34 custom SQL checks: 30 PASS · 4 WARN · 0 FAIL
│   └── ge_results.json               # Last run results (evidence)
├── orchestration/
│   ├── workspace.yaml                # Dagster scaffold (reference only — CI uses GitHub Actions)
│   ├── assets/
│   │   ├── ingestion_assets.py
│   │   ├── transform_assets.py
│   │   └── quality_assets.py
│   └── jobs/
│       └── citycycle_pipeline_job.py
├── analysis/
│   └── notebooks/
│       ├── 01_eda_mock_data.ipynb           # Initial EDA on mock data
│       └── 03_bq_eda_live_data.ipynb        # Live BQ EDA via SQLAlchemy (32M rows)
├── ml/
│   └── models/
│       └── train_demand_model.py     # 3-model comparison: Linear Reg · Random Forest · XGBoost
├── dashboard/
│   ├── app.py                        # Streamlit entry point
│   ├── pages/
│   │   ├── 01_overview.py            # KPIs + daily trend + hourly demand
│   │   ├── 02_station_map.py         # pydeck 3D + folium detailed map
│   │   ├── 03_rebalancing.py         # Intervention list + crew runs estimate
│   │   ├── 04_forecast.py            # 24h XGBoost demand forecast
│   │   └── 05_scenario.py            # Guided scenario planner — corridor + dispatch
│   └── utils/
│       ├── bq_client.py              # BQ connection via cost guard
│       └── mock_data_generator.py    # Synthetic data generator (CI-safe)
├── data/
│   └── mock/
│       ├── cycle_hire_mock.csv       # 10K synthetic rides (CI + dev)
│       └── cycle_stations_mock.csv   # 795 station records
├── docs/
│   └── diagrams/
│       └── dataflow_diagram.png      # Architecture diagram
├── .env.example                      # Template for env vars (no secrets)
├── .gitignore
├── requirements.txt
└── README.md

Getting Started

Prerequisites

Python 3.10+
Google Cloud account with BigQuery access
gcloud CLI authenticated
Node.js 18+ (for pptxgenjs, optional)

1. Clone & Install

git clone https://github.com/YOUR_ORG/dsai4-m2-t2-citycycle.git
cd dsai4-m2-t2-citycycle

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

2. Configure Environment

cp .env.example .env
# Edit .env — add your GCP project ID, BQ dataset names, etc.
# NEVER commit .env to Git

3. Run with Mock Data First (Recommended)

Before touching BigQuery's live data, validate the full pipeline with local mock data:

# Generate mock data
python dashboard/utils/mock_data_generator.py

# Load mock CSV into BigQuery (raw schema)
python ingestion/load_mock.py --mode=mock

# Run dbt transformations
cd transform && dbt run --target dev

# Run quality checks
python quality/run_ge_checks.py

# Launch dashboard
streamlit run dashboard/app.py

4. Run Full Pipeline (Real Data)

Once validated on mock data, switch to live ingestion:

# Meltano ingest from BQ public dataset
cd ingestion && meltano run tap-bigquery target-bigquery

# Then continue with dbt + GE as above

# See .github/workflows/ci.yml for the full CI pipeline

Mock Data Strategy

Why Mock Data First?

BigQuery's free tier provides 1 TB of query processing per month. The cycle_hire table has 83 million rows. A single unguarded SELECT * could consume the entire monthly quota instantly.

Our Approach

Risk	Mitigation
Full-table scan on `cycle_hire`	`LIMIT` clauses on all dev queries; partitioned by `hire_date`
Accidental `SELECT *`	dbt `+limit` macro in dev profile; BQ slot quota set
Exceeding 1 TB free tier	Dry-run cost estimates before every query; budget alert at 80%
Development iteration cost	All development runs against `data/mock/` CSV files
CI/CD test cost	GitHub Actions uses mock data only; no live BQ calls in CI

Mock Data Schema

The mock data mirrors the exact schema of the public BigQuery tables:

cycle_hire_mock.csv    → bike_id, rental_id, duration, start_date,
                         start_station_id, start_station_name,
                         end_date, end_station_id, end_station_name
cycle_stations_mock.csv → id, install_date, installed, latitude,
                          locked, longitude, name, nbdocks,
                          temporary, terminal_name

Pipeline Walkthrough

1. Ingestion (Meltano)

Meltano uses the Singer protocol (tap → target) to extract data from BigQuery and load it into the raw dataset.

tap-bigquery: Reads from bigquery-public-data.london_bicycles
target-bigquery: Writes to your project's raw dataset
Supports full refresh and incremental loads (state-based on start_date)

meltano run tap-bigquery target-bigquery

2. Data Warehouse Design

Star schema optimised for ride analytics and rebalancing queries:

Fact Table:

fact_rides — one row per ride: duration, start/end station FK, date FK, hour, day-of-week, imbalance signals

Dimension Tables:

dim_stations — station metadata: name, location (lat/lon), dock capacity, zone, rebalancing priority tier
dim_date — date spine 2015–2025: year, month, week, is_weekend, is_holiday (UK bank holidays 2023–2024)

3. ELT Transformation (dbt)

raw.cycle_hire
    └── stg_cycle_hire        (cast types, rename columns, parse timestamps, filter bad rows)
        └── int_rides_enriched (join stations, add peak_hour_flag, duration_band, time_period)
            └── fact_rides     (final fact table, join imbalance signals, rolling_7d_avg)

raw.cycle_stations
    └── stg_cycle_stations    (clean nulls, add zone via lat/lon bounding boxes, capacity_tier)
        └── dim_stations       (final dimension, all-time imbalance stats, rebalancing_priority)

Materialisation strategy:

staging/ and intermediate/ → views (zero storage cost, always fresh)
marts/ → tables (materialised for fast dashboard queries)
fact_rides → additionally partitioned by hire_date and clustered by start_station_id, end_station_id for cost-efficient rebalancing queries

Derived columns generated in dbt (selected key fields):

Field	Layer	Formula
`duration_minutes`	intermediate	`duration_seconds / 60.0`
`peak_hour_flag`	intermediate	`1` if `start_hour IN (7, 8, 17, 18)` else `0`
`duration_band`	intermediate	`short` <10 min · `medium` 10–30 · `long` 30–60 · `extended` >60
`time_period`	intermediate	`am_peak` · `pm_peak` · `midday` · `evening` · `night`
`is_round_trip`	intermediate	`TRUE` if `start_station_id = end_station_id`
`net_flow`	intermediate	`total_departures - total_arrivals` per station per day
`imbalance_score`	intermediate	`ABS(net_flow) / MAX(departures + arrivals, 1)` — range 0–1
`is_imbalanced`	intermediate	`TRUE` if `imbalance_score > 0.20`
`rebalancing_priority`	marts	`CRITICAL` ≥0.25 · `HIGH` ≥0.18 · `MEDIUM` ≥0.10 · `LOW` <0.10
`rolling_7d_avg`	marts	7-day rolling average departures per station — ML feature
`ride_sk` / `station_sk`	marts	Surrogate keys via `dbt_utils.generate_surrogate_key`

dbt tests: 56 PASS · 0 ERROR · 3 WARN (intentional severity: warn on nullable FK fields)

4. Data Quality (Great Expectations)

Two checkpoint stages:

Post-ingest checkpoint (raw.*):

rental_id not null, unique
start_date > '2010-01-01'
duration between 60 and 86400 seconds
start_station_id in valid station list

Post-transform checkpoint (fact_rides, dim_stations):

No orphan station FK references
duration_minutes between 1 and 1440
start_station_is_imbalanced only TRUE/FALSE
Null rate < 5% on all key columns

Results: 30 PASS · 4 WARN · 0 FAIL. Results published as HTML data docs.

5. Analysis & ML

EDA (notebooks)

01_eda_mock_data.ipynb — initial pipeline validation on synthetic data
03_bq_eda_live_data.ipynb — full EDA on 32M rides via SQLAlchemy + BigQuery

Key findings from EDA:

Monthly and hourly ride trends with COVID-19 signal
Weekday double-peak confirmed at 08:00 and 17:00–18:00
K-Means k=3 customer segmentation: Leisure 53% · Casual 32% · Commuter 15%
Station-level imbalance ranking: 3 CRITICAL, 27 HIGH priority stations identified

Demand Forecasting Model

Features: hour, day_of_week, is_weekend, is_holiday, season, start_station_id, rolling_7d_avg
Target: hourly departures per station
Models tested: Linear Regression (baseline) · Random Forest · XGBoost
Split: 80/20 train/test · ~10.4M feature rows · no cross-validation
Best model: XGBoost — RMSE 2.422 · MAE 1.508 · R² 0.488

6. Orchestration

CityCycle uses two complementary orchestration layers:

GitHub Actions CI (code quality)

Runs 5 jobs on every push to main:

push to main
│
├── lint              (ruff + black — code style enforcement)
├── mock-data         (generate + validate mock CSV files)
├── dbt-compile       (dbt compile + dbt test against mock data)
├── train-model       (train XGBoost on mock data, validate RMSE)
└── notebook          (validate notebook structure)

Dagster (pipeline orchestration — proof of concept)

Dagster manages the data pipeline as software-defined assets with explicit dependency tracking, metadata logging, and a visual asset graph. The pipeline runs end-to-end against mock data as a proof of concept. In production, mock_bq_load_asset would be replaced by meltano_ingest_asset to trigger live BigQuery ingestion.

Asset dependency graph:

Pipeline execution order:

mock_data_asset          (generate 10K synthetic rides + stations)
  └── mock_bq_load_asset (validate CSV schema — zero BQ cost)
        ├── post_ingest_ge_asset     ← quality gate 1 (14 checks)
        └── dbt_compile_asset        (validate 7 dbt models compile)
              └── dbt_test_asset     (run 57 dbt schema tests)
                    └── post_transform_ge_asset  ← quality gate 2

Run the pipeline locally:

pip install dagster dagster-webserver
dagster dev -f orchestration/jobs/citycycle_pipeline_job.py
# Open http://localhost:3000 → click Materialize all

Production architecture (future): Replace mock_bq_load_asset with meltano_ingest_asset to run the full pipeline against live BigQuery on the daily 02:00 UTC schedule defined in citycycle_daily_02utc.

7. Dashboards

Streamlit (Operational) — 5 pages

Page	File	Description
Overview	`01_overview.py`	Daily ride KPIs, imbalance score, fleet utilisation
Station Map	`02_station_map.py`	Pydeck geospatial map of all 795 stations, colour-coded by priority (CRITICAL · HIGH · MEDIUM · LOW)
Rebalancing	`03_rebalancing.py`	Ranked list of stations needing intervention, with predicted demand delta
Forecast	`04_forecast.py`	24h XGBoost demand forecast per station
Scenario Planner	`05_scenario.py`	Guided 6-step operational crew dispatch workflow (see below)

Scenario Planner — 05_scenario.py

The scenario planner is the operational heart of the dashboard — a guided 6-step workflow designed for a rebalancing operations manager planning daily crew dispatch:

Select date range — Map and table filter dynamically to show imbalance patterns for that specific window, not all-time averages
Identify danger zone stations — 795 stations colour-coded by priority; red/orange concentration in inner London immediately visible
Filter to CRITICAL and HIGH — Narrows to the 30 stations needing daily attention; table shows imbalance score, net flow direction, and bikes needed
Understand draining vs filling — Draining = deliver bikes; Filling = collect bikes; quantity derived directly from abs(net_flow) — no manual calculation
Review the dispatch list — Filtered table becomes the crew morning briefing; production-ready output
Forecast hourly demand — XGBoost predicts departures per station per hour; RMSE 2.422 — sufficient for crew scheduling decisions

# Run the full dashboard
streamlit run dashboard/app.py

# Or open the static scenario planner directly (no BQ connection needed)
open docs/dashboard/05_scenario_static.html

Looker Studio (Executive)

Connected directly to BigQuery citycycle_dev_marts via live connector. Includes:

KPI scorecard: total rides · avg ride duration · avg imbalance score
Rider segments: Casual 46% · Leisure 29% · Commuter 24% (approximated from ride behaviour — weekend flag + peak hour flag)
Daily ride trend: Monthly time series 2020–2023 with COVID signal visible
Station rebalancing tables: Draining and filling stations ranked by imbalance score
Date range filter: Fixed to 2020-01-01 → 2023-01-15 — dataset does not extend to present

📄 Looker Studio PDF export: docs/presentation/lookerstudio_CityCycle_London_Ops.pdf

Key Findings (Live Data — 32M Rides, 2020–2023)

These findings are based on 32,342,086 real rides from bigquery-public-data.london_bicycles ingested into the CityCycle data warehouse and analysed via the full ELT pipeline.

Metric	Value	Insight & Business Implication
Total rides (2020–2023)	32,342,086	3 years of real operational data — sufficient for trend detection and ML model training
Average ride duration	21.8 minutes	Mixed use confirmed — commuter and leisure trips coexist across the network
Rider segments	Leisure 53% · Casual 32% · Commuter 15%	K-Means k=3 — each segment drains stations in different directions at different times
Weekend rides	9.47M (29.3%)	Strong leisure demand — different rebalancing strategy required Sat/Sun vs weekdays
Avg network imbalance score	0.084	Network-wide baseline — scores above 0.18 trigger scheduled crew intervention
Critical stations	3 stations	Score ≥ 0.25: New North Road Hoxton · Ladbroke Grove Central · Cloudesley Road Angel
High priority stations	27 stations	Score ≥ 0.18 — require scheduled daily intervention by rebalancing crews
XGBoost RMSE	2.422 rides/hr	Best of 3 models tested — within operational planning threshold for crew dispatch
Top draining station	New North Road Hoxton	Score 0.324 · Net flow +7.0 bikes/day · Draining 90% of days — needs daily pre-AM crew
COVID-19 signal	Visible in 2020	Clear ridership collapse March–May 2020, full recovery to pre-pandemic levels by mid-2022

Exogenous Factors Affecting Ridership

Ride demand across 2020–2023 is shaped by two compounding forces outside the pipeline's control: COVID-19 lockdown waves and London's natural winter cycling seasonality. Neither appears as a feature in the ML model, which is why XGBoost explains 49% of variance — the remaining 51% is largely attributable to these exogenous signals.

Period	Type	Impact
Mar–Jun 2020	🔴 Lockdown 1	~500K rides/month — first national lockdown
Nov 2020–Jan 2021	🔴 Lockdown 2 + 3	~400K rides/month — deepest trough in dataset
Dec 2021–Jan 2022	🟡 Omicron / Plan B	~650K — work from home reintroduced
Jan, Dec each year	🔵 Winter seasonality	Natural trough — cold weather, short daylight hours

Despite pandemic disruption, each subsequent summer peak exceeded the last — ~1.1M in summer 2020, ~1.2M in 2021, reaching 1,302,994 in July 2022 — confirming underlying demand recovered and grew year-on-year.

📊 The chart below is also available as a standalone file: docs/charts/chart_ride_trends_annotated.html

📈 View Ride Trends 2020–2023 (annotated)

Open docs/charts/chart_ride_trends_annotated.html in a browser for the full interactive chart with lockdown bands, winter seasonality overlays, and hover tooltips per month.

Key data points directly from citycycle_dev_marts.fact_rides:

Month	Rides	Event
Jan 2020	~700K	Pre-pandemic winter baseline
Apr 2020	~500K	Lockdown 1 trough (−30%)
Jul 2020	~1.1M	Post-lockdown summer recovery
Jan 2021	~400K	Deepest trough — Lockdown 3 + winter seasonality compounding
Jul 2022	1,302,994	Dataset peak — full recovery confirmed
Jan 2023	~240K	End of dataset winter trough (partial month)

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
BigQuery free tier exceeded	Medium	High	Mock data dev; LIMIT guards; dry-run estimates; budget alerts
Meltano tap-bigquery schema drift	Low	Medium	dbt schema tests; GE not-null/type checks catch regressions
Long BQ query runtime in CI	Medium	Medium	CI uses mock CSV only; no live BQ in GitHub Actions
ML model staleness	Medium	Medium	Retrain script in `train_demand_model.py`; model versioned in `ml/models/`
Dashboard downtime	Low	Low	Streamlit caches last-good result; graceful error states
Credentials leaked to Git	Low	Critical	.gitignore covers all credential patterns; .env.example only

Data Dictionary

Source Tables (`citycycle_raw`)

`cycle_hire` — Raw ride records

Field	Type	Description
`rental_id`	INT64	Unique identifier for each ride
`bike_id`	INT64	Identifier of the bike used
`duration`	INT64	Ride duration in seconds
`start_date`	TIMESTAMP	Date and time the ride began
`end_date`	TIMESTAMP	Date and time the ride ended
`start_station_id`	INT64	ID of the station where the bike was hired
`start_station_name`	STRING	Name of the hire station
`end_station_id`	INT64	ID of the station where the bike was returned (nullable — ~312K lost/unreturned bikes)
`end_station_name`	STRING	Name of the return station

`cycle_stations` — Raw station metadata

Field	Type	Description
`id`	INT64	Unique station identifier
`name`	STRING	Station name and location description
`terminal_name`	STRING	Physical terminal code on the docking unit
`latitude`	FLOAT64	Station latitude (WGS84)
`longitude`	FLOAT64	Station longitude (WGS84)
`docks_count`	INT64	Number of physical docking points at the station
`installed`	BOOL	Whether the station is currently installed
`locked`	BOOL	Whether the station is locked/out of service
`temporary`	BOOL	Whether the station is a temporary installation
`install_date`	DATE	Date the station was installed

Calculated Fields Summary

Approximately half of all fields in the star schema are engineered features derived from raw source data — not simply loaded from BigQuery. The table below summarises every calculated field and its formula across all three layers.

fact_rides

Field	Formula
`hire_date`	`DATE(start_datetime)`
`start_hour`	`EXTRACT(HOUR FROM start_datetime)`
`day_of_week`	`EXTRACT(DAYOFWEEK FROM start_datetime)` — 1=Sunday, 7=Saturday
`is_weekend`	`day_of_week IN (1, 7)`
`duration_minutes`	`duration_seconds / 60.0`
`duration_band`	`short` <10 min · `medium` 10–30 · `long` 30–60 · `extended` >60
`peak_hour_flag`	`1` if `start_hour IN (7, 8, 17, 18)` else `0`
`time_period`	`am_peak` (07–09) · `pm_peak` (17–19) · `midday` · `evening` · `night`
`season`	`spring` (Mar–May) · `summer` (Jun–Aug) · `autumn` (Sep–Nov) · `winter` (Dec–Feb)
`is_round_trip`	`TRUE` if `start_station_id = end_station_id`
`ride_sk`	Surrogate key — `dbt_utils.generate_surrogate_key(['rental_id'])`
`net_flow`	`total_departures - total_arrivals` per station per day (joined from `int_station_daily_stats`)
`imbalance_score`	`ABS(departures - arrivals) / MAX(departures + arrivals, 1)` — range 0 to 1
`imbalance_direction`	`draining` (net_flow > 0) · `filling` (net_flow < 0) · `balanced` (net_flow = 0)
`rebalancing_priority`	`CRITICAL` (≥0.25) · `HIGH` (≥0.18) · `MEDIUM` (≥0.10) · `LOW` (<0.10)
`rolling_7d_avg`	7-day rolling average of departures per station — used as ML feature

dim_stations

Field	Formula
`zone`	London area classification derived from lat/lon bounding boxes
`capacity_tier`	`small` (≤15 docks) · `medium` (≤24 docks) · `large` (>24 docks)
`avg_imbalance_score_7d`	All-time average imbalance score across full dataset
`rebalancing_priority`	Same formula as fact_rides — based on all-time avg imbalance score
`total_departures_all_time`	`COUNT(*)` of rides departing from station across full dataset
`total_arrivals_all_time`	`COUNT(*)` of rides arriving at station across full dataset
`station_sk`	Surrogate key — `dbt_utils.generate_surrogate_key(['station_id'])`

dim_date

Field	Formula
`year`	`EXTRACT(YEAR FROM full_date)`
`month`	`EXTRACT(MONTH FROM full_date)`
`day`	`EXTRACT(DAY FROM full_date)`
`week_num`	`EXTRACT(WEEK FROM full_date)`
`day_of_week`	`EXTRACT(DAYOFWEEK FROM full_date)`
`is_weekend`	`day_of_week IN (1, 7)`
`season`	Same CASE logic as fact_rides
`is_uk_bank_holiday`	Hardcoded for 2023–2024 only — FALSE for all 2020–2022 rides

Fields sourced directly from raw data (not calculated): rental_id, bike_id, start_datetime, end_datetime, start_station_id, end_station_id, station_name, latitude, longitude, nb_docks, is_installed, is_locked, zone, terminal_name

Staging Layer (`citycycle_dev_staging`)

`stg_cycle_hire` — Cleaned ride records

All raw fields are retained and the following are added or renamed:

Field	Type	Source	Description
`rental_id`	INT64	raw	Cast to INT64, null rows removed
`bike_id`	INT64	raw	Cast to INT64
`start_datetime`	TIMESTAMP	`start_date`	Renamed and cast to TIMESTAMP
`end_datetime`	TIMESTAMP	`end_date`	Renamed and cast to TIMESTAMP
`duration_seconds`	INT64	`duration`	Renamed to make unit explicit
`hire_date`	DATE	Calculated	`DATE(start_datetime)` — extracts the calendar date for partitioning and daily aggregations
`start_hour`	INT64	Calculated	`EXTRACT(HOUR FROM start_datetime)` — hour of day (0–23) used for temporal analysis and ML features
`day_of_week`	INT64	Calculated	`EXTRACT(DAYOFWEEK FROM start_datetime)` — 1=Sunday … 7=Saturday
`is_weekend`	BOOL	Calculated	`TRUE` if day_of_week IN (1, 7) — used to split commuter vs leisure demand patterns

Row filters applied in staging:

rental_id IS NOT NULL
duration_seconds BETWEEN 60 AND 86400
end_datetime > start_datetime

These filters reduce 32,369,326 raw rows to 32,342,086 in the final fact table.

`stg_cycle_stations` — Cleaned station records

Field	Type	Source	Description
`station_id`	INT64	`id`	Renamed for consistency
`station_name`	STRING	`name`	Renamed for clarity
`terminal_name`	STRING	raw	Physical terminal code
`latitude`	FLOAT64	raw	Cast to FLOAT64
`longitude`	FLOAT64	raw	Cast to FLOAT64
`nb_docks`	INT64	`docks_count`	Number of docking points
`is_installed`	BOOL	`installed`	Renamed for consistency
`is_locked`	BOOL	`locked`	Renamed for consistency
`is_temporary`	BOOL	`temporary`	Renamed for consistency
`install_date`	DATE	raw	Cast to DATE
`zone`	STRING	Calculated	London area classification based on lat/lon bounding boxes: `City & Shoreditch`, `Westminster & Victoria`, `Waterloo & Southbank`, `Camden & Islington`, `East End & Canary Wharf`, `Kensington & Chelsea`, `Other`.
`capacity_tier`	STRING	Calculated	Station size classification: `small` (≤15 docks), `medium` (≤24 docks), `large` (>24 docks).

Intermediate Layer (`citycycle_dev_intermediate`)

`int_rides_enriched` — Rides joined with station data

Joins stg_cycle_hire with stg_cycle_stations (twice — once for start, once for end station) and adds business logic flags:

Field	Type	Description
`duration_minutes`	FLOAT64	Calculated — `duration_seconds / 60.0`
`duration_band`	STRING	Calculated — `short` (<10 min), `medium` (10–30 min), `long` (30–60 min), `extended` (>60 min)
`peak_hour_flag`	INT64	Calculated — `1` if `start_hour IN (7, 8, 17, 18)`, else `0`. Marks London commuter peak hours. Core ML feature.
`time_period`	STRING	Calculated — `am_peak`, `pm_peak`, `midday`, `evening`, `night`
`is_round_trip`	BOOL	Calculated — `TRUE` if `start_station_id = end_station_id`
`start_zone`	STRING	Joined from `stg_cycle_stations` — zone of the departure station
`start_lat` / `start_lon`	FLOAT64	Joined — coordinates for geospatial mapping
`start_nb_docks`	INT64	Joined — dock capacity of the departure station
`start_capacity_tier`	STRING	Joined — size tier of the departure station
`end_zone`	STRING	Joined from `stg_cycle_stations` — zone of the return station
`end_lat` / `end_lon`	FLOAT64	Joined — coordinates of return station

`int_station_daily_stats` — Daily imbalance per station

Aggregates ride data to one row per station per day, computing the core rebalancing metrics:

Field	Type	Description
`hire_date`	DATE	Calendar date
`station_id`	INT64	Station identifier
`total_departures`	INT64	Calculated — count of rides starting at this station on this date
`total_arrivals`	INT64	Calculated — count of rides ending at this station on this date
`net_flow`	INT64	Calculated — `total_departures - total_arrivals`. Positive = draining, negative = filling
`imbalance_score`	FLOAT64	Calculated — `ABS(net_flow) / MAX(departures + arrivals, 1)`. Normalised 0–1 score.
`is_imbalanced`	BOOL	Calculated — `TRUE` if `imbalance_score > 0.20`
`imbalance_direction`	STRING	Calculated — `draining`, `filling`, or `balanced`
`utilisation_rate`	FLOAT64	Calculated — `(departures + arrivals) / (nb_docks × 2)`
`peak_departures`	INT64	Calculated — departures during peak hours only

Marts Layer (`citycycle_dev_marts`)

`fact_rides` — Final fact table (32,342,086 rows)

One row per ride, partitioned by hire_date (day granularity), clustered by start_station_id, end_station_id. Combines all enriched ride fields with station-level imbalance signals joined from int_station_daily_stats:

Field	Type	Description
`ride_sk`	STRING	Calculated — surrogate key via `dbt_utils.generate_surrogate_key(['rental_id'])`
`start_station_imbalance_score`	FLOAT64	Joined — imbalance score of the departure station on the ride date
`start_station_is_imbalanced`	BOOL	Joined — whether the departure station was flagged as imbalanced
`start_station_imbalance_direction`	STRING	Joined — `draining`, `filling`, or `balanced`
`start_station_net_flow`	INT64	Joined — net bike flow at the departure station on the ride date
`start_station_utilisation_rate`	FLOAT64	Joined — utilisation rate of the departure station
`start_station_rolling_7d_avg`	FLOAT64	Joined — 7-day rolling average demand, used as ML feature

`dim_stations` — Station dimension table (798 rows)

One row per station with all-time average imbalance metrics:

Field	Type	Description
`station_sk`	STRING	Calculated — surrogate key
`avg_imbalance_score_7d`	FLOAT64	Calculated — all-time average imbalance score (uses full dataset average; named `_7d` for historical reasons)
`rebalancing_priority`	STRING	Calculated — `CRITICAL` (≥0.25), `HIGH` (≥0.18), `MEDIUM` (≥0.10), `LOW` (<0.10)
`total_departures_all_time`	INT64	Calculated — cumulative departures across the full dataset
`total_arrivals_all_time`	INT64	Calculated — cumulative arrivals across the full dataset
`last_activity_date`	DATE	Most recent date with recorded activity at this station

`dim_date` — Date dimension (4,000 rows)

Date spine from 2015 to 2025:

Field	Type	Description
`date_id`	INT64	Primary key — YYYYMMDD integer
`full_date`	DATE	Full date value
`year`	INT64	Calendar year
`month`	INT64	Month number (1–12)
`week_number`	INT64	ISO week number
`day_of_week`	INT64	Day number (1=Sunday … 7=Saturday)
`is_weekend`	BOOL	TRUE for Saturday and Sunday
`season`	STRING	`spring`, `summer`, `autumn`, `winter` based on month
`is_uk_bank_holiday`	BOOL	TRUE for UK bank holidays — hardcoded for 2023–2024 only

Future Improvements & Next Steps

The current pipeline establishes the core ELT infrastructure, imbalance detection, and demand forecasting. The following extensions would move it toward a full production rebalancing system:

#	Improvement	Description
01	Enrich the feature set	Integrate TfL's live BikePoint API for real-time dock occupancy, Met Office weather data, and UK event calendars. Target: push R² from 0.488 toward 0.70+ by capturing exogenous demand signals.
02	Segment-aware rebalancing schedules	Operationalise the K-Means segmentation — automatically tag stations by dominant rider type and generate differentiated crew schedules: pre-AM stocking for commuter stations on weekdays, mid-morning restocking for leisure stations on weekends.
03	Predictive dock capacity alert	Surface same-day shortfall alerts when the XGBoost forecast predicts more departures than current bike count — before the imbalance accumulates over multiple days.
04	Expand to other cities	The pipeline is city-agnostic. Pointing the Meltano tap at Dublin Bikes, New York Citi Bike, or Paris Vélib public datasets replicates the full intelligence pipeline with minimal rework.
05	Automate the full production pipeline	Activate the Dagster `citycycle_daily_02utc` schedule — replace `mock_bq_load_asset` with `meltano_ingest_asset` to run nightly ingestion, transformation, and quality checks automatically.

Contributing

Fork and create a feature branch: git checkout -b feat/your-feature
Develop against mock data only (--target dev in dbt)
Run dbt test before committing
Open a PR against main — CI will run linting and mock-data tests
Never commit .env, profiles.yml, or any *keyfile*.json

DSAI4 Module 2 · Team 2 · March 2026

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
.github/workflows		.github/workflows
.streamlit		.streamlit
analysis/notebooks		analysis/notebooks
dashboard		dashboard
data/mock		data/mock
docs		docs
ingestion		ingestion
ml/models		ml/models
orchestration		orchestration
quality		quality
transform		transform
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

🚲 CityCycle London — Bike Rebalancing Intelligence Pipeline

Table of Contents

Business Problem

Solution Overview

Architecture

Tech Stack

Repository Structure

Getting Started

Prerequisites

1. Clone & Install

2. Configure Environment

3. Run with Mock Data First (Recommended)

4. Run Full Pipeline (Real Data)

Mock Data Strategy

Why Mock Data First?

Our Approach

Mock Data Schema

Pipeline Walkthrough

1. Ingestion (Meltano)

2. Data Warehouse Design

3. ELT Transformation (dbt)

4. Data Quality (Great Expectations)

5. Analysis & ML

EDA (notebooks)

Demand Forecasting Model

6. Orchestration

GitHub Actions CI (code quality)

Dagster (pipeline orchestration — proof of concept)

7. Dashboards

Streamlit (Operational) — 5 pages

Looker Studio (Executive)

Key Findings (Live Data — 32M Rides, 2020–2023)

Exogenous Factors Affecting Ridership

Risks & Mitigations

Data Dictionary

Source Tables (citycycle_raw)

cycle_hire — Raw ride records

cycle_stations — Raw station metadata

Calculated Fields Summary

fact_rides

dim_stations

dim_date

Staging Layer (citycycle_dev_staging)

stg_cycle_hire — Cleaned ride records

stg_cycle_stations — Cleaned station records

Intermediate Layer (citycycle_dev_intermediate)

int_rides_enriched — Rides joined with station data

int_station_daily_stats — Daily imbalance per station

Marts Layer (citycycle_dev_marts)

fact_rides — Final fact table (32,342,086 rows)

dim_stations — Station dimension table (798 rows)

dim_date — Date dimension (4,000 rows)

Future Improvements & Next Steps

Contributing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0