I built this project to turn the London bike sharing data into a clear, reproducible analysis with a small KPI set, a baseline predictive model, and a Tableau-ready export.
- Quantify ride demand patterns over time and by weather
- Report a concise KPI summary (average, median, max, min daily rides)
- Build a baseline model to predict ride counts
- Export Tableau-ready files and a dashboard
Data source: the Kaggle dataset `hmavrodiev/london-bike-sharing-dataset`, downloaded via the Kaggle CLI.
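If you prefer to script the download rather than call the CLI directly, a minimal sketch with the Kaggle Python API looks like this (it assumes credentials are configured in `~/.kaggle/kaggle.json`, the same setup the CLI uses):

```python
import zipfile

import kaggle  # authenticates on import using ~/.kaggle/kaggle.json

# Download the dataset archive into the current directory, then extract it.
kaggle.api.dataset_download_files("hmavrodiev/london-bike-sharing-dataset", path=".")
with zipfile.ZipFile("london-bike-sharing-dataset.zip") as zf:
    zf.extractall()
```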
- Input: `london_merged.csv`
- Raw download: `london-bike-sharing-dataset.zip`
- Output (notebook): `london_bikes_final.xlsx`, `london_bikes_final.csv`
- Output (script): `outputs/data_summary.json`, `outputs/kpis.json`, `outputs/model_metrics.json`, `outputs/model_cv_metrics.json`
- Dashboard: `London Bike Rides.twbx`
- Input columns: `timestamp`, `cnt`, `t1`, `t2`, `hum`, `wind_speed`, `weather_code`, `is_holiday`, `is_weekend`, `season`
- Renamed columns: `time`, `count`, `temp_real_C`, `temp_feels_like_C`, `humidity_percent`, `wind_speed_kph`, `weather`, `is_holiday`, `is_weekend`, `season` (the rename step is sketched after this list)
- Time coverage: 2015-01-04 00:00:00 to 2017-01-03 23:00:00
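A minimal sketch of the load-and-rename step implied by the mapping above (the column pairs come from the lists, not from the notebook's exact code):

```python
import pandas as pd

# Load the raw CSV and parse timestamps to datetime.
bikes = pd.read_csv("london_merged.csv")
bikes["timestamp"] = pd.to_datetime(bikes["timestamp"])

# Rename columns for readability, per the mapping documented above.
bikes = bikes.rename(columns={
    "timestamp": "time",
    "cnt": "count",
    "t1": "temp_real_C",
    "t2": "temp_feels_like_C",
    "hum": "humidity_percent",
    "wind_speed": "wind_speed_kph",
    "weather_code": "weather",
})

# Compute time coverage.
print(bikes["time"].min(), "to", bikes["time"].max())
```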
- Import pandas, zipfile, and kaggle libraries.
- Download the dataset from Kaggle via CLI.
- Extract the downloaded zip file.
- Load `london_merged.csv` into a pandas DataFrame.
- Parse timestamps to datetime.
- Inspect structure and size (`info`, `shape`, preview).
- Check category distributions for `weather_code` and `season`.
- Rename columns for readability and compute time coverage.
- Convert humidity to a 0–1 proportion.
- Map season and weather codes to descriptive labels.
- Run validation and range checks (missing values, duplicates, negative counts, humidity bounds).
- Engineer lagged and rolling features from historical counts.
- Train a baseline regression model and record evaluation metrics.
- Export the cleaned dataset to `london_bikes_final.xlsx` and `london_bikes_final.csv` (the transformation, validation, and export steps are sketched after this list).
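The following sketch covers the transformation, validation, and export steps. The season and weather label strings follow the Kaggle dataset description and are my reading of it, not necessarily the notebook's exact wording:

```python
# Label maps as described on the Kaggle dataset page (assumed, not verified
# against the notebook's exact strings).
season_map = {0: "spring", 1: "summer", 2: "autumn", 3: "winter"}
weather_map = {1: "Clear", 2: "Scattered clouds", 3: "Broken clouds", 4: "Cloudy",
               7: "Rain", 10: "Rain with thunderstorm", 26: "Snowfall",
               94: "Freezing fog"}

# Convert humidity to a 0-1 proportion and map codes to descriptive labels.
bikes["humidity_percent"] = bikes["humidity_percent"] / 100
bikes["season"] = bikes["season"].astype(int).map(season_map)
bikes["weather"] = bikes["weather"].astype(int).map(weather_map)

# Validation and range checks.
assert bikes.isna().sum().sum() == 0, "missing values found"
assert bikes.duplicated().sum() == 0, "duplicate rows found"
assert (bikes["count"] >= 0).all(), "negative counts found"
assert bikes["humidity_percent"].between(0, 1).all(), "humidity out of range"

# Export the cleaned dataset for Tableau.
bikes.to_excel("london_bikes_final.xlsx", sheet_name="Data", index=False)
bikes.to_csv("london_bikes_final.csv", index=False)
```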
From `outputs/kpis.json` (a reproduction sketch follows this list):
- Average daily rides: 27,268.45
- Median daily rides: 27,011.50
- Max daily rides: 72,504 (2015-07-09)
- Min daily rides: 4,869 (2016-01-03)
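A hedged sketch of how these daily figures could be reproduced from the hourly data; the dictionary keys are illustrative, not confirmed against `outputs/kpis.json`:

```python
# Aggregate hourly counts to daily totals, then summarize.
daily = bikes.set_index("time")["count"].resample("D").sum()
kpis = {
    "avg_daily_rides": round(float(daily.mean()), 2),
    "median_daily_rides": float(daily.median()),
    "max_daily_rides": int(daily.max()),
    "min_daily_rides": int(daily.min()),
}
print(daily.idxmax().date(), daily.max())  # busiest day and its count
```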
I used a Random Forest regressor with a time-based split and lag features, and compared it against Gradient Boosting with time-series cross-validation. Metrics are stored in `outputs/model_metrics.json` and `outputs/model_cv_metrics.json`.
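A sketch of that baseline under assumptions about the feature set: the specific lags, rolling window, and hyperparameters here are mine, not the notebook's exact configuration:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Engineer lagged and rolling features from historical counts (assumed lag set).
feats = bikes[["count"]].copy()
for lag in (1, 2, 3, 24):
    feats[f"lag_{lag}"] = feats["count"].shift(lag)
feats["rolling_24h"] = feats["count"].shift(1).rolling(24).mean()
feats = feats.dropna()

# Time-based split: train strictly on the past, evaluate on the future.
X, y = feats.drop(columns="count"), feats["count"]
split = int(len(X) * 0.8)
model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
model.fit(X.iloc[:split], y.iloc[:split])
pred = model.predict(X.iloc[split:])

rmse = mean_squared_error(y.iloc[split:], pred) ** 0.5
mae = mean_absolute_error(y.iloc[split:], pred)
r2 = r2_score(y.iloc[split:], pred)
```

For the cross-validated comparison, `sklearn.model_selection.TimeSeriesSplit` provides the expanding-window folds: each fold trains on earlier data and evaluates on the segment that follows it.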
- `london_bikes.ipynb`: primary analysis notebook.
- `london_merged.csv`: raw dataset used in the notebook.
- `london-bike-sharing-dataset.zip`: downloaded dataset archive.
- `london_bikes_final.xlsx`: cleaned output used for Tableau.
- `London Bike Rides - Moving Average and Heatmap.twbx`: Tableau workbook.
- `London Bike Rides.twbx`: Tableau dashboard.
- `requirements.txt`: Python dependencies inferred from notebook imports.
- `scripts/run_analysis.py`: standalone analysis script for KPIs, plots, and ML baseline.
- `outputs/`: generated KPI and model metrics.
- `assets/`: generated plots.
- `.DS_Store`: macOS metadata file, not used by the analysis.
- pandas
- kaggle
- zipfile (Python standard library)
- scikit-learn
- numpy
- matplotlib
- seaborn
- Install dependencies from `requirements.txt`.
- Open `london_bikes.ipynb` in Jupyter.
- Ensure the Kaggle CLI is configured, then run the Kaggle download cell.
- Run the extraction, loading, transformation, and validation cells in order.
- Run the analysis script to generate KPIs, plots, and metrics: `python scripts/run_analysis.py`.
- Confirm `london_bikes_final.xlsx`, `london_bikes_final.csv`, `outputs/*.json`, and `assets/*.png` are created (a quick existence check is sketched after this list).
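One way to script the final confirmation step, as a small sketch:

```python
from pathlib import Path

# Check that the expected artifacts exist after a full run.
for name in ("london_bikes_final.xlsx", "london_bikes_final.csv"):
    assert Path(name).exists(), f"missing {name}"
assert list(Path("outputs").glob("*.json")), "no JSON metrics in outputs/"
assert list(Path("assets").glob("*.png")), "no plots in assets/"
```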
- Dataset size reported as 17,414 rows and 10 columns.
- No missing values reported across all 10 columns in `bikes.info()`.
- Weather code counts and season counts are displayed for categorical checks.
- Preview of renamed and mapped columns shown via `bikes.head()`.
- Time coverage: 2015-01-04 00:00:00 to 2017-01-03 23:00:00.
- Validation results: 0 duplicates, 0 negative counts, no out-of-range humidity values.
- KPI highlights: average daily rides 27,268.45; max daily rides 72,504 on 2015-07-09.
- Baseline model metrics: RMSE 149.54, MAE 86.23, R2 0.982 (time-based split).
- Cross-validation (time-series) averages: Random Forest RMSE 183.60, MAE 104.41, R2 0.971; Gradient Boosting RMSE 261.36, MAE 172.61, R2 0.942.
- Validation outputs are not saved; results require running the notebook.
- Baseline model uses a single train/test split without hyperparameter tuning.
- Data source details beyond the Kaggle dataset identifier are not documented in the notebook.
- Persist validation outputs (e.g., save a CSV/JSON report) for reproducibility; see the sketch after this list.
- Add a brief data dictionary or source notes based on the Kaggle dataset page.
- Parameterize file paths and outputs for easier reuse.
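For the first item, a minimal sketch of persisting the validation checks as JSON; the file name and keys are illustrative, not an existing part of the project:

```python
import json
from pathlib import Path

# Summarize the notebook's validation checks in a machine-readable report.
report = {
    "duplicates": int(bikes.duplicated().sum()),
    "negative_counts": int((bikes["count"] < 0).sum()),
    "humidity_out_of_range": int((~bikes["humidity_percent"].between(0, 1)).sum()),
}
Path("outputs").mkdir(exist_ok=True)
Path("outputs/validation_report.json").write_text(json.dumps(report, indent=2))
```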
License: unknown. To verify, check for a LICENSE file in the repository root or in the repository metadata.


