This repository documents my participation in the EY Data Science Challenge, where I developed a predictive model for the Urban Heat Island (UHI) Index in New York City. I integrated a variety of open-source datasets and engineered over 260 features to capture urban morphology, environmental factors, socio-economic characteristics, and atmospheric conditions. A robust machine learning pipeline was implemented, leveraging ensemble models to achieve high predictive performance. The final model achieved a validation R² score of 0.9822 using an ExtraTreesRegressor.
I utilized a diverse set of open-source datasets, linked to the original training and validation datasets via latitude and longitude coordinates. Below is a detailed breakdown of the datasets, their sources, and the features engineered from them.
Features:
- Building counts within various radii: `building_count_10m`, `building_count_20m`, ..., `building_count_1000m`
- Building height and area metrics: `Tallest_Building_*_HEIGHT`, `Average_Building_Height_*` (10m to 1000m), `Total_Building_Area_500m`
Sources:
- Contest Dataset: Building_Footprint.kml
- NYC Open Data: Building Footprints
Algorithms:
- Counted building footprints within multiple distance thresholds (10m to 1000m) for each point in the training and validation datasets.
- Extracted building height features using the `HEIGHTROOF` attribute via spatial joins with the NYC dataset.
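The radius-count step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes building footprints have been reduced to centroids and that both point sets are already in a projected coordinate system measured in metres. The function name `count_within_radii` and the toy coordinates are hypothetical.

```python
import numpy as np

def count_within_radii(points_xy, buildings_xy, radii_m):
    """Count building centroids within each radius of every sample point.

    Both inputs are (n, 2) arrays of projected coordinates in metres.
    Returns a dict mapping a feature name to an (n_points,) integer array.
    """
    points_xy = np.asarray(points_xy, dtype=float)
    buildings_xy = np.asarray(buildings_xy, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_buildings).
    diff = points_xy[:, None, :] - buildings_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return {f"building_count_{r}m": (dist <= r).sum(axis=1) for r in radii_m}

# Toy data (hypothetical): two sample points and four building centroids.
points = np.array([[0.0, 0.0], [500.0, 0.0]])
buildings = np.array([[5.0, 0.0], [15.0, 0.0], [400.0, 0.0], [900.0, 0.0]])
counts = count_within_radii(points, buildings, radii_m=[10, 20, 1000])
```

The brute-force distance matrix is fine for a toy example; for a city-scale footprint layer a spatial index would likely be needed.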
Features:
- Distance metrics: `dist_to_road`, `dist_to_park`, `dist_to_water`
- Ratio and density metrics: `roads_ratio_*`, `parks_ratio_*`, `water_ratio_*`, `water_kde`, `weighted_water_score` (100m to 1000m)
- Land use ratios: `landuse_residential_ratio_*`, `landuse_commercial_ratio_*`, etc. (100m to 1000m)
- Transit and pedestrian features: `transit_count_*`, `ped_cycle_count_*`, `parking_area_ratio_*` (100m to 1000m)
- Additional urban geometry: `svf_100m` (sky view factor), `road_major_ratio_*`, `road_minor_ratio_*`
Source:
- OpenStreetMap via OSMnx
Algorithms:
- Computed distances to roads, parks, and water bodies using OSMnx.
- Calculated asset ratios (e.g., roads, parks, water) within multiple buffer zones.
- Applied kernel density estimations (KDE) and weighted scores to capture the spatial distribution of urban assets.
Features:
- Tree counts and average diameters: `tree_count_50m`, `tree_avg_diam_50m`, ..., `tree_count_1000m`, `tree_avg_diam_1000m`
Source:
- NYC Open Data: 2015_Street_Tree_Census_-_Tree_Data_20250221.csv
Algorithms:
- For each location, counted trees and computed average diameters within various distance thresholds (50m to 1000m).
Features:
- Pollutant metrics: Fine particles (PM 2.5), Nitrogen dioxide (NO2), Ozone (O3)
- Summary statistics: `pm2.5_avg_JJA`, `pm2.5_median_JJA`, ..., `pm2.5_val_21Jul2021`, and similarly for SO2, CO, NO2, and Ozone
Sources:
- NYC Open Data: Air_Quality_20250221.csv
- AirNow.gov (US EPA, NOAA, NASA): AQ
Algorithms:
- Filtered raw data by time periods (e.g., June to August 2021) and pivoted to create location-wise pollutant summaries.
- Spatially joined pollutant data with training/validation points using a nearest-neighbor approach.
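The filter-and-pivot step might look roughly like the sketch below. The column names (`site_id`, `pollutant`, `date`, `value`) and the toy readings are hypothetical and do not reflect the real schema of `Air_Quality_20250221.csv`; only the `*_avg_JJA` / `*_median_JJA` naming follows the feature names listed above.

```python
import pandas as pd

# Toy long-format readings (hypothetical schema and values).
raw = pd.DataFrame({
    "site_id": ["A", "A", "A", "B", "B", "B"],
    "pollutant": ["pm2.5", "pm2.5", "NO2", "pm2.5", "NO2", "NO2"],
    "date": pd.to_datetime([
        "2021-06-15", "2021-07-21", "2021-07-21",
        "2021-08-01", "2021-05-30", "2021-07-21",
    ]),
    "value": [8.0, 12.0, 20.0, 10.0, 99.0, 16.0],
})

# Keep June to August 2021 only.
summer = raw[(raw["date"] >= "2021-06-01") & (raw["date"] <= "2021-08-31")]

# One row per site, one column per (statistic, pollutant) pair.
summary = summer.pivot_table(
    index="site_id", columns="pollutant", values="value",
    aggfunc=["mean", "median"],
)

# Flatten the two-level column index into names such as pm2.5_avg_JJA.
suffix = {"mean": "avg", "median": "median"}
summary.columns = [f"{pol}_{suffix[stat]}_JJA" for stat, pol in summary.columns]
```

The resulting site-level table is what a nearest-neighbour join would then attach to each training or validation point.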
Feature:
- `elevation`
Source:
- Python package `pyhigh`
Algorithm:
- Assigned elevation values based on geographical coordinates using the `pyhigh` package.
Features:
- Weather measurements: `air_temp_surface`, `relative_humidity`, `wind_speed`, `wind_direction`, `solar_flux`, `s2_value`
Source:
- Contest Dataset: NY_Mesonet_Weather.xlsx
Algorithms:
- Filtered data for a specific time window (e.g., 15:00–16:00 on 2021-07-24) and computed mean values.
- Assigned weather features to each location based on the nearest weather station using the Haversine distance.
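As an illustration of the nearest-station assignment, here is a small standard-library sketch of the Haversine distance. The station names and coordinates are placeholders, not the actual NY Mesonet station list, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_station(lat, lon, stations):
    """Return the key of the station closest to (lat, lon)."""
    return min(stations, key=lambda name: haversine_km(lat, lon, *stations[name]))

# Placeholder stations: name -> (latitude, longitude).
stations = {
    "station_a": (40.8725, -73.8935),
    "station_b": (40.7676, -73.9645),
}
closest = nearest_station(40.78, -73.97, stations)
```

Each point then inherits the averaged weather values of its closest station.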
Features:
- Sentinel-2 indices and bands: `2021_08_25_00_00_2021_08_25_23_59_Sentinel_2_L2A_NDVI`, `NDWI`, `Moisture_index`, `False_color`, etc.
- Sentinel-3 brightness temperatures and reflectance: `2021_07_24_00_00_2021_07_24_23_59_Sentinel_3_SLSTR_F1_Brightness_Temperature`, etc.
- Landsat 8: `L8_ST_B10_raw`, `L8_ST_B10_C`, `lst_value`, `lst_value_ndvi`
Source:
- Contest Dataset: Sentinel2, Sentinel_3, Landsat_LST.tiff, Landsat_NDVI.tiff, S2_DATA.tiff
Algorithms:
- Performed raster sampling using the `rasterio` library to extract pixel values based on latitude and longitude.
- Computed derived indices (e.g., NDVI, NDWI) to assess vegetation health, moisture levels, and urban heat signatures.
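The derived indices are normalized band differences. The sketch below assumes the standard NDVI definition, (NIR - Red) / (NIR + Red), and the green/NIR form of NDWI. The README does not state which NDWI variant was used, so that choice is an assumption, and the reflectance values are made up.

```python
import numpy as np

def normalized_difference(band_a, band_b):
    """Compute (a - b) / (a + b) element-wise, returning NaN where a + b == 0."""
    a = np.asarray(band_a, dtype=float)
    b = np.asarray(band_b, dtype=float)
    denom = a + b
    out = np.full(denom.shape, np.nan)
    # Only divide where the denominator is non-zero; other cells stay NaN.
    np.divide(a - b, denom, out=out, where=denom != 0)
    return out

# Made-up reflectance samples for three pixels.
nir = np.array([0.5, 0.3, 0.0])
red = np.array([0.1, 0.3, 0.0])
green = np.array([0.2, 0.1, 0.0])

ndvi = normalized_difference(nir, red)    # vegetation index
ndwi = normalized_difference(green, nir)  # water index (green/NIR variant, assumed)
```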
Cooling Tower Data:
- Features: `tower_count_100m`, `tower_count_200m`, `tower_count_500m`, `tower_count_1000m`
- Source: NYC_Cooling_Tower_Registrations_20250224.csv
Energy & Water Consumption Data:
- Features: `sum_net_emissions_mtco2e_*`, `sum_weather_normalized_site_energy_use_(kbtu)_*`, `sum_weather_normalized_site_natural_gas_use_(therms)_*`, `sum_weather_normalized_site_electricity_(kwh)_*` (500m, 1000m)
- Source: Energy_and_Water_Data_Disclosure_for_Local_Law_84_2022__Data_for_Calendar_Year_2021__20250224.csv
Census Tract Data:
- Features: Socio-economic metrics like `population_density`, `total_population`, `median_income`, `poverty_count`, ..., `crowded_households`, and aggregated values (`Pop_300m`, `Income_1000m`, etc.)
- Source: nyc_census_tracts.csv
Hyperlocal Temperature Data:
- Features: Temperature statistics for June, July, August 2018: `_AvgTemp_6`, `_MaxTemp_7`, `_UHI_8`, etc.
- Source: Hyperlocal_Temperature_Monitoring_20250311.csv
Wind Atlas Data:
- Features: `air_density`, `power_density`
- Source: USA_air-density_10m.tif, USA_power-density_10m.tif
Street Pavement & Monthly Weather Data:
- Features: Pavement ratings (`Width`, `Rating_B`), climate bands (`nclimgrid_band1`, ..., `nclimgrid_band4`)
- Sources: StreetAssessmentRating, nclimgrid-monthly-202107.tif
Traffic Volume Data:
- Features: `Traffic_Volume_Avg`, `Traffic_Volume_min`, `Traffic_Volume_max`, `Traffic_Volume_med`
- Source: Automated_Traffic_Volume_Counts_20250319.csv
A comprehensive machine learning pipeline was designed to predict the UHI Index. Below is an overview of the process, as implemented in the shared code:
- Initial Processing: Dropped non-model columns (`Latitude`, `Longitude`, `datetime`) to focus on predictive features.
- Missing Value Handling:
  - Dropped columns with more than 50% missing values.
  - Imputed remaining missing values using a KNN Imputer (`n_neighbors=5`).
  - Filled residual NaN values with 0.
- Feature Engineering: Created interaction features by multiplying pairs of environmental variables (e.g., temperature, building, water, tree, park-related features) to capture non-linear relationships, limited to a subset of 5 features to control feature explosion.
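The missing-value steps listed above could be sketched like this, using scikit-learn's `KNNImputer`. The helper name `handle_missing` and the toy frame are hypothetical; the 50% threshold and `n_neighbors=5` come from the description, and the rest is an assumption about how the steps fit together.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def handle_missing(df, max_missing=0.5, n_neighbors=5):
    """Drop sparse columns, KNN-impute the rest, then zero-fill any leftovers."""
    # Keep only columns whose missing fraction does not exceed max_missing.
    keep = df.columns[df.isna().mean() <= max_missing]
    imputer = KNNImputer(n_neighbors=n_neighbors)
    imputed = imputer.fit_transform(df[keep])
    clean = pd.DataFrame(imputed, columns=keep, index=df.index)
    # Safety net for any value the imputer could not fill.
    return clean.fillna(0)

# Toy frame: "c" is 4/6 missing and gets dropped, "b" has one gap to impute.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.0, 2.0, np.nan, 3.0, 4.0, 5.0],
    "c": [np.nan, np.nan, np.nan, np.nan, 1.0, 2.0],
})
clean = handle_missing(toy)
```

In the toy frame the single gap in `b` is filled with the mean of its five donor rows, since exactly five rows have a value for `b`.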
- Feature Distribution Analysis: Examined all 260+ features for their distributions using histograms and boxplots to identify skewness, outliers, and deviations from normality (e.g., `building_count_1000m` showed right-skewness).
- Missing Value Assessment: Calculated the percentage of missing values per feature, flagging columns for removal or imputation.
- Correlation Analysis: Computed a correlation matrix to identify multicollinearity.
- Outlier Detection: Used Z-scores and IQR methods to detect and cap outliers (e.g., in `tree_avg_diam_1000m`).
- Target Variable Analysis: Analyzed the UHI Index for its distribution, noting a near-normal distribution with slight right-skewness.
- Feature-Target Relationships: Used scatter plots and Pearson correlations to assess relationships (e.g., `nclimgrid_band1` and `Income_1000m` showed strong correlations with the UHI Index).
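For the IQR part of the outlier step, a minimal capping sketch follows. The conventional multiplier `k=1.5` is an assumption, as the README does not give the exact fences used, and the sample values are invented.

```python
import numpy as np

def cap_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR], ignoring NaNs when computing quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)

# Invented sample with one extreme value.
diameters = [1.0, 2.0, 3.0, 4.0, 100.0]
capped = cap_outliers_iqr(diameters)
```

Here Q1 = 2 and Q3 = 4, so the upper fence is 4 + 1.5 × 2 = 7 and the value 100 is pulled down to 7.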
- Methods Explored: Tested multiple feature selection techniques, including Information Value (IV), Recursive Feature Elimination (RFE), Genetic Algorithms (GA), Simulated Annealing (SA), Boruta, Variable Importance (VI), and Linear Regression (LR).
- Best Method: Used an ExtraTreesRegressor with a median threshold to select 50% of the features, reducing dimensionality while retaining predictive power.
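A median-threshold selection of this kind can be expressed with scikit-learn's `SelectFromModel`. The sketch below uses synthetic data and arbitrary estimator settings, so it illustrates the mechanism rather than reproducing the project's configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: only the first two of ten features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Keep features whose importance is at or above the median importance.
selector = SelectFromModel(
    ExtraTreesRegressor(n_estimators=100, random_state=0),
    threshold="median",
)
X_selected = selector.fit_transform(X, y)
mask = selector.get_support()  # boolean mask over the original columns
```

With ten features and no tied importances, the median threshold retains five of them, which matches the "select 50% of the features" behaviour described above.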
- Initial Model Exploration: Tested Neural Networks, Linear Regression, Random Forest, Gradient Boosting, and CatBoost, but they showed low performance (R² < 0.95).
- Selected Models: Focused on three ensemble models: ExtraTreesRegressor, XGBoost, and LightGBM.
- Hyperparameter Optimization (HPO):
  - ExtraTrees & XGBoost: Used Optuna with 200 trials each, employing a Tree-structured Parzen Estimator (TPE) sampler to maximize cross-validation R² scores.
  - LightGBM: Used Bayesian Optimization via BayesSearchCV with 30 iterations.
  - Cross-validation (5-fold) was used to evaluate model performance during HPO.
- Model Blending: Attempted a weighted blending approach, combining predictions from ExtraTrees, XGBoost, and LightGBM using weights proportional to their validation R² scores. However, this did not improve performance over the individual models.
- Best Model: The ExtraTreesRegressor outperformed others with a validation R² score of 0.973478. Model blending yielded an R² of 0.973082, so ExtraTrees was selected for the final predictions.
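The blending rule described above (weights proportional to each model's validation R²) can be sketched as follows. The function name `blend`, the predictions, and the scores are illustrative placeholders, not the project's values.

```python
import numpy as np

def blend(predictions, r2_scores):
    """Weighted average of model predictions, with weights proportional to R²."""
    names = list(predictions)
    scores = np.array([r2_scores[n] for n in names], dtype=float)
    weights = scores / scores.sum()  # normalise so the weights sum to 1
    stacked = np.column_stack([predictions[n] for n in names])
    return stacked @ weights, dict(zip(names, weights))

# Placeholder predictions and scores chosen for easy arithmetic.
preds = {
    "extratrees": np.array([1.0, 2.0]),
    "xgboost": np.array([1.0, 3.0]),
    "lightgbm": np.array([1.0, 4.0]),
}
scores = {"extratrees": 0.5, "xgboost": 0.3, "lightgbm": 0.2}
blended, weights = blend(preds, scores)
```

When the individual R² scores are close to each other, as is typical for tuned tree ensembles, these weights are close to uniform and the blend behaves much like a plain average.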
- Retraining: Retrained the ExtraTrees model on the full training dataset using the best hyperparameters.
- Prediction: Generated predictions on the validation dataset and created the submission file by combining the predicted UHI Index with the original latitude and longitude coordinates.
- Output Files: Saved the submission (`extratrees_submission_[timestamp].csv`), hyperparameters, blending weights, and feature importance metrics for reference.
The top 10 features contributing to the ExtraTrees model’s predictions were:
- `nclimgrid_band1`: 0.039678 (climate grid data)
- `Income_1000m`: 0.032545 (average income within 1000m)
- `Income_500m`: 0.025860 (average income within 500m)
- `Average_Building_Height_1000m`: 0.023096 (average building height within 1000m)
- `tree_avg_diam_1000m`: 0.020426 (average tree diameter within 1000m)
- `roads_ratio_1000m`: 0.020365 (road ratio within 1000m)
- `nclimgrid_band4`: 0.019333 (climate grid data)
- `Traffic_Volume_med`: 0.017134 (median traffic volume)
- `Average_Building_Height_750m`: 0.016680 (average building height within 750m)
- `parks_ratio_1000m`: 0.016602 (park ratio within 1000m)
These features highlight the importance of climate data, socio-economic factors, urban structure, and green spaces in predicting UHI intensity.
- `feature_engineering.ipynb`: Contains the code for loading datasets, performing spatial joins, and engineering features (e.g., building counts, OSM ratios, satellite indices).
- `model_training.py`: Implements the machine learning pipeline, including preprocessing, feature selection, model training, hyperparameter optimization, blending, and submission generation.
- `feature_stats_[timestamp].csv`: Saves feature statistics (mean, std, missing percentage) for analysis.
- `selected_features_[timestamp].csv`: Lists the features selected by the ExtraTrees feature selection step.
- `et_feature_importance_[timestamp].csv`: Feature importance scores from the final ExtraTrees model.
- `extratrees_model_[timestamp].pkl`: Saved ExtraTrees model for production use.
- `extratrees_submission_[timestamp].csv`: Final submission file with predicted UHI Index values.
The project relies on the following Python libraries:
- `pandas`, `numpy`: Data manipulation and numerical operations
- `geopandas`, `shapely`, `osmnx`: Geospatial data processing
- `scipy`, `sklearn`: Feature selection, imputation, and machine learning
- `optuna`, `skopt`: Hyperparameter optimization
- `xgboost`, `lightgbm`, `catboost`: Ensemble models
- `rasterio`: Satellite imagery processing
- `matplotlib`: Visualization for EDA
- `joblib`: Model serialization
Best Model: ExtraTreesRegressor
Validation R² Score in submission: 0.9822
Execution Time: Pipeline completed in 25805.15 seconds (430.09 minutes).
The pipeline successfully captured the complex relationships between urban features and the UHI Index, achieving strong predictive performance.
Below is a comprehensive list of all files in the repository, including datasets, notebooks, and output files, with their purposes and links.
- 1-building-density.ipynb: Processes building footprint data to calculate building counts and density metrics within various radii (e.g., `building_count_10m` to `building_count_1000m`).
- 2-mesonet-weather.ipynb: Extracts weather data from the NY Mesonet dataset (`NY_Mesonet_Weather.xlsx`) and assigns features like `air_temp_surface`, `wind_speed`, etc., to training/validation datasets based on the nearest weather station.
- 2-sat-data-lsat-s2-s3.ipynb: Processes satellite imagery data (Sentinel-2, Sentinel-3, Landsat) to extract features like `s2_value`, `lst_value`, `L8_ST_B10_raw`, `2021_08_25_00_00_2021_08_25_23_59_Sentinel_2_L2A_NDVI`, etc.
- 4-building-height.ipynb: Extracts building height features (e.g., `Tallest_Building_*_HEIGHT`, `Average_Building_Height_*`) using the `HEIGHTROOF` attribute from building footprint data.
- 5-cooling-towers.ipynb: Processes cooling tower data (`NYC_Cooling_Tower_Registrations_20250224.csv`) to calculate tower counts within various radii (e.g., `tower_count_100m`).
- 6-trees-data.ipynb: Processes street tree data (`2015_Street_Tree_Census_-_Tree_Data_20250221.csv`) to calculate tree counts and average diameters within various radii (e.g., `tree_count_50m`, `tree_avg_diam_1000m`).
- 7-nyc-air-quality-surveillance-data.ipynb: Processes NYC air quality data (`Air_Quality_20250221.csv`) to extract pollutant metrics (e.g., `pm2.5_avg_JJA`, `NO2_val_21Jul2021`).
- 8-energy-usage.ipynb: Processes energy and water consumption data (`Energy_and_Water_Data_Disclosure_for_Local_Law_84_2022__Data_for_Calendar_Year_2021__20250224.csv`) to calculate features like `sum_net_emissions_mtco2e_*`.
- 9-census-tract-2020.ipynb: Processes census tract data (`nyc_census_tracts.csv`) to extract socio-economic metrics (e.g., `population_density`, `median_income`).
- 10-hyperlocal-temperature.ipynb: Processes hyperlocal temperature data (`Hyperlocal_Temperature_Monitoring_20250311.csv`) to extract temperature statistics (e.g., `_AvgTemp_6`, `_UHI_8`).
- 11-street-condition.ipynb: Processes street pavement data (`StreetAssessmentRating`) to extract features like pavement width and ratings (e.g., `Width`, `Rating_B`).
- 12-traffic-movement.ipynb: Processes traffic volume data (`Automated_Traffic_Volume_Counts_20250319.csv`) to calculate traffic volume statistics (e.g., `Traffic_Volume_Avg`, `Traffic_Volume_med`).
- 13-air-quality-airnow.ipynb: Processes AirNow air quality data (`AQ`) to extract additional pollutant metrics.
- 14-elevation.ipynb: Uses the `pyhigh` package to assign elevation values (`elevation`) to training/validation datasets based on geographical coordinates.
- 15-globalwind-maps.ipynb: Processes Global Wind Atlas data (`USA_air-density_10m.tif`, `USA_power-density_10m.tif`) to extract features like `air_density` and `power_density`.
- 16-nclimgrid.ipynb: Processes NOAA nClimGrid data (`nclimgrid-monthly-202107.tif`) to extract climate bands (e.g., `nclimgrid_band1` to `nclimgrid_band4`).
- 17-osmnx.ipynb: Uses OSMnx to extract OpenStreetMap features like `dist_to_road`, `roads_ratio_*`, `parks_ratio_*`, etc.
- Model_development.ipynb: Implements the machine learning pipeline, including preprocessing, feature selection, model training, hyperparameter optimization, and prediction generation.
- Training_data.csv: The original training dataset provided for the EY Data Science Challenge, containing latitude, longitude, and target UHI Index values.
- Validation_data.csv: The original validation dataset provided for the EY Data Science Challenge, containing latitude and longitude for prediction.
- Building_Footprint.kml: KML file containing building footprint data used for calculating building counts and density metrics.
- [Building Footprints_20250222.geojson](Building%20Footprints_20250222.geojson): GeoJSON file containing additional building footprint data.
- NY_Mesonet_Weather.xlsx: Excel file containing weather data from NY Mesonet, used to extract features like `air_temp_surface` and `wind_speed`.
- 2015_Street_Tree_Census_-_Tree_Data_20250221.csv: CSV file containing street tree census data from NYC Open Data, used for tree-related features.
- Air_Quality_20250221.csv: CSV file containing NYC air quality data, used for pollutant metrics.
- AQ: Directory containing AirNow air quality data, used for additional pollutant metrics.
- Airquality_Unique_geocode.xlsx: Excel file containing geocoded air quality data, used for spatial joining.
- NYC_Cooling_Tower_Registrations_20250224.csv: CSV file containing cooling tower registration data, used for tower count features.
- Energy_and_Water_Data_Disclosure_for_Local_Law_84_2022__Data_for_Calendar_Year_2021__20250224.csv: CSV file containing energy and water consumption data, used for energy-related features.
- nyc_census_tracts.csv: CSV file containing 2020 census tract data, used for socio-economic metrics.
- census_block_loc.csv: CSV file containing census block locations, used for additional spatial context.
- Hyperlocal_Temperature_Monitoring_20250311.csv: CSV file containing hyperlocal temperature data, used for temperature statistics.
- USA_air-density_10m.tif: TIFF file containing air density data from the Global Wind Atlas, used for the `air_density` feature.
- USA_power-density_10m.tif: TIFF file containing power density data from the Global Wind Atlas, used for the `power_density` feature.
- USA_wind-speed_10m.tif: TIFF file containing wind speed data from the Global Wind Atlas, used for wind-related features.
- StreetAssessmentRating: Directory containing street pavement rating data, used for pavement features.
- nclimgrid-monthly-202107.tif: TIFF file containing NOAA nClimGrid monthly data, used for climate bands.
- Automated_Traffic_Volume_Counts_20250319.csv: CSV file containing traffic volume data, used for traffic-related features.
- Sentinel_3: Directory containing Sentinel-3 satellite data TIFF files, used for brightness temperature and reflectance features.
- Sentinel2: Directory containing Sentinel-2 satellite data TIFF files, used for indices like NDVI, NDWI, etc.
- LSAT_8_221022: Directory containing Landsat 8 satellite data TIFF files, used for thermal and reflectance features.
- Landsat_LST.tiff: TIFF file containing Landsat Land Surface Temperature (LST) data, used for `lst_value`.
- Landsat_NDVI.tiff: TIFF file containing Landsat NDVI data, used for `lst_value_ndvi`.
- S2_DATA.tiff: TIFF file containing Sentinel-2 data, used for `s2_value`.
- Landsat_LST.ipynb: Notebook for processing Landsat LST data (likely a draft or alternative script).
- nyclion_25a: Directory containing NYC LION street centerline data, potentially used for road-related features.
- ExtraTrees_submission_20250320_002028.csv: Final submission file with predicted UHI Index values using the ExtraTreesRegressor model.
- blend_weights_20250320_002028.csv: CSV file containing blending weights for the model blending approach (ExtraTrees, XGBoost, LightGBM).
- et_feature_importance_20250320_002028.csv: CSV file containing feature importance scores from the ExtraTrees model.
- extratrees_params_20250320_002028.csv: CSV file containing the best hyperparameters for the ExtraTrees model.
- feature_stats_20250320_002028.csv: CSV file containing feature statistics (mean, std, missing percentage) for analysis.
- lgb_feature_importance_20250320_002028.csv: CSV file containing feature importance scores from the LightGBM model.
- lightgbm_params_20250320_002028.csv: CSV file containing the best hyperparameters for the LightGBM model.
- selected_features_20250320_002028.csv: CSV file listing the features selected by the ExtraTrees feature selection step.
- xgb_feature_importance_20250320_002028.csv: CSV file containing feature importance scores from the XGBoost model.
- xgboost_params_20250320_002028.csv: CSV file containing the best hyperparameters for the XGBoost model.
- LICENSE: The GPL-3.0 license file for the repository.
- README.md: The main README file (this document) providing an overview of the project.
- EYDS_Flowchart-2025-03-21-231135.svg: SVG file containing a flowchart of the EY Data Science Challenge workflow.
- Version.docx: Document containing version history or additional notes about the project.
This project demonstrates a comprehensive approach to UHI prediction, integrating diverse open-source datasets and applying advanced feature engineering and machine learning techniques. The engineered features capture the multifaceted nature of urban environments, while the machine learning pipeline ensures robust predictions. Future improvements could include incorporating more recent data, exploring deep learning models, or addressing spatial autocorrelation more explicitly.
Feel free to explore the datasets, review the detailed feature engineering steps, and examine the model training and validation process. For any questions, please open an issue or contact me at [[email protected]].