This repository contains the analysis pipeline for the City Segment Morphological Deprivation (CSMD) model, associated with an accepted-in-principle paper at Nature Cities.
The CSMD model maps morphologically deprived city segments across cities in Africa, Asia, and Latin America and the Caribbean, using City Segments v1 (CIESIN), IDEABench v2 labels, and Random Forest modelling.
Repository status: Accepted in principle at Nature Cities. The repository is being prepared for final publication. Large processed outputs (RF model, prediction GeoPackages, derived datasets) are hosted on Zenodo; raw source datasets remain with their original providers.
| Step | Description |
|---|---|
| A. Preprocessing & labelling | Standardise city-segment data; assign DUA-overlap training labels from IDEABench v2 |
| B. RF training & LOCO validation | VSURF variable selection; Random Forest training; leave-one-city-out validation |
| C. Global prediction / application | Apply final RF model to 5 000+ cities across the Global South |
| D. Comparative alignment | Contextual triangulation against SSI, Million Neighborhoods, and WRI Urban Land Use |
| E. Manuscript figures & tables | Generate all published figures and summary tables |
| F. Revision 2 coverage & omission analysis | UCDB / GHS-POP coverage quantification in response to reviewer requests |
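The leave-one-city-out (LOCO) validation in step B can be illustrated with a minimal, dependency-free sketch. It only shows the splitting logic: each city is held out once while the remaining cities form the training set. The function name `leave_one_city_out` and the placeholder city identifiers are hypothetical and are not taken from the repository scripts, which may implement the split differently.

```python
from collections import defaultdict


def leave_one_city_out(city_ids):
    """Yield (held_out_city, train_idx, test_idx), holding out one city per fold."""
    # Group row indices by the city each segment belongs to.
    by_city = defaultdict(list)
    for idx, city in enumerate(city_ids):
        by_city[city].append(idx)

    # One fold per city: that city is the test set, all others are training.
    for held_out in sorted(by_city):
        test_idx = by_city[held_out]
        train_idx = sorted(
            i for city, rows in by_city.items() if city != held_out for i in rows
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy input: one city identifier per segment row.
cities = ["city_a", "city_a", "city_b", "city_c", "city_b"]
folds = list(leave_one_city_out(cities))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by city rather than by random rows keeps all segments of a held-out city out of training, so each fold measures how well the model transfers to a city it has not seen.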
citysegmentdeprivation/
├── 1_preprocessing/ # City-segment standardisation and label creation
├── 2_modelling/
│ ├── 01_training/ # VSURF, RF training, LOCO validation
│ └── 02_application/ # Global RF application and city-level summaries
├── 3_comparitive_analysis/ # SSI, MN, WRI comparison scripts and outputs
│ # (note: folder name retains historical spelling)
├── 4_Figures_Tables/ # Manuscript figure and summary-table notebooks
├── notebooks/
│ └── revision2_coverage/ # Post-revision UCDB/GHS-POP coverage notebooks
├── outputs/
│ ├── tables/revision2/ # Final and intermediate omission CSVs
│ └── figures/revision2/ # Revised coverage/omission figures
├── docs/ # Extended documentation (model decisions, data sources, etc.)
├── environment/ # Conda environment, pip requirements, R packages
├── zenodo/ # Zenodo deposit manifest
└── data/ # Small committed reference files
1_preprocessing/ — Prepares standardised city-segment features and benchmark-labelled training CSVs from IDEABench v2.
2_modelling/ — RF training (01_training/) and global application to 5 000+ cities (02_application/). City-level summary statistics and the final city deprivation CSV are committed under 02_application/summary_statistics/.
3_comparitive_analysis/ — Contextual alignment with SSI, Million Neighborhoods (MN), and WRI Urban Land Use datasets. The folder name preserves the historical spelling used throughout the project.
4_Figures_Tables/ — Notebooks that generate all manuscript figures and global summary tables from committed intermediate CSVs.
notebooks/revision2_coverage/ — Five notebooks added in Revision 2 to quantify UCDB/GHS-POP coverage and omissions. See notebooks/revision2_coverage/README.md for execution order and data requirements.
outputs/ — Final committed outputs: figures (outputs/figures/revision2/) and tables (outputs/tables/revision2/). Intermediate joins are under outputs/tables/revision2/intermediate/.
docs/ — Extended documentation including predictor definitions, model decisions, data-source citations, and a code-to-figure map.
environment/ — Reproducible environment files (environment.yml, requirements.txt, r_packages.R).
zenodo/ — Manifest of files deposited on Zenodo (DOI: 10.5281/zenodo.20486977).
Conda is recommended, particularly for geospatial dependencies (geopandas, rasterio, fiona).
    conda env create -f environment/environment.yml
    conda activate csmd

Optional pip fallback (may require manual installation of geospatial system libraries):

    pip install -r environment/requirements.txt

R packages (for VSURF variable selection):

    Rscript environment/r_packages.R

| Property | Value |
|---|---|
| Training labels | IDEABench v2, 8 cities |
| Label rule | DUA spatial overlap ≥ 0.30 |
| Final predictors (8) | i5_par_area, i1_pop_area, B_AVG_SEG, i9_roads_par, i6_paru_area, PARU_A_SEG, B_CV_SEG, REGION_CODE / REG1_GHSL |
| Validation | Leave-one-city-out (LOCO) |
| Decision threshold | p(DUA) ≥ 0.40 |
| Comparative analyses | Contextual alignment / triangulation with SSI, MN, WRI — not strict ground-truth validation |
See docs/predictor_definitions.md and docs/model_decisions.md for full detail.
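The two cut-offs in the table above act at different stages: the label rule turns a DUA overlap value into a training label, and the decision threshold turns a predicted probability into a class. The sketch below shows both as plain comparisons. It assumes the overlap is expressed as a fraction between 0 and 1, which this README does not state explicitly, and the function names are hypothetical rather than taken from the repository.

```python
# Values from the model summary table in this README.
LABEL_OVERLAP_MIN = 0.30   # label rule: DUA spatial overlap >= 0.30
DECISION_THRESHOLD = 0.40  # decision threshold: p(DUA) >= 0.40


def training_label(dua_overlap_fraction: float) -> int:
    """Return 1 (DUA) if the overlap meets the label rule, else 0.

    Assumption: overlap is a fraction in [0, 1].
    """
    return int(dua_overlap_fraction >= LABEL_OVERLAP_MIN)


def predicted_class(p_dua: float) -> int:
    """Return 1 (DUA) if the predicted probability meets the decision threshold, else 0."""
    return int(p_dua >= DECISION_THRESHOLD)


# Hypothetical example values for two segments.
print(training_label(0.35), training_label(0.10))
print(predicted_class(0.42), predicted_class(0.39))
```

Both comparisons are inclusive (≥), so a segment sitting exactly on a cut-off is treated as DUA.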
| What | Where |
|---|---|
| Final CSVs, selected outputs, scripts, notebooks, documentation | This GitHub repository |
| RF model (.joblib), prediction GeoPackages, derived UCDB+GHS-POP GPKG, large comparison outputs | Zenodo DOI: 10.5281/zenodo.20486977 |
| GHS Urban Centre Database (UCDB) 2019 V1.2 | JRC / GHSL official download |
| GHS-POP R2023A 2025 epoch, 100 m | JRC / GHSL official download |
Raw rasters, per-country prediction GeoPackages, model binaries, and other large files are excluded from the repository by .gitignore.
The five external input datasets below must be obtained from their original providers. They are not included in this repository or the project Zenodo package.
| Dataset | Purpose | Access | Local path |
|---|---|---|---|
| City Segments v1 | Segment polygons and built-environment predictors for preprocessing, model application, and figures | Harvard Dataverse — DOI: 10.7910/DVN/XLRSF0 | data_external/city_segments/ |
| IDEABench | Reference deprived/non-deprived labels for the eight benchmark training cities | DANS DataStation — DOI: 10.17026/PT/X4NJII (access conditions apply) | data_external/ideabench/ |
| Slum Severity Index (SSI) | External comparison product for service-related deprivation in sub-Saharan Africa | Zenodo — DOI: 10.5281/zenodo.14998570 | data_external/ssi_raw/ |
| Million Neighborhoods (MN) | External comparison product for building-to-street access complexity in sub-Saharan Africa | millionneighborhoods.africa/download | data_external/mn_raw/ |
| WRI Urban Land Use dataset | External comparison product for intra-urban land-use classes; informal subdivision and atomistic classes used for WRI comparison | GEE asset: projects/wri-datalab/urban_land_use/V1 | data_external/wri_raw/ |
See docs/data_availability.md for full dataset citations, download links, and local path conventions.
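Before running the pipeline, it can help to confirm that the five external datasets are in place. The sketch below checks the local paths listed in the table above and reports any folder that is missing or empty. The helper `missing_inputs` is hypothetical and not part of the repository; it assumes the paths are relative to the repository root and does not validate file contents.

```python
from pathlib import Path

# Local paths as listed in the external-datasets table of this README.
EXPECTED_DIRS = {
    "City Segments v1": "data_external/city_segments",
    "IDEABench": "data_external/ideabench",
    "Slum Severity Index (SSI)": "data_external/ssi_raw",
    "Million Neighborhoods (MN)": "data_external/mn_raw",
    "WRI Urban Land Use": "data_external/wri_raw",
}


def missing_inputs(repo_root="."):
    """Return names of datasets whose folder is absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for name, rel_path in EXPECTED_DIRS.items():
        folder = root / rel_path
        # Treat a folder with no entries as missing: nothing was downloaded into it.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_inputs():
        print(f"Missing or empty: {name} ({EXPECTED_DIRS[name]})")
```

Run from the repository root, the script prints nothing when all five folders contain at least one file.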
Full raw-to-final reproduction requires downloading external datasets from their original providers and running computationally intensive steps (rasterisation over the global 100 m GHS-POP grid, RF global application across 5 000+ cities). This is feasible but not a one-command process.
Paper-output reproduction is more accessible: download the Zenodo deposit, place processed files in the expected locations, and run the figure/table notebooks directly from committed intermediate CSVs.
Revision 2 coverage notebooks use a data_external/ convention for large external inputs. See notebooks/revision2_coverage/README.md for the directory layout and setup instructions.
| Document | Contents |
|---|---|
| docs/code_figure_map.md | Maps each manuscript figure/table to the notebook or script that generates it |
| docs/model_decisions.md | Rationale for modelling choices (threshold, predictors, validation) |
| docs/predictor_definitions.md | Definition and units of all RF predictors |
| docs/data_availability.md | Dataset citations, download URLs, and licence notes |
| docs/coverage_and_omissions.md | Coverage and omission methodology for the Revision 2 analysis |
| notebooks/revision2_coverage/README.md | Execution order and data requirements for coverage notebooks |
If you use or build upon this work, please cite the Zenodo deposit and, once published, the journal article.
Data deposit: Veeravalli, S. G. (2026). A Global, Standardized City Segment Morphological Deprivation (CSMD) Model: Preprocessing, Training, Predictions, and Cross-Dataset Comparisons (Version v4) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.20486977
Paper: Citation will be added after publication. Currently accepted in principle at Nature Cities.
This work is supported by:
- FORMAS (Swedish Research Council for Sustainable Development), project DEPRIMAP (2023-01210) — https://sola.kau.se/deprimap/
- NAISS (National Academic Infrastructure for Supercomputing in Sweden), partially funded by the Swedish Research Council through grant agreement no. 2022-06725 — computation for model training
- CIESIN for City Segments v1 (Harvard Dataverse, DOI: 10.7910/DVN/XLRSF0) and IDEAtlas for IDEABench (DOI: 10.17026/PT/X4NJII; paper DOI: 10.1016/j.rse.2026.115272)
