sumit9318/datasciencecoursera

Logistic Regression — Landslide Susceptibility Mapping

A Python pipeline for producing landslide susceptibility maps from GIS conditioning factors (slope, elevation, lithology, distance to streams, …) and a landslide inventory. The model uses logistic regression and reports both statistical inference (statsmodels: coefficients, p-values, odds ratios, McFadden pseudo-R²) and predictive performance (scikit-learn: ROC/AUC, cross-validation, confusion matrix). Output is a GeoTIFF probability surface plus a 5-class susceptibility map.

Installation

pip install -r requirements.txt

System dependency: rasterio, geopandas, and fiona require a working GDAL installation (e.g. apt install gdal-bin libgdal-dev on Debian/Ubuntu, or use the conda-forge channel: conda install -c conda-forge gdal rasterio geopandas).

Quick start (synthetic demo)

python main.py --demo

This generates a small synthetic study area under data/demo/ (200 × 200 pixels, 7 conditioning factors, 150 landslide presence points) and runs the full pipeline end-to-end. It produces:

  • outputs/susceptibility.tif — landslide probability per pixel (0–1)
  • outputs/susceptibility_classes.tif — 5 classes (Very Low … Very High)
  • outputs/stats_report.txt — coefficients, p-values, odds ratios, pseudo-R²
  • figures/roc.png — ROC curve with AUC
  • figures/feature_importance.png — standardized coefficient bar chart

On the synthetic data the model typically achieves AUC ≈ 0.85–0.95.

Real-data usage

python main.py \
    --rasters /path/to/factors_folder \
    --inventory /path/to/landslides.shp \
    --categorical lithology,landuse \
    --buffer 500 \
    --classes 5

Input data requirements

  • Conditioning factor rasters — one GeoTIFF per factor in a single folder. All rasters must share the same CRS, transform, width, and height. The filename (without extension) becomes the factor name.
  • Landslide inventory — shapefile or GeoJSON of points or polygons (polygons are converted to centroids). Must have a defined CRS.
  • Categorical factors — pass via --categorical as a comma-separated list of factor names (matching the GeoTIFF basenames). These are one-hot encoded; remaining factors are standardized.
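The encoding described in the last bullet can be illustrated with a minimal NumPy sketch. This is not the code in features.py: the function name encode_factors is hypothetical, it handles a single categorical factor, and dropping the first level so that it serves as the reference class is an assumption.

```python
import numpy as np


def encode_factors(numeric, categorical):
    """Standardize numeric columns and one-hot encode one categorical column.

    numeric:     float array of shape (n_samples, n_numeric)
    categorical: int array of shape (n_samples,) holding class codes
    Returns the design matrix and the sorted category levels.
    """
    mean = numeric.mean(axis=0)
    std = numeric.std(axis=0)
    std[std == 0] = 1.0  # constant factors: avoid division by zero
    scaled = (numeric - mean) / std

    levels = np.unique(categorical)
    # Assumption: the first level is dropped and acts as the reference class.
    onehot = (categorical[:, None] == levels[1:][None, :]).astype(float)
    return np.hstack([scaled, onehot]), levels


# Three samples: slope-like values, a constant factor, and lithology codes.
numeric = np.array([[10.0, 1.0], [20.0, 1.0], [30.0, 1.0]])
lithology = np.array([3, 1, 2])
X, levels = encode_factors(numeric, lithology)
```

Here X has two standardized columns followed by one indicator column per non-reference lithology code.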

CLI options

Flag             Default     Purpose
--demo           off         Generate synthetic data and run end-to-end
--rasters                    Folder of conditioning factor GeoTIFFs
--inventory                  Path to landslide inventory file
--categorical    ""          Comma-separated names of categorical factors
--out-dir        outputs/    Where to write GeoTIFFs and stats report
--fig-dir        figures/    Where to write ROC + importance plots
--buffer         500         Exclusion buffer (CRS units) for absence sampling
--test-size      0.25        Held-out fraction
--cv-folds       5           Stratified k-fold AUC cross-validation
--classes        5           Number of susceptibility classes
--class-method   quantile    quantile or jenks (Natural Breaks)
--seed           42          RNG seed
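For readers who want to see how the flags above could map onto argparse, here is a hedged sketch. Flag names and defaults are taken from the table; the argument types (for example float for --buffer) and the helper name build_parser are assumptions, and the actual main.py may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (types are assumed)."""
    p = argparse.ArgumentParser(description="Landslide susceptibility mapping")
    p.add_argument("--demo", action="store_true",
                   help="Generate synthetic data and run end-to-end")
    p.add_argument("--rasters", help="Folder of conditioning factor GeoTIFFs")
    p.add_argument("--inventory", help="Path to landslide inventory file")
    p.add_argument("--categorical", default="",
                   help="Comma-separated names of categorical factors")
    p.add_argument("--out-dir", default="outputs/")
    p.add_argument("--fig-dir", default="figures/")
    p.add_argument("--buffer", type=float, default=500)
    p.add_argument("--test-size", type=float, default=0.25)
    p.add_argument("--cv-folds", type=int, default=5)
    p.add_argument("--classes", type=int, default=5)
    p.add_argument("--class-method", choices=["quantile", "jenks"],
                   default="quantile")
    p.add_argument("--seed", type=int, default=42)
    return p


args = build_parser().parse_args([
    "--rasters", "factors_folder",
    "--inventory", "landslides.shp",
    "--categorical", "lithology,landuse",
])
# Split the comma-separated list, ignoring the empty default.
categorical = [name for name in args.categorical.split(",") if name]
```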

Project layout

.
├── main.py                       # CLI entry point
├── requirements.txt
├── src/landslide/
│   ├── io_raster.py              # raster stack load / write / nodata mask
│   ├── io_inventory.py           # inventory load + CRS reprojection
│   ├── sampling.py               # presence/absence point sampling
│   ├── features.py               # extraction, OHE, scaling, train/test split
│   ├── model_stats.py            # statsmodels Logit (inference)
│   ├── model_sklearn.py          # sklearn LR (prediction + CV)
│   ├── predict.py                # pixel-wise prediction → GeoTIFF
│   ├── classify.py               # quantile / Jenks susceptibility classes
│   ├── plots.py                  # ROC + feature importance plots
│   └── synthetic.py              # demo data generator
├── data/demo/                    # written by `--demo`
├── figures/                      # ROC + importance plots
└── outputs/                      # susceptibility.tif, classes.tif, report

How it works

  1. Load every GeoTIFF in --rasters into a single (bands, H, W) stack and validate alignment.
  2. Load the inventory and reproject to the raster CRS; convert polygon geometries to centroids.
  3. Sample absence points uniformly from valid pixels at a 1:1 ratio with presences, excluding a buffer around presence locations (default 500 CRS units).
  4. Extract raster values at every sample location; drop rows that hit any nodata pixel.
  5. Encode categorical factors with one-hot encoding and standardize numerics; produce a stratified 75/25 train/test split.
  6. Fit statsmodels.Logit on the training set → write stats_report.txt with the full summary, tidy coefficient table, odds-ratios, and McFadden pseudo-R².
  7. Fit sklearn.LogisticRegression (L2-regularized) on the training set; report 5-fold cross-validated AUC together with held-out accuracy, AUC, and confusion matrix.
  8. Plot ROC curve and standardized-coefficient bar chart.
  9. Predict landslide probability for every valid pixel in the study area and write susceptibility.tif.
  10. Classify the surface into susceptibility levels (5 by default, using quantile or Jenks Natural Breaks) and write susceptibility_classes.tif.
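The quantile option in step 10 can be sketched with NumPy as follows. This is an illustrative sketch rather than the contents of classify.py: the function name quantile_classes, the use of NaN for nodata pixels, and the class code 0 for nodata are all assumptions. Jenks Natural Breaks is not shown.

```python
import numpy as np


def quantile_classes(prob, n_classes=5, nodata=0):
    """Assign classes 1..n_classes using quantile breaks of the valid pixels.

    prob: float array of probabilities, with NaN marking nodata pixels.
    Returns the class raster and the interior break values.
    """
    valid = ~np.isnan(prob)
    # Interior quantiles, e.g. 0.2, 0.4, 0.6, 0.8 for five classes.
    qs = np.linspace(0.0, 1.0, n_classes + 1)[1:-1]
    breaks = np.quantile(prob[valid], qs)
    classes = np.full(prob.shape, nodata, dtype=np.uint8)
    # right=True puts a value equal to a break into the lower class.
    classes[valid] = np.digitize(prob[valid], breaks, right=True) + 1
    return classes, breaks


# Toy probability surface with one nodata pixel.
prob = np.linspace(0.0, 1.0, 100).reshape(10, 10)
prob[0, 0] = np.nan
classes, breaks = quantile_classes(prob)
```

Because the breaks come from the study area's own probability distribution, each class covers roughly the same number of pixels regardless of the absolute probabilities.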

Interpretation notes

  • Odds ratios (exp(coef)) describe how the odds of a landslide change per 1-σ increase in a standardized factor (or the effect of a category vs. the reference class).
  • AUC is the primary skill metric: 0.7–0.8 = acceptable, 0.8–0.9 = excellent, > 0.9 = outstanding (Hosmer & Lemeshow).
  • Susceptibility classes are relative — they rank pixels within a single study area and should not be compared across areas without recalibration.
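To make the odds-ratio bullet and the McFadden pseudo-R² in the stats report concrete, the following sketch computes both from first principles. The helper names are hypothetical and the numbers are toy values, not results from this pipeline; the pipeline itself takes these quantities from statsmodels.

```python
import numpy as np


def odds_ratios(coefs):
    """Odds ratio per unit (here: 1-sigma) increase is exp(coefficient)."""
    return np.exp(np.asarray(coefs, dtype=float))


def log_likelihood(y, p):
    """Bernoulli log-likelihood of labels y under predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def mcfadden_r2(y, p_model):
    """McFadden pseudo-R2 = 1 - LL(model) / LL(intercept-only model)."""
    y = np.asarray(y, dtype=float)
    ll_model = log_likelihood(y, p_model)
    # The intercept-only model predicts the base rate for every sample.
    ll_null = log_likelihood(y, np.full(y.shape, y.mean()))
    return 1.0 - ll_model / ll_null


# A coefficient of ln(2) on a standardized factor doubles the odds per 1 sigma.
ors = odds_ratios([np.log(2.0), 0.0, -np.log(2.0)])

y = [1, 0, 1, 0]
r2 = mcfadden_r2(y, [0.9, 0.1, 0.8, 0.2])
r2_null = mcfadden_r2(y, [0.5, 0.5, 0.5, 0.5])
```

A coefficient of zero gives an odds ratio of 1 (no effect), and a model that predicts only the base rate has a pseudo-R² of 0.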

Limitations

  • Logistic regression assumes linear effects on the log-odds scale and independence between samples — spatial autocorrelation is not modelled.
  • The default 1:1 absence:presence ratio is a common but arbitrary choice; results are sensitive to absence sampling strategy.
  • The synthetic demo data are themselves generated by a logistic model, so the fitted model matches the data-generating process and AUC on demo data overstates real-world performance.
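Since results depend on the absence sampling strategy noted above, it may help to see where the buffer and the absence:presence ratio enter. The sketch below is an assumption-laden illustration, not the code in sampling.py: sample_absences is a hypothetical name, the buffer is expressed in pixels rather than CRS units, and the brute-force distance computation is only suitable for small grids.

```python
import numpy as np


def sample_absences(valid_mask, presence_rc, buffer_px, ratio=1.0, seed=42):
    """Draw absence pixels uniformly from valid pixels outside a buffer.

    valid_mask:  bool array (H, W), True where every factor has data
    presence_rc: int array (n, 2) of presence (row, col) indices
    buffer_px:   exclusion radius in pixels (assumed unit for this sketch)
    ratio:       absences drawn per presence (1.0 gives a 1:1 sample)
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(valid_mask)
    candidates = np.column_stack([rows, cols])
    # Squared distance from each candidate to its nearest presence pixel.
    diff = candidates[:, None, :] - presence_rc[None, :, :]
    nearest_sq = (diff ** 2).sum(axis=2).min(axis=1)
    eligible = candidates[nearest_sq > buffer_px ** 2]
    n_absent = int(round(ratio * len(presence_rc)))
    if n_absent > len(eligible):
        raise ValueError("not enough eligible pixels; reduce buffer or ratio")
    picked = rng.choice(len(eligible), size=n_absent, replace=False)
    return eligible[picked]


valid_mask = np.ones((30, 30), dtype=bool)
valid_mask[:, :2] = False  # two nodata columns
presence_rc = np.array([[5, 5], [20, 20], [10, 25]])
absences = sample_absences(valid_mask, presence_rc, buffer_px=4)
```

Changing buffer_px or ratio changes which pixels can serve as absences, which is one way to probe how sensitive the fitted coefficients are to this choice.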
