
Preprocessed and split version of the SSBH remote sensing dataset for building height estimation, including RGB composites, height maps, and building masks with train/valid/test manifests and ready-to-train scripts.


ahmad-naghavi-ozu/SSBH-RGB-Building-Analysis


RGB Building Height Dataset - Preprocessed SSBH for Deep Learning

This repository provides reproduction scripts and the preprocessed, split dataset derived from the SSBH dataset, focused on RGB-based building height prediction and building segmentation. It uses only the Sentinel-2 RGB bands and includes Jupyter notebooks that turn the raw satellite imagery into analysis-ready data for multi-task deep learning (building height regression + building mask segmentation). The preprocessing pipeline covers smart interpolation of invalid height values, 16-bit RGB enhancement, and quality validation, followed by fixed train/valid/test splits.

Reproduction Workflow

The SSBH preprocessing and splitting can be reproduced end to end with the steps below:

Prerequisites

  1. Download the original SSBH dataset (67 cuts) from the official source
  2. Extract each cut archive and place them in a raw/ folder in this repository directory

Expected Directory Structure

ssbh_prep_split/
├── raw/                           # Extract original SSBH cuts here
│   ├── cut1/                      # Extracted from cut1.zip
│   │   ├── sentinel2_AREA_b2/     # Blue band files
│   │   ├── sentinel2_AREA_b3/     # Green band files
│   │   ├── sentinel2_AREA_b4/     # Red band files
│   │   └── _bhr/                  # Building height files
│   ├── cut2/                      # Extracted from cut2.zip
│   ├── ...
│   └── cut67/                     # Extracted from cut67.zip
├── ssbh_preprocessing.ipynb       # Step 1: Raw → Preprocessed
├── ssbh_prep_split.ipynb          # Step 2: Preprocessed → Train/Valid/Test
└── README.md
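
Before running the notebooks, it can help to confirm that every cut was extracted into raw/. The following is a minimal sketch, not part of the repository: the helper name `find_missing_cuts` is hypothetical, and it assumes the folders are named cut1 through cut67 exactly as in the tree above.

```python
from pathlib import Path

# Assumption: cut folders are named cut1 ... cut67, as in the tree above.
EXPECTED_CUTS = [f"cut{i}" for i in range(1, 68)]


def find_missing_cuts(raw_dir):
    """Return the names of expected cut folders that are absent from raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in EXPECTED_CUTS if not (raw_dir / name).is_dir()]
```

Calling `find_missing_cuts("raw")` from the repository directory returns an empty list when all 67 folders are in place.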

Processing Steps

  1. Preprocessing: Run ssbh_preprocessing.ipynb to create the preprocessed/ folder with RGB, height, and mask data
  2. Splitting: Run ssbh_prep_split.ipynb to create the prep_split/ folder with train/valid/test splits

Output Structure

After running both notebooks, you'll have:

ssbh_prep_split/
├── raw/                    # Your input data
├── preprocessed/           # Generated by preprocessing notebook
│   ├── rgb/               # RGB composite images
│   ├── height/            # Building height rasters
│   └── mask/              # Binary building masks
└── prep_split/            # Generated by split notebook
    ├── train/, valid/, test/  # Each containing rgb/, dsm/, sem/ folders
    └── split_visualization.png

Overview

This repository provides a preprocessed, analysis-ready version of the SSBH dataset with fixed train/valid/test splits, designed for RGB-to-building-height deep learning tasks. The main contribution is the preprocessing pipeline, which transforms Sentinel-2 RGB satellite imagery into ML-ready data for building height regression and building segmentation, with quality checks and consistent formatting.

Original Repository: https://github.com/CSPON2035/SSBH

Key Features:

  • RGB-Focused Approach: Uses only Sentinel-2 RGB bands for building analysis tasks
  • Multi-Task Learning Ready: Simultaneous building height estimation and building mask segmentation
  • Advanced Preprocessing: Smart interpolation, 16-bit RGB enhancement, and quality-validated processing
  • Analysis-Ready Format: All invalid values imputed, consistent data types, optimized storage
  • Professional Splits: 70% Train / 10% Valid / 20% Test with fixed seed reproducibility
  • Coverage: 62 urban centers in China with 67 data cuts from 2019 Sentinel-2 RGB composites (~5,606 samples)

Download

The preprocessed and split dataset is available for download from Google Drive:

🔗 Download SSBH Split Dataset

The download includes all train/valid/test splits with RGB imagery, building height rasters, and binary masks organized in the directory structure described below.

Dataset Structure

_prep_split/
├── train/                 # Training set (70% of samples)
│   ├── rgb/               # RGB composite images (3-band GeoTIFF, uint16)
│   ├── dsm/               # Building height rasters (single-band GeoTIFF, float32)
│   └── sem/               # Binary building masks (single-band GeoTIFF, uint8)
├── valid/                 # Validation set (10% of samples)
│   ├── rgb/               # RGB composite images
│   ├── dsm/               # Building height rasters
│   └── sem/               # Binary building masks
├── test/                  # Testing set (20% of samples)
│   ├── rgb/               # RGB composite images
│   ├── dsm/               # Building height rasters
│   └── sem/               # Binary building masks
├── split_visualization.png # Sample visualization from each split (see Visualization section)
└── README.md              # This file

File Naming Convention

All files follow a consistent naming pattern across modalities:

  • Format: cut{XX}_{tile_id}.tif
  • Example: cut01_0001.tif, cut23_0045.tif

The same base filename appears across all modalities (rgb/, dsm/, sem/) within each split folder.
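
Because the three modalities share a base filename, samples can be paired by name alone. The sketch below illustrates one way to do that. The helper names `parse_tile_name` and `list_triplets` are hypothetical, and the regular expression assumes only what the examples above show (a numeric cut index, an underscore, a tile id, and a .tif extension).

```python
import re
from pathlib import Path

# Pattern for names such as cut01_0001.tif; digit widths are not assumed.
TILE_NAME = re.compile(r"^cut(\d+)_(\w+)\.tif$")


def parse_tile_name(filename):
    """Split 'cut01_0001.tif' into (cut number, tile id); None if it does not match."""
    match = TILE_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def list_triplets(split_dir):
    """Return (rgb, dsm, sem) path triplets for tiles present in all three folders."""
    split_dir = Path(split_dir)
    names = {p.name for p in (split_dir / "rgb").glob("*.tif")}
    names &= {p.name for p in (split_dir / "dsm").glob("*.tif")}
    names &= {p.name for p in (split_dir / "sem").glob("*.tif")}
    return [
        (split_dir / "rgb" / n, split_dir / "dsm" / n, split_dir / "sem" / n)
        for n in sorted(names)
    ]
```

Intersecting the three name sets means a tile with a missing modality is skipped rather than raising an error later during loading.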

Folder Naming Convention: The dsm/ and sem/ naming follows standard remote sensing conventions where building height/mask detection tasks are subcategories of broader Digital Surface Model (DSM) and semantic segmentation tasks respectively. DSM typically refers to elevation models that capture surface heights including buildings, vegetation, and terrain, while semantic segmentation encompasses pixel-wise classification tasks. In this dataset, we focus specifically on the building components of these broader task categories.

Data Processing Pipeline

The dataset underwent comprehensive preprocessing to create analysis-ready data:

RGB Processing: Sentinel-2 optical bands (B4-Red, B3-Green, B2-Blue) were combined into RGB composites. Original reflectance values (0-10000) were normalized to [0,1], enhanced using percentile contrast stretching (2nd-98th percentile), then scaled to uint16 format (0-65535) for efficient storage.

  • Reflectance Normalization: Raw satellite data measures the proportion of electromagnetic radiation reflected back from Earth's surface. Sentinel-2 stores these as integers from 0-10,000 (representing 0-100% reflectance), which are converted to the standard 0-1 range by dividing by the scale factor (10,000).

  • Percentile Stretching: A contrast enhancement technique that improves image visualization by:

    1. Finding the 2nd percentile (very dark values) and 98th percentile (very bright values) for each RGB channel
    2. Remapping these to become the new 0 and 1 (black and white) points
    3. Removing extreme outliers while enhancing contrast using the full dynamic range
    4. Applied per channel to maintain color balance
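
The RGB steps described above can be sketched in NumPy as follows. This is an illustration rather than the notebook's actual code: the function name `enhance_rgb` is hypothetical, percentiles are computed per tile (the source does not say whether they are computed per tile or over a larger extent), and the handling of a flat channel is a choice made for this sketch.

```python
import numpy as np


def enhance_rgb(raw_rgb, low_pct=2.0, high_pct=98.0, scale=10000.0):
    """Normalize reflectance, stretch each channel between two percentiles,
    and store the result as uint16.

    raw_rgb: array of shape (H, W, 3) holding Sentinel-2 values in 0-10000.
    """
    reflectance = np.clip(raw_rgb.astype(np.float32) / scale, 0.0, 1.0)
    stretched = np.empty_like(reflectance)
    for c in range(reflectance.shape[-1]):
        channel = reflectance[..., c]
        lo, hi = np.percentile(channel, [low_pct, high_pct])
        if hi <= lo:
            # Flat channel: nothing to stretch (handling chosen for this sketch).
            stretched[..., c] = 0.0
            continue
        # Values below lo map to 0 and values above hi map to 1.
        stretched[..., c] = np.clip((channel - lo) / (hi - lo), 0.0, 1.0)
    return np.round(stretched * 65535.0).astype(np.uint16)
```

Clipping after the remap is what discards the darkest and brightest 2% of pixels in each channel, so the remaining values occupy the full uint16 range.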

Height Data Processing: Raw building height rasters from crowdsourced floor count data were processed with smart interpolation to handle invalid values. All NoData, NaN, and negative values were successfully imputed using nearest neighbor or griddata methods, ensuring all final height values are non-negative.
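
As an illustration of the nearest-neighbor case, the sketch below fills invalid pixels with the value of the closest valid pixel using SciPy's Euclidean distance transform. It is not the notebook's implementation: the function name `fill_invalid_nearest`, the optional `nodata` argument, and the fallback to 0 m when a tile has no valid pixel are all assumptions made here.

```python
import numpy as np
from scipy import ndimage


def fill_invalid_nearest(height, nodata=None):
    """Replace NaN, negative, and NoData pixels with the nearest valid pixel value."""
    height = np.asarray(height, dtype=np.float32)
    invalid = ~np.isfinite(height) | (height < 0)
    if nodata is not None:
        invalid |= height == nodata
    if not invalid.any():
        return height.copy()
    if invalid.all():
        # No valid pixel to copy from; 0 m is a fallback chosen for this sketch.
        return np.zeros_like(height)
    # For every pixel, the indices of the nearest pixel where `invalid` is False.
    nearest = ndimage.distance_transform_edt(
        invalid, return_distances=False, return_indices=True
    )
    return height[tuple(nearest)]
```

Valid pixels map to themselves, so only the invalid locations change.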

Building Mask Generation: Binary building masks were derived using a 0m height threshold. Connected component filtering (minimum 1 pixel) was applied while preserving original spatial characteristics through direct threshold conversion. The 1-pixel minimum is appropriate given that each pixel covers 100m² (10m × 10m) area, representing a substantial building footprint.
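
A direct threshold conversion of this kind can be sketched in one line of NumPy. The function name `height_to_mask` is hypothetical, and the strict `>` comparison is an assumption: since all heights are non-negative after imputation, a `>=` comparison at 0 m would label every pixel as building.

```python
import numpy as np


def height_to_mask(height, threshold_m=0.0):
    """Return a uint8 mask: 1 where height exceeds the threshold, else 0."""
    return (np.asarray(height) > threshold_m).astype(np.uint8)
```

With a 1-pixel minimum component size, connected component filtering removes nothing, so the threshold alone determines the mask in this sketch.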

Split Configuration and Statistics

The dataset is split into 70% Train / 10% Valid / 20% Test using random seed 42 for reproducibility.

Sample Counts

  • Train: ~3,924 samples (70%)
  • Valid: ~560 samples (10%)
  • Test: ~1,122 samples (20%)
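
A seeded random split with these proportions might look like the sketch below. The function name `split_names` is hypothetical, and the shuffling and rounding scheme are assumptions, so this will not necessarily reproduce the tile assignment of the published split.

```python
import random


def split_names(names, train_frac=0.7, valid_frac=0.1, seed=42):
    """Shuffle names with a fixed seed and cut them into train/valid/test lists."""
    names = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_frac)
    n_valid = int(len(names) * valid_frac)
    return (
        names[:n_train],
        names[n_train:n_train + n_valid],
        names[n_train + n_valid:],
    )
```

With 5,606 names, rounding down for train and valid and assigning the remainder to test gives 3,924 / 560 / 1,122 samples, which agrees with the approximate counts listed above.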

Key Features

  • Geographic Coverage: All 67 data cuts represented across splits
  • Reproducible: Fixed random seed (42) ensures consistent splits
  • Quality Validated: Complete modalities (RGB, height, mask) required for inclusion

Data Specifications

RGB Imagery (*/rgb/ folders)

  • Source: Sentinel-2 bands B4 (red), B3 (green), B2 (blue) from 2019 annual composites
  • Format: 3-band GeoTIFF
  • Data type: uint16 (0-65535 range)
  • Spatial resolution: 10m per pixel
  • Tile size: 256×256 pixels
  • Processing:
    • Reflectance normalization: Raw values (0-10000) converted to physical reflectance (0-1) representing 0-100% surface reflection
    • Percentile contrast stretching: 2nd-98th percentile enhancement applied per channel to improve contrast while preserving color balance
    • Dynamic range optimization: Final scaling to uint16 format for efficient storage
  • Usage: Convert to [0,1] range for model training: rgb_normalized = rgb_data.astype(np.float32) / 65535.0

Height Data (*/dsm/ folders)

  • Source: Building height rasters derived from crowdsourced floor count data
  • Format: Single-band GeoTIFF
  • Data type: float32
  • Units: Meters
  • Spatial resolution: 10m per pixel
  • Tile size: 256×256 pixels
  • Data Range: All values ≥ 0 (no NoData values - all gaps successfully imputed)
  • Processing: Smart interpolation applied to handle original invalid values, ensuring all final height values are non-negative

Building Masks (*/sem/ folders)

  • Source: Binary masks derived from processed height data using threshold-based segmentation
  • Format: Single-band GeoTIFF
  • Data type: uint8 (0: non-building, 1: building)
  • Spatial resolution: 10m per pixel
  • Tile size: 256×256 pixels
  • Processing: 0m height threshold applied with connected component filtering (minimum 1 pixel). No morphological operations applied as direct threshold conversion proved more suitable after testing various morphological approaches.
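
The specifications above can be turned into a quick sanity check on loaded arrays. This is a minimal sketch with a hypothetical helper name (`check_triplet`). It assumes the RGB array is in channel-last layout (H, W, 3); if your reader returns (3, H, W), transpose it first.

```python
import numpy as np


def check_triplet(rgb, dsm, sem, tile_size=256):
    """Return a list of problems found in one (rgb, dsm, sem) sample; empty if none."""
    problems = []
    if rgb.dtype != np.uint16 or rgb.shape != (tile_size, tile_size, 3):
        problems.append(f"rgb: unexpected dtype/shape {rgb.dtype} {rgb.shape}")
    if dsm.dtype != np.float32 or dsm.shape != (tile_size, tile_size):
        problems.append(f"dsm: unexpected dtype/shape {dsm.dtype} {dsm.shape}")
    elif not np.isfinite(dsm).all() or (dsm < 0).any():
        problems.append("dsm: contains NaN/inf or negative heights")
    if sem.dtype != np.uint8 or sem.shape != (tile_size, tile_size):
        problems.append(f"sem: unexpected dtype/shape {sem.dtype} {sem.shape}")
    elif not np.isin(sem, (0, 1)).all():
        problems.append("sem: values other than 0 and 1")
    return problems
```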

Data Usage Guidelines

📢 SSBH Dataset Loading Notes

⚠️ Special handling needed for SSBH RGB files (16-bit TIFF):

Primary method (faster):

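import cv2
import numpy as np

# path: location of one rgb/ tile; cv2.imread returns None if the file cannot be decoded
rgb = cv2.imread(path, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; convert to RGB
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0  # 16-bit normalization to [0, 1]

  • CV2 Flags: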
    • IMREAD_ANYDEPTH: Preserves original bit depth (16-bit)
    • IMREAD_COLOR: Loads in color mode (3 channels)
    • | operator: Combines both flags (bitwise OR)

Fallback:

import numpy as np
import tifffile

rgb = tifffile.imread(path)  # no BGR→RGB conversion needed; bands are read as stored
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0

✅ DSM & SEM files: Load normally with PIL

import numpy as np
from PIL import Image
dsm = np.array(Image.open(dsm_path)).astype(np.float32)
sem = np.array(Image.open(sem_path)).astype(np.uint8)
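
Once the three arrays are loaded, they usually need to be brought into the layout a training framework expects. The sketch below is one possible conversion. The function name `prepare_sample` is hypothetical, and the channel-first (C, H, W) output is an assumption that follows the common PyTorch convention, which the source does not prescribe.

```python
import numpy as np


def prepare_sample(rgb_u16, dsm, sem):
    """Convert loaded arrays to float32 model inputs/targets in channel-first layout."""
    image = rgb_u16.astype(np.float32) / 65535.0          # (H, W, 3) in [0, 1]
    image = np.transpose(image, (2, 0, 1))                # (3, H, W)
    height = dsm.astype(np.float32)[np.newaxis, ...]      # (1, H, W), meters
    mask = (sem > 0).astype(np.float32)[np.newaxis, ...]  # (1, H, W), 0.0 or 1.0
    return image, height, mask
```

The height target is left in meters here. Whether to rescale it for training is a modeling decision outside the scope of this sketch.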

🌍 Note: SSBH is an example of true geospatial 16-bit RGB TIFFs [0,65535], unlike many other typical remote sensing datasets that may use TIFF format but are in fact represented as standard [0,255] values.

⚑ Learning Rate Impact: The necessity to normalize 16-bit RGB imagery to [0,1] range will most probably impact your typical learning rate. You may need a significantly smaller rate compared to standard uint8 RGB datasets in the [0,255] range.

Training and Evaluation Guidelines

Recommended Approach

  • Training: Use train split (70%) for model training
  • Validation: Use valid split (10%) for hyperparameter tuning and model selection
  • Testing: Use test split (20%) for final performance evaluation only

Evaluation Metrics

For Height Estimation: RMSE, MAE (in meters), R² coefficient
For Building Segmentation: IoU, F1 Score, Precision, Recall
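
For reference, the metrics listed above can be computed with plain NumPy as sketched below. The function names `height_metrics` and `mask_metrics` are hypothetical, and returning NaN when a denominator is zero is a choice made for this sketch.

```python
import numpy as np


def height_metrics(pred, target):
    """RMSE and MAE (same units as the inputs) and the R² coefficient."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    err = pred - target
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def mask_metrics(pred, target):
    """IoU, precision, recall, and F1 for binary masks (nonzero = building)."""
    pred = np.asarray(pred).astype(bool).ravel()
    target = np.asarray(target).astype(bool).ravel()
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

These functions score whatever arrays they receive, so averaging per tile or pooling pixels over a whole split will give different numbers. It is worth stating which one you report.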

Data Characteristics to Consider

  • Geographic scope: Chinese urban areas (62 cities, 67 data cuts)
  • Temporal consistency: 2019 Sentinel-2 composites throughout
  • Height accuracy: Based on crowdsourced building floor count data
  • Spatial resolution: 10m per pixel, 256×256 pixel tiles
  • Building focus: Original height data represents buildings exclusively

Visualization

Split Sample Visualization

A comprehensive visualization shows samples from each split with all modalities:

[Image: split_visualization.png]

The visualization displays:

  • 3 samples from each split (train/valid/test)
  • All modalities (RGB, Height, Building Mask) for each sample
  • Clear visual separation between different samples
  • Sample counts for each split labeled

Important Processing Notes

Key Technical Considerations

  • RGB Data: Stored in uint16 format, requires normalization to [0,1] for training. May require smaller learning rates than standard uint8 data.
  • Reflectance Values: Represent true physical surface reflectance (0-100%) derived from Sentinel-2's calibrated measurements, providing more meaningful spectral information than simple pixel intensities.
  • Contrast Enhancement: Percentile stretching (2nd-98th percentile) enhances visual contrast while avoiding saturation from extreme outliers, making imagery more suitable for both visualization and model training.
  • Height Data: All values are non-negative (≥0m) after successful interpolation of original invalid values.
  • Building Masks: Direct threshold conversion (0m) with no morphological operations, preserving original spatial characteristics.
  • Spatial Context: Each 256×256 pixel tile covers 2.56km × 2.56km at 10m resolution, suitable for building-level analysis.

Known Limitations

  • Height accuracy depends on crowdsourced floor count data quality
  • Geographic scope limited to Chinese urban areas
  • 10m pixel size limits detection of very small buildings
  • RGB-only spectral information may limit some applications

Citation

Paper Link: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/13506/135061G/SSBH--a-remote-sensing-building-height-dataset-for-deep/10.1117/12.3057532.full

If you use this split dataset, please cite the original SSBH paper:

@inproceedings{ma2025ssbh,
  title={SSBH: a remote sensing building height dataset for deep learning},
  author={Ma, Yao and Wang, Hao and Cao, Changhao},
  booktitle={Sixth International Conference on Geoscience and Remote Sensing Mapping (GRSM 2024)},
  volume={13506},
  pages={388--394},
  year={2025},
  organization={SPIE}
}

Processing History

  • Data Source: SSBH dataset covering 62 urban centers in China (67 data cuts) with 2019 Sentinel-2 composites
  • Preprocessing: RGB enhancement, height interpolation (100% success rate), and binary mask generation
  • Splitting: Random 70-10-20 split with seed 42 for reproducibility
  • Total Samples: ~5,606 samples (after validation and quality checks)
  • Version: v1.0 (2025)