RGB Building Height Dataset - Preprocessed SSBH for Deep Learning

This repository provides reproduction scripts and a preprocessed, pre-split dataset derived from the SSBH dataset for RGB-based building height prediction and building segmentation. It uses only the Sentinel-2 RGB bands. Two Jupyter notebooks turn the raw satellite imagery into analysis-ready tiles for multi-task deep learning (building height regression + building mask segmentation). The preprocessing pipeline interpolates invalid height values, applies a 16-bit RGB contrast enhancement, and validates each sample; the split step produces fixed-seed train/valid/test splits.

Reproduction Workflow

The preprocessing and splitting of the SSBH dataset can be reproduced from the raw data as follows:

Prerequisites

Download the original SSBH dataset (67 cuts) from the official source
Extract each cut archive and place them in a raw/ folder in this repository directory

Expected Directory Structure

ssbh_prep_split/
├── raw/                           # Extract original SSBH cuts here
│   ├── cut1/                      # Extracted from cut1.zip
│   │   ├── sentinel2_AREA_b2/     # Blue band files
│   │   ├── sentinel2_AREA_b3/     # Green band files  
│   │   ├── sentinel2_AREA_b4/     # Red band files
│   │   └── _bhr/                  # Building height files
│   ├── cut2/                      # Extracted from cut2.zip
│   ├── ...
│   └── cut67/                     # Extracted from cut67.zip
├── ssbh_preprocessing.ipynb       # Step 1: Raw → Preprocessed
├── ssbh_prep_split.ipynb         # Step 2: Preprocessed → Train/Valid/Test
└── README.md
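
Before running the notebooks, it can help to confirm that every cut was extracted into raw/. The following is a minimal sketch, assuming the folders are named cut1 to cut67 as in the tree above; the helper name find_missing_cuts is hypothetical and is not part of the notebooks.

```python
from pathlib import Path


def find_missing_cuts(raw_dir, n_cuts=67):
    """Return the names of expected cut folders that are absent from raw_dir.

    Assumes folders are named cut1 ... cut{n_cuts}, as in the directory tree.
    """
    raw = Path(raw_dir)
    expected = [f"cut{i}" for i in range(1, n_cuts + 1)]
    return [name for name in expected if not (raw / name).is_dir()]
```

Calling find_missing_cuts("raw") from the repository directory returns an empty list when all 67 folders are in place.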

Processing Steps

Preprocessing: Run ssbh_preprocessing.ipynb to create the preprocessed/ folder with RGB, height, and mask data
Splitting: Run ssbh_prep_split.ipynb to create the prep_split/ folder with train/valid/test splits

Output Structure

After running both notebooks, you'll have:

ssbh_prep_split/
├── raw/                    # Your input data
├── preprocessed/           # Generated by preprocessing notebook
│   ├── rgb/               # RGB composite images
│   ├── height/            # Building height rasters
│   └── mask/              # Binary building masks
└── prep_split/            # Generated by split notebook
    ├── train/, valid/, test/  # Each containing rgb/, dsm/, sem/ folders
    └── split_visualization.png

Overview

This is a preprocessed, analysis-ready version of the SSBH dataset with fixed train/valid/test splits, designed for RGB-to-building-height deep learning tasks. The main contribution is the preprocessing pipeline, which converts Sentinel-2 RGB satellite imagery into consistently formatted, quality-checked inputs for building height regression and building segmentation.

Original Repository: https://github.com/CSPON2035/SSBH

Key Features:

RGB-Focused Approach: Uses only Sentinel-2 RGB bands for building analysis tasks
Multi-Task Learning Ready: Simultaneous building height estimation and building mask segmentation
Advanced Preprocessing: Smart interpolation, 16-bit RGB enhancement, and quality-validated processing
Analysis-Ready Format: All invalid values imputed, consistent data types, optimized storage
Fixed Splits: 70% Train / 10% Valid / 20% Test, reproducible through a fixed random seed (42)
Coverage: 62 urban centers in China with 67 data cuts from 2019 Sentinel-2 RGB composites (~5,606 samples)

Download

The preprocessed and split dataset is available for download from Google Drive:

🔗 Download SSBH Split Dataset

The download includes all train/valid/test splits with RGB imagery, building height rasters, and binary masks organized in the directory structure described below.

Dataset Structure

_prep_split/
├── train/                 # Training set (70% of samples)
│   ├── rgb/               # RGB composite images (3-band GeoTIFF, uint16)
│   ├── dsm/               # Building height rasters (single-band GeoTIFF, float32)
│   └── sem/               # Binary building masks (single-band GeoTIFF, uint8)
├── valid/                 # Validation set (10% of samples)
│   ├── rgb/               # RGB composite images
│   ├── dsm/               # Building height rasters
│   └── sem/               # Binary building masks
├── test/                  # Testing set (20% of samples)
│   ├── rgb/               # RGB composite images
│   ├── dsm/               # Building height rasters
│   └── sem/               # Binary building masks
├── split_visualization.png # Sample visualization from each split (see Visualization section)
└── README.md              # This file

File Naming Convention

All files follow a consistent naming pattern across modalities:

Format: cut{XX}_{tile_id}.tif
Example: cut01_0001.tif, cut23_0045.tif

The same base filename appears across all modalities (rgb/, dsm/, sem/) within each split folder.
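
Because the base filename is shared, the three modalities of a sample can be paired by name. The sketch below assumes the split layout shown in Dataset Structure (rgb/, dsm/, sem/ inside each split folder); the function names are hypothetical.

```python
from pathlib import Path

MODALITIES = ("rgb", "dsm", "sem")


def list_complete_samples(split_dir):
    """Return sorted filenames that exist in rgb/, dsm/ and sem/ of one split."""
    split = Path(split_dir)
    names = [{p.name for p in (split / m).glob("*.tif")} for m in MODALITIES]
    return sorted(set.intersection(*names))


def sample_paths(split_dir, filename):
    """Map each modality name to the path of the given sample file."""
    return {m: Path(split_dir) / m / filename for m in MODALITIES}
```

For example, sample_paths("prep_split/train", "cut01_0001.tif")["dsm"] points at the height raster that belongs to the RGB tile of the same name.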

Folder Naming Convention: The dsm/ and sem/ naming follows standard remote sensing conventions where building height/mask detection tasks are subcategories of broader Digital Surface Model (DSM) and semantic segmentation tasks respectively. DSM typically refers to elevation models that capture surface heights including buildings, vegetation, and terrain, while semantic segmentation encompasses pixel-wise classification tasks. In this dataset, we focus specifically on the building components of these broader task categories.

Data Processing Pipeline

The dataset underwent comprehensive preprocessing to create analysis-ready data:

RGB Processing: Sentinel-2 optical bands (B4-Red, B3-Green, B2-Blue) were combined into RGB composites. Original reflectance values (0-10000) were normalized to [0,1], enhanced using percentile contrast stretching (2nd-98th percentile), then scaled to uint16 format (0-65535) for efficient storage.

Reflectance Normalization: Raw satellite data measures the proportion of electromagnetic radiation reflected back from Earth's surface. Sentinel-2 stores these as integers from 0-10,000 (representing 0-100% reflectance), which are converted to the standard 0-1 range by dividing by the scale factor (10,000).
Percentile Stretching: A contrast enhancement technique that improves image visualization as follows:
1. Find the 2nd percentile (very dark values) and 98th percentile (very bright values) of each RGB channel
2. Remap these two values to the new 0 and 1 (black and white) points
3. Clip values outside that range, which removes extreme outliers and lets the remaining values use the full dynamic range
4. Apply the stretch to each channel separately, with the aim of keeping the colors balanced
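
The normalization and stretching steps described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the notebook's code: the function name is hypothetical, the percentiles are computed per tile (the README does not say whether the notebook uses per-tile or per-cut statistics), and raw values above 10000 are clipped to a reflectance of 1.

```python
import numpy as np


def stretch_to_uint16(bands, scale=10000.0, low=2.0, high=98.0):
    """Convert raw Sentinel-2 values of shape (H, W, 3), ordered R, G, B, to uint16.

    Steps: divide by the scale factor, stretch each channel between its
    low/high percentiles, then rescale to 0-65535.
    """
    # Assumption: reflectance is clipped to [0, 1] before stretching.
    refl = np.clip(bands.astype(np.float32) / scale, 0.0, 1.0)
    out = np.zeros_like(refl)
    for c in range(refl.shape[-1]):
        lo, hi = np.percentile(refl[..., c], [low, high])
        if hi <= lo:
            # Flat channel: leave it at zero rather than divide by zero.
            continue
        out[..., c] = np.clip((refl[..., c] - lo) / (hi - lo), 0.0, 1.0)
    return np.round(out * 65535.0).astype(np.uint16)
```

Note that after the stretch the stored values are relative to each channel's own percentiles, so they can no longer be read back as absolute reflectance.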

Height Data Processing: Raw building height rasters from crowdsourced floor count data were processed with smart interpolation to handle invalid values. All NoData, NaN, and negative values were successfully imputed using nearest neighbor or griddata methods, ensuring all final height values are non-negative.
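
As an illustration of the nearest neighbor option mentioned above, the sketch below fills every NaN, infinite, negative, or NoData pixel with the value of the closest valid pixel. It is one possible implementation and is not taken from the notebook: the function name and the nodata parameter are hypothetical, and it uses SciPy's Euclidean distance transform instead of griddata.

```python
import numpy as np
from scipy import ndimage


def fill_invalid_nearest(height, nodata=None):
    """Replace invalid height pixels with the nearest valid pixel's value."""
    h = np.asarray(height, dtype=np.float32).copy()
    # Invalid: non-finite or negative (non-finite values are mapped to -1 first).
    invalid = np.where(np.isfinite(h), h, -1.0) < 0
    if nodata is not None:
        invalid |= h == nodata
    if invalid.all():
        raise ValueError("no valid pixels to interpolate from")
    if invalid.any():
        # For each pixel, get the index of the nearest pixel where invalid is False.
        idx = ndimage.distance_transform_edt(
            invalid, return_distances=False, return_indices=True
        )
        h = h[tuple(idx)]
    return h
```

Valid pixels are returned unchanged because the nearest valid pixel to a valid pixel is itself.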

Building Mask Generation: Binary building masks were derived using a 0m height threshold, so pixels with a height above 0m are treated as building. Connected component filtering used a minimum size of 1 pixel, which removes no components; the mask is therefore a direct threshold conversion that preserves the original spatial characteristics. The 1-pixel minimum is appropriate given that each pixel covers 100m² (10m × 10m), which already represents a substantial building footprint.
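
A minimal sketch of the thresholding step, assuming that "0m threshold" means strictly greater than zero (with all heights non-negative, any other reading would mark every pixel as building). The function name and the min_pixels parameter are hypothetical; with the default of 1 the component filter is skipped, matching the direct threshold conversion described above.

```python
import numpy as np
from scipy import ndimage


def height_to_mask(height, threshold=0.0, min_pixels=1):
    """Return a uint8 mask (1: building, 0: non-building) from a height raster."""
    mask = np.asarray(height) > threshold
    if min_pixels > 1:
        # Label 4-connected components and drop those below the size limit.
        labels, _ = ndimage.label(mask)
        sizes = np.bincount(labels.ravel())
        keep = sizes >= min_pixels
        keep[0] = False  # label 0 is the background
        mask = keep[labels]
    return mask.astype(np.uint8)
```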

Split Configuration and Statistics

The dataset is split into 70% Train / 10% Valid / 20% Test using random seed 42 for reproducibility.

Sample Counts

Train: ~3,924 samples (70%)
Valid: ~560 samples (10%)
Test: ~1,122 samples (20%)
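
A seeded 70/10/20 split can be sketched with the standard library as shown below. This is an assumption about the general approach, not the notebook's code: the function name is hypothetical, and a different random number generator or shuffling order would assign files to different splits even with the same seed.

```python
import random


def split_samples(names, train=0.7, valid=0.1, seed=42):
    """Shuffle names with a fixed seed and cut them into train/valid/test lists."""
    names = sorted(names)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)
    n_train = int(train * len(names))
    n_valid = int(valid * len(names))
    return {
        "train": names[:n_train],
        "valid": names[n_train:n_train + n_valid],
        "test": names[n_train + n_valid:],
    }
```

With 5,606 names, truncating the fractions this way gives 3,924 / 560 / 1,122 samples, which agrees with the approximate counts listed above.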

Key Features

Geographic Coverage: All 67 data cuts represented across splits
Reproducible: Fixed random seed (42) ensures consistent splits
Quality Validated: Complete modalities (RGB, height, mask) required for inclusion

Data Specifications

RGB Imagery (`*/rgb/` folders)

Source: Sentinel-2 bands B4 (red), B3 (green), B2 (blue) from 2019 annual composites
Format: 3-band GeoTIFF
Data type: uint16 (0-65535 range)
Spatial resolution: 10m per pixel
Tile size: 256×256 pixels
Processing:
- Reflectance normalization: Raw values (0-10000) converted to physical reflectance (0-1) representing 0-100% surface reflection
- Percentile contrast stretching: 2nd-98th percentile enhancement applied per channel to improve contrast while preserving color balance
- Dynamic range optimization: Final scaling to uint16 format for efficient storage
Usage: Convert to [0,1] range for model training: rgb_normalized = rgb_data.astype(np.float32) / 65535.0

Height Data (`*/dsm/` folders)

Source: Building height rasters derived from crowdsourced floor count data
Format: Single-band GeoTIFF
Data type: float32
Units: Meters
Spatial resolution: 10m per pixel
Tile size: 256×256 pixels
Data Range: All values ≥ 0 (no NoData values - all gaps successfully imputed)
Processing: Smart interpolation applied to handle original invalid values, ensuring all final height values are non-negative

Building Masks (`*/sem/` folders)

Source: Binary masks derived from processed height data using threshold-based segmentation
Format: Single-band GeoTIFF
Data type: uint8 (0: non-building, 1: building)
Spatial resolution: 10m per pixel
Tile size: 256×256 pixels
Processing: 0m height threshold applied with connected component filtering (minimum 1 pixel). No morphological operations applied as direct threshold conversion proved more suitable after testing various morphological approaches.

Data Usage Guidelines

📢 SSBH Dataset Loading Notes

⚠️ Special handling needed for SSBH RGB files (16-bit TIFF):

Primary method (faster):

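import cv2
import numpy as np

# path: location of one rgb/*.tif file; cv2.imread returns None if it cannot be read
rgb = cv2.imread(path, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)  # BGR→RGB
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0  # 16-bit normalization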

📝 CV2 Flags:
- IMREAD_ANYDEPTH: Preserves original bit depth (16-bit)
- IMREAD_COLOR: Loads in color mode (3 channels)
- | operator: Combines both flags (bitwise OR)

Fallback:

import numpy as np
import tifffile

rgb = tifffile.imread(path)  # keeps the stored band order and bit depth
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0

✅ DSM & SEM files: Load normally with PIL

import numpy as np
from PIL import Image

dsm = np.array(Image.open(dsm_path)).astype(np.float32)  # heights in meters
sem = np.array(Image.open(sem_path)).astype(np.uint8)    # 0: non-building, 1: building
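
After loading, a quick consistency check against the data specifications can catch reading problems such as an unexpected channel layout. The helper below is a hypothetical sketch; it only checks properties stated in this README (three RGB channels, matching tile sizes, non-negative finite heights, binary masks).

```python
import numpy as np


def check_sample(rgb, dsm, sem):
    """Return a list of problems found in one loaded sample (empty if none)."""
    problems = []
    if rgb.ndim != 3 or rgb.shape[-1] != 3:
        problems.append(f"rgb should be (H, W, 3), got {rgb.shape}")
    if rgb.shape[:2] != dsm.shape or dsm.shape != sem.shape:
        problems.append("rgb, dsm and sem differ in height/width")
    if not np.isfinite(dsm).all() or (dsm < 0).any():
        problems.append("dsm contains NaN/inf or negative heights")
    if not np.isin(sem, (0, 1)).all():
        problems.append("sem contains values other than 0 and 1")
    return problems
```

An empty list means the sample matches the documented format; the check works on the arrays either before or after the RGB normalization.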

🌍 Note: SSBH is an example of true geospatial 16-bit RGB TIFFs [0,65535], unlike many other typical remote sensing datasets that may use TIFF format but are in fact represented as standard [0,255] values.

⚡ Learning Rate Impact: Normalizing 16-bit RGB imagery to the [0,1] range changes the input scale, which will most probably affect your usual learning rate. You may need a significantly smaller rate than with standard uint8 RGB datasets in the [0,255] range, so treat the learning rate as a hyperparameter to retune on the valid split.

Training and Evaluation Guidelines

Recommended Approach

Training: Use train split (70%) for model training
Valid: Use valid split (10%) for hyperparameter tuning and model selection
Testing: Use test split (20%) for final performance evaluation only

Evaluation Metrics

For Height Estimation: RMSE, MAE (in meters), R² coefficient
For Building Segmentation: IoU, F1 Score, Precision, Recall
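
The listed metrics can be computed with NumPy as in the sketch below. The function names are hypothetical, and the choices of pixel-wise evaluation over all pixels and of NaN for undefined ratios are assumptions rather than part of the dataset definition.

```python
import numpy as np


def height_metrics(pred, target):
    """RMSE and MAE (meters) and R² between predicted and reference heights."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    err = pred - target
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


def mask_metrics(pred, target):
    """IoU, precision, recall and F1 for binary building masks."""
    pred = np.asarray(pred).astype(bool).ravel()
    target = np.asarray(target).astype(bool).ravel()
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))

    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```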

Data Characteristics to Consider

Geographic scope: Chinese urban areas (62 cities, 67 data cuts)
Temporal consistency: 2019 Sentinel-2 composites throughout
Height accuracy: Based on crowdsourced building floor count data
Spatial resolution: 10m per pixel, 256×256 pixel tiles
Building focus: Original height data represents buildings exclusively

Visualization

Split Sample Visualization

The file split_visualization.png shows samples from each split with all modalities. It displays:

3 samples from each split (train/valid/test)
All modalities (RGB, Height, Building Mask) for each sample
Clear visual separation between different samples
Sample counts for each split labeled

Important Processing Notes

Key Technical Considerations

RGB Data: Stored in uint16 format, requires normalization to [0,1] for training. May require smaller learning rates than standard uint8 data.
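Reflectance Values: The RGB values originate from Sentinel-2's calibrated surface reflectance measurements (0-100%), which carry more meaningful spectral information than simple pixel intensities. Because the stored values have been percentile-stretched per channel, they are relative to each channel's range and should not be read back as absolute reflectance.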
Contrast Enhancement: Percentile stretching (2nd-98th percentile) enhances visual contrast while avoiding saturation from extreme outliers, making imagery more suitable for both visualization and model training.
Height Data: All values are non-negative (≥0m) after successful interpolation of original invalid values.
Building Masks: Direct threshold conversion (0m) with no morphological operations, preserving original spatial characteristics.
Spatial Context: Each 256×256 pixel tile covers 2.56km × 2.56km at 10m resolution, suitable for building-level analysis.

Known Limitations

Height accuracy depends on crowdsourced floor count data quality
Geographic scope limited to Chinese urban areas
10m pixel size limits detection of very small buildings
RGB-only spectral information may limit some applications

Citation

Paper Link: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/13506/135061G/SSBH--a-remote-sensing-building-height-dataset-for-deep/10.1117/12.3057532.full

If you use this split dataset, please cite the original SSBH paper:

@inproceedings{ma2025ssbh,
  title={SSBH: a remote sensing building height dataset for deep learning},
  author={Ma, Yao and Wang, Hao and Cao, Changhao},
  booktitle={Sixth International Conference on Geoscience and Remote Sensing Mapping (GRSM 2024)},
  volume={13506},
  pages={388--394},
  year={2025},
  organization={SPIE}
}

Processing History

Data Source: SSBH dataset covering 62 urban centers in China (67 data cuts) with 2019 Sentinel-2 composites
Preprocessing: RGB enhancement, height interpolation (100% success rate), and binary mask generation
Splitting: Random 70-10-20 split with seed 42 for reproducibility
Total Samples: ~5,606 samples (after validation and quality checks)
Version: v1.0 (2025)
