This repository provides reproduction scripts and a preprocessed, split dataset derived from the SSBH dataset, focused on RGB-based building height prediction and segmentation. Using only the Sentinel-2 RGB bands, its Jupyter notebooks transform the raw satellite imagery into an analysis-ready format for multi-task deep learning (building height regression plus building mask segmentation). The preprocessing pipeline covers interpolation of invalid height values, 16-bit RGB enhancement, and quality validation, and ends with fixed train/valid/test splits.
This repository provides complete reproducibility for the SSBH dataset preprocessing and splitting:
- Download the original SSBH dataset (67 cuts) from the official source
- Extract each cut archive into a raw/ folder in this repository directory
ssbh_prep_split/
├── raw/                          # Extract original SSBH cuts here
│   ├── cut1/                     # Extracted from cut1.zip
│   │   ├── sentinel2_AREA_b2/    # Blue band files
│   │   ├── sentinel2_AREA_b3/    # Green band files
│   │   ├── sentinel2_AREA_b4/    # Red band files
│   │   └── _bhr/                 # Building height files
│   ├── cut2/                     # Extracted from cut2.zip
│   ├── ...
│   └── cut67/                    # Extracted from cut67.zip
├── ssbh_preprocessing.ipynb      # Step 1: Raw → Preprocessed
├── ssbh_prep_split.ipynb         # Step 2: Preprocessed → Train/Valid/Test
└── README.md
- Preprocessing: Run ssbh_preprocessing.ipynb to create the preprocessed/ folder with RGB, height, and mask data
- Splitting: Run ssbh_prep_split.ipynb to create the prep_split/ folder with train/valid/test splits
After running both notebooks, you'll have:
ssbh_prep_split/
├── raw/                          # Your input data
├── preprocessed/                 # Generated by preprocessing notebook
│   ├── rgb/                      # RGB composite images
│   ├── height/                   # Building height rasters
│   └── mask/                     # Binary building masks
└── prep_split/                   # Generated by split notebook
    ├── train/, valid/, test/     # Each containing rgb/, dsm/, sem/ folders
    └── split_visualization.png
This is a preprocessed, analysis-ready version of the SSBH dataset with fixed train/valid/test splits, designed for RGB-to-building-height deep learning tasks. The main contribution is the preprocessing pipeline, which transforms Sentinel-2 RGB satellite imagery into consistently formatted inputs for building height regression and building segmentation.
Original Repository: https://github.com/CSPON2035/SSBH
Key Features:
- RGB-Focused Approach: Uses only Sentinel-2 RGB bands for building analysis tasks
- Multi-Task Learning Ready: Simultaneous building height estimation and building mask segmentation
- Advanced Preprocessing: Smart interpolation, 16-bit RGB enhancement, and quality-validated processing
- Analysis-Ready Format: All invalid values imputed, consistent data types, optimized storage
- Professional Splits: 70% Train / 10% Valid / 20% Test with fixed seed reproducibility
- Coverage: 62 urban centers in China with 67 data cuts from 2019 Sentinel-2 RGB composites (~5,606 samples)
The preprocessed and split dataset is available for download from Google Drive:
Download SSBH Split Dataset
The download includes all train/valid/test splits with RGB imagery, building height rasters, and binary masks organized in the directory structure described below.
_prep_split/
├── train/                    # Training set (70% of samples)
│   ├── rgb/                  # RGB composite images (3-band GeoTIFF, uint16)
│   ├── dsm/                  # Building height rasters (single-band GeoTIFF, float32)
│   └── sem/                  # Binary building masks (single-band GeoTIFF, uint8)
├── valid/                    # Validation set (10% of samples)
│   ├── rgb/                  # RGB composite images
│   ├── dsm/                  # Building height rasters
│   └── sem/                  # Binary building masks
├── test/                     # Testing set (20% of samples)
│   ├── rgb/                  # RGB composite images
│   ├── dsm/                  # Building height rasters
│   └── sem/                  # Binary building masks
├── split_visualization.png   # Sample visualization from each split (see Visualization section)
└── README.md                 # This file
All files follow a consistent naming pattern across modalities:
- Format: cut{XX}_{tile_id}.tif
- Example: cut01_0001.tif, cut23_0045.tif
The same base filename appears across all modalities (rgb/, dsm/, sem/) within each split folder.
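Because the base filename is shared, the three files of one sample can be located from a single name. The following is a minimal sketch of that lookup, assuming the split layout shown above; the helper names `sample_paths` and `complete_samples` are hypothetical and not part of the notebooks.

```python
from pathlib import Path

MODALITIES = ("rgb", "dsm", "sem")

def sample_paths(split_dir, filename):
    """Return the path of one sample in each modality folder of a split."""
    split_dir = Path(split_dir)
    return {m: split_dir / m / filename for m in MODALITIES}

def complete_samples(split_dir):
    """List filenames that are present in all three modality folders."""
    split_dir = Path(split_dir)
    names = [{p.name for p in (split_dir / m).glob("*.tif")} for m in MODALITIES]
    return sorted(set.intersection(*names))
```

Calling `complete_samples("prep_split/train")` on a downloaded copy is a quick way to confirm that every RGB tile has a matching height raster and mask.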
Folder Naming Convention: The dsm/ and sem/ naming follows standard remote sensing conventions where building height/mask detection tasks are subcategories of broader Digital Surface Model (DSM) and semantic segmentation tasks respectively. DSM typically refers to elevation models that capture surface heights including buildings, vegetation, and terrain, while semantic segmentation encompasses pixel-wise classification tasks. In this dataset, we focus specifically on the building components of these broader task categories.
The dataset underwent comprehensive preprocessing to create analysis-ready data:
RGB Processing: Sentinel-2 optical bands (B4-Red, B3-Green, B2-Blue) were combined into RGB composites. Original reflectance values (0-10000) were normalized to [0,1], enhanced using percentile contrast stretching (2nd-98th percentile), then scaled to uint16 format (0-65535) for efficient storage.
- Reflectance Normalization: Raw satellite data measures the proportion of electromagnetic radiation reflected back from Earth's surface. Sentinel-2 stores these as integers from 0-10,000 (representing 0-100% reflectance), which are converted to the standard 0-1 range by dividing by the scale factor (10,000).
- Percentile Stretching: A contrast enhancement technique that improves image visualization by:
  - Finding the 2nd percentile (very dark values) and 98th percentile (very bright values) for each RGB channel
  - Remapping these to become the new 0 and 1 (black and white) points
  - Removing extreme outliers while enhancing contrast using the full dynamic range
  - Applying the stretch per channel to maintain color balance
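The RGB steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the notebook's code: the function name `enhance_rgb` is hypothetical, the input is assumed to be an (H, W, 3) array already stacked in B4, B3, B2 order, and clipping reflectance to [0, 1] before stretching is an assumption.

```python
import numpy as np

def enhance_rgb(raw, scale=10000.0, low_pct=2.0, high_pct=98.0):
    """Normalize, stretch per channel, and rescale to uint16.

    raw: (H, W, 3) array of Sentinel-2 values in the 0-10000 range.
    """
    # Reflectance normalization: 0-10000 -> 0-1 (clipping is an assumption).
    refl = np.clip(raw.astype(np.float32) / scale, 0.0, 1.0)
    out = np.empty_like(refl)
    for c in range(refl.shape[-1]):
        # The 2nd and 98th percentiles become the new black and white points.
        lo, hi = np.percentile(refl[..., c], [low_pct, high_pct])
        span = max(hi - lo, 1e-6)  # guard against constant channels
        out[..., c] = np.clip((refl[..., c] - lo) / span, 0.0, 1.0)
    # Final scaling to the uint16 storage range.
    return np.round(out * 65535.0).astype(np.uint16)
```

Values below the 2nd percentile saturate to 0 and values above the 98th saturate to 65535, which is how the stretch discards extreme outliers.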
Height Data Processing: Raw building height rasters from crowdsourced floor count data were processed with smart interpolation to handle invalid values. All NoData, NaN, and negative values were successfully imputed using nearest neighbor or griddata methods, ensuring all final height values are non-negative.
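As an illustration of nearest-neighbor imputation, here is a brute-force NumPy sketch. It is not the notebook's implementation: the name `impute_nearest`, the optional `nodata` argument, and the choice to return zeros for a tile with no valid pixels are all assumptions. The pairwise distance matrix makes it suitable only for small arrays or tiles with few gaps; `scipy.interpolate.griddata` with `method="nearest"` is one option that scales better.

```python
import numpy as np

def impute_nearest(height, nodata=None):
    """Fill NaN, negative, and NoData pixels with the nearest valid value."""
    h = height.astype(np.float32)  # astype returns a copy
    with np.errstate(invalid="ignore"):
        invalid = ~np.isfinite(h) | (h < 0)
    if nodata is not None:
        invalid |= (h == nodata)
    if not invalid.any():
        return h
    if invalid.all():
        # Assumption: a tile without any valid pixel is treated as empty.
        return np.zeros_like(h)
    valid_rc = np.argwhere(~invalid)    # (n_valid, 2) row/col indices
    invalid_rc = np.argwhere(invalid)   # (n_invalid, 2) row/col indices
    # Squared Euclidean distance from every invalid pixel to every valid one.
    d2 = ((invalid_rc[:, None, :] - valid_rc[None, :, :]) ** 2).sum(axis=-1)
    nearest = valid_rc[d2.argmin(axis=1)]
    h[tuple(invalid_rc.T)] = h[tuple(nearest.T)]
    return h
```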
Building Mask Generation: Binary building masks were derived using a 0 m height threshold. Connected component filtering (minimum 1 pixel) was applied while preserving original spatial characteristics through direct threshold conversion. The 1-pixel minimum is appropriate given that each pixel covers a 100 m² (10 m × 10 m) area, representing a substantial building footprint.
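A direct threshold conversion can be sketched as follows. The function names are hypothetical, and the strict comparison (height > 0) is an assumption: since all heights are non-negative after imputation, it is the reading under which zero-height pixels count as non-building. A 1-pixel minimum component size removes nothing, so the filter is omitted here.

```python
import numpy as np

def building_mask(height, threshold=0.0):
    """Return 1 where height is strictly above the threshold, else 0 (uint8)."""
    return (height > threshold).astype(np.uint8)

def footprint_area_m2(mask, pixel_size_m=10.0):
    """Total building area, given that each pixel covers 10 m x 10 m."""
    return float(mask.sum()) * pixel_size_m ** 2
```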
The dataset is split into 70% Train / 10% Valid / 20% Test using random seed 42 for reproducibility.
- Train: ~3,924 samples (70%)
- Valid: ~560 samples (10%)
- Test: ~1,122 samples (20%)
- Geographic Coverage: All 67 data cuts represented across splits
- Reproducible: Fixed random seed (42) ensures consistent splits
- Quality Validated: Complete modalities (RGB, height, mask) required for inclusion
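A seeded 70/10/20 split can be sketched with the standard library alone. This is an assumption-laden illustration: the notebook may use a different random number generator or rounding rule, so this sketch will not reproduce the published file assignment. The name `split_samples` is hypothetical.

```python
import random

def split_samples(names, seed=42, train_frac=0.7, valid_frac=0.1):
    """Shuffle sample names with a fixed seed and cut into train/valid/test."""
    names = sorted(names)           # fix the order before shuffling
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_frac)
    n_valid = int(len(names) * valid_frac)
    return (
        names[:n_train],
        names[n_train:n_train + n_valid],
        names[n_train + n_valid:],  # the remainder goes to the test set
    )
```

With exactly 5,606 names, truncating the train and valid counts and assigning the remainder to test gives 3,924 / 560 / 1,122 samples, matching the approximate counts listed above.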
- Source: Sentinel-2 bands B4 (red), B3 (green), B2 (blue) from 2019 annual composites
- Format: 3-band GeoTIFF
- Data type: uint16 (0-65535 range)
- Spatial resolution: 10m per pixel
- Tile size: 256×256 pixels
- Processing:
- Reflectance normalization: Raw values (0-10000) converted to physical reflectance (0-1) representing 0-100% surface reflection
- Percentile contrast stretching: 2nd-98th percentile enhancement applied per channel to improve contrast while preserving color balance
- Dynamic range optimization: Final scaling to uint16 format for efficient storage
- Usage: Convert to [0,1] range for model training:
rgb_normalized = rgb_data.astype(np.float32) / 65535.0
- Source: Building height rasters derived from crowdsourced floor count data
- Format: Single-band GeoTIFF
- Data type: float32
- Units: Meters
- Spatial resolution: 10m per pixel
- Tile size: 256×256 pixels
- Data Range: All values ≥ 0 (no NoData values; all gaps successfully imputed)
- Processing: Smart interpolation applied to handle original invalid values, ensuring all final height values are non-negative
- Source: Binary masks derived from processed height data using threshold-based segmentation
- Format: Single-band GeoTIFF
- Data type: uint8 (0: non-building, 1: building)
- Spatial resolution: 10m per pixel
- Tile size: 256×256 pixels
- Processing: 0m height threshold applied with connected component filtering (minimum 1 pixel). No morphological operations applied as direct threshold conversion proved more suitable after testing various morphological approaches.
Primary method (faster):
import cv2
import numpy as np
# cv2.imread returns None if the file cannot be read; use the fallback in that case
rgb = cv2.imread(path, cv2.IMREAD_ANYDEPTH | cv2.IMREAD_COLOR)
rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)  # BGR→RGB
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0  # 16-bit normalization
CV2 Flags:
- IMREAD_ANYDEPTH: Preserves original bit depth (16-bit)
- IMREAD_COLOR: Loads in color mode (3 channels)
- | operator: Combines both flags (bitwise OR)
Fallback:
import tifffile
rgb = tifffile.imread(path)
if rgb.dtype == np.uint16:
    rgb = rgb.astype(np.float32) / 65535.0
DSM & SEM files: Load normally with PIL
from PIL import Image
dsm = np.array(Image.open(dsm_path)).astype(np.float32)
sem = np.array(Image.open(sem_path)).astype(np.uint8)
Note: SSBH is an example of true geospatial 16-bit RGB TIFFs [0,65535], unlike many other remote sensing datasets that use the TIFF format but actually store standard [0,255] values.
Learning Rate Impact: Normalizing the 16-bit RGB imagery to the [0,1] range will most likely affect your usual learning rate. You may need a significantly smaller rate than for standard uint8 RGB datasets in the [0,255] range.
- Training: Use train split (70%) for model training
- Valid: Use valid split (10%) for hyperparameter tuning and model selection
- Testing: Use test split (20%) for final performance evaluation only
For Height Estimation: RMSE, MAE (in meters), R² coefficient
For Building Segmentation: IoU, F1 Score, Precision, Recall
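The metrics listed above follow their standard definitions and can be computed with NumPy as sketched below. The function names and the conventions for empty denominators (0.0 for the ratio metrics, NaN for R² when the target is constant) are assumptions, not something the dataset prescribes.

```python
import numpy as np

def height_metrics(pred, target):
    """RMSE and MAE in meters, plus the R² coefficient."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    err = pred - target
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }

def mask_metrics(pred, target):
    """IoU, F1, precision, and recall for binary building masks."""
    pred = np.asarray(pred).astype(bool).ravel()
    target = np.asarray(target).astype(bool).ravel()
    tp = int(np.sum(pred & target))
    fp = int(np.sum(pred & ~target))
    fn = int(np.sum(~pred & target))

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }
```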
- Geographic scope: Chinese urban areas (62 cities, 67 data cuts)
- Temporal consistency: 2019 Sentinel-2 composites throughout
- Height accuracy: Based on crowdsourced building floor count data
- Spatial resolution: 10m per pixel, 256×256 pixel tiles
- Building focus: Original height data represents buildings exclusively
A comprehensive visualization shows samples from each split with all modalities:
The visualization displays:
- 3 samples from each split (train/valid/test)
- All modalities (RGB, Height, Building Mask) for each sample
- Clear visual separation between different samples
- Sample counts for each split labeled
- RGB Data: Stored in uint16 format, requires normalization to [0,1] for training. May require smaller learning rates than standard uint8 data.
- Reflectance Values: Represent true physical surface reflectance (0-100%) derived from Sentinel-2's calibrated measurements, providing more meaningful spectral information than simple pixel intensities.
- Contrast Enhancement: Percentile stretching (2nd-98th percentile) enhances visual contrast while avoiding saturation from extreme outliers, making imagery more suitable for both visualization and model training.
- Height Data: All values are non-negative (≥ 0 m) after successful interpolation of original invalid values.
- Building Masks: Direct threshold conversion (0 m) with no morphological operations, preserving original spatial characteristics.
- Spatial Context: Each 256×256 pixel tile covers 2.56 km × 2.56 km at 10 m resolution, suitable for building-level analysis.
- Height accuracy depends on crowdsourced floor count data quality
- Geographic scope limited to Chinese urban areas
- 10m pixel size limits detection of very small buildings
- RGB-only spectral information may limit some applications
If you use this split dataset, please cite the original SSBH paper:
@inproceedings{ma2025ssbh,
  title={SSBH: a remote sensing building height dataset for deep learning},
  author={Ma, Yao and Wang, Hao and Cao, Changhao},
  booktitle={Sixth International Conference on Geoscience and Remote Sensing Mapping (GRSM 2024)},
  volume={13506},
  pages={388--394},
  year={2025},
  organization={SPIE}
}
- Data Source: SSBH dataset covering 62 urban centers in China (67 data cuts) with 2019 Sentinel-2 composites
- Preprocessing: RGB enhancement, height interpolation (100% success rate), and binary mask generation
- Splitting: Random 70-10-20 split with seed 42 for reproducibility
- Total Samples: ~5,606 samples (after validation and quality checks)
- Version: v1.0 (2025)
