A clear, descriptive README.md for batch_sentinel_classify_full_fixed.py.
This document explains what the script does, how it discovers Sentinel SAFE/ZIP/TIF inputs, the processing pipeline, all CLI options, internal configuration mapping, outputs produced, recommended usage patterns for different dataset sizes, and troubleshooting tips.
batch_sentinel_classify_full_fixed.py is a self-contained Batch Sentinel-2 3-class classifier (water / vegetation / other) implemented in pure Python + NumPy + rasterio.
It is designed to safely handle large scenes by using chunked processing, randomized/incremental PCA, and a choice of KMeans variants (full, chunked, minibatch). The script focuses on robustness: defensive handling of missing bands, fallback water indices, chunk-safety for memory, and diagnostic outputs for debugging.
Key features
- Reads Sentinel SAFE folders or SAFE zipfiles (only scans
IMG_DATAsubfolders) and single multiband or single-band TIF/TIFF. - Builds per-pixel features (spectral bands, NDVI, water index, PCA components, DFT magnitude).
- Supports PCA modes:
sample,incremental,randomized. - Supports clustering modes:
full,chunked,minibatch. - Outputs classified GeoTIFF, visualization PNG, cluster stats CSV, per-scene area CSV, and a summary CSV for all scenes.
- (Optional) Start a tmux session for long runs:
brew install tmux
tmux new -s sentinel- Create & activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt- Run (example — default/
minibatchrecommended for large scenes):
python3 batch_sentinel_classify_full_fixed.py \
--input_dir input_data \
--outdir outputs \
--pca_mode randomized \
--randomized_components 10 \
--randomized_iters 3 \
--kmeans_mode minibatchThe script looks for scenes in --input_dir using the following logic:
-
SAFE zip files: any
.zipwhose filename contains.SAFEis opened and only files whose internal path contains/IMG_DATA/are considered. JP2 band filenames matched:B02.JP2, B03.JP2, B04.JP2, B08.JP2, B11.JP2. Paths are exposed to rasterio using/vsizip/<abs_zip>/<inner_path>. -
SAFE folders: directories ending with
.SAFEare searched but onlyIMG_DATAsubtrees are scanned for band JP2 files. This prevents picking up ancillary non-image files in the SAFE tree. -
TIF/TIFF files:
- If
rasterio.open(file).count >= 4, it's treated as amultiband_tif(expecting bands in order). - Otherwise, the filename is inspected for band tokens (
B02,B03,B04,B08,B11) and single-band entries are grouped together into scenes.
- If
-
After discovery, single-band files are merged by common basename into scene groups so the rest of the pipeline can read a full band set or a multiband TIF.
Required bands: B02 (green), B03 (blue), B04 (red), B08 (NIR). B11 (SWIR1) is optional: when missing, the script falls back from MNDWI to NDWI (McFeeters).
-
Read bands — read B02, B03, B04, B08 and optionally B11. Resampling happens to B08 shape when bands are in separate files.
-
Diagnostics — prints per-band stats (shape, min/max, zeros, NaNs) to help with data validity checks.
-
Masking — constructs a
valid_maskusing finite & non-zero tests, falling back to finite-only if everything was masked. -
Feature computation
- NDVI:
(NIR - RED) / (NIR + RED) - Water:
MNDWIusing (GREEN, SWIR1) if B11 available, otherwiseNDWI(GREEN, NIR) fallback. - PCA features:
compute_pca_features(simple PCA on stacked bands) used for additional descriptors. - DFT magnitude: FFT-based magnitude of first PCA component (normalized).
- Feature vector per pixel includes spectral bands, NDVI, water index, first two PCA components, and DFT magnitude.
- NDVI:
-
Preprocessing — flatten, filter by valid mask, standardize features (z-score).
-
Dimensionality reduction — options:
sample— PCA on a random sample (--pca_sample) then whiten.incremental— chunked covariance accumulation (exact but slower).randomized— randomized power-method PCA for approximate components.
-
Clustering — KMeans variants:
full— vectorized full KMeans (memory heavy).chunked— KMeans with chunked label assignment to reduce memory footprint.minibatch— Mini-batch KMeans (default) with minibatches for scaling.
-
Postprocessing
- Build full-label image (labels for masked pixels set to -1).
- Map clusters to classes (water=0, vegetation=1, other=2) using z-scored cluster-level heuristics on NDVI, water, and NIR means.
- Produce diagnostics (NDVI vs water scatter), cluster stats CSV, areas (pixels → m²), GeoTIFF and visualization PNG.
--input_dir (required) Path with SAFE folders/zips or TIF files.
--outdir (str) Output directory. Default: outputs
# PCA options
--pca_mode (sample|incremental|randomized) Default: sample
--pca_sample (int) Sample size for sample-mode PCA. Default: 100000
--pca_var_thresh (float) Variance retention threshold for PCA. Default: 0.95
--randomized_components (int) #components for randomized PCA. Default: 8
--randomized_iters (int) Power iterations for randomized PCA. Default: 3
--pca_chunk (int) Chunk size used during PCA incremental/randomized ops. Default: 200000
# KMeans / clustering
--kmeans_mode (full|chunked|minibatch) Default: minibatch
--kmeans_chunk (int) Chunk size for chunked-mode label assignment (defaults to 200000 internally if not provided)
--kmeans_batch (int) Minibatch size for minibatch kmeans. Default: 15000
--kmeans_iters (int) Max iterations for minibatch kmeans. Default: 2000
--kmeans_init_iters (int) Initialization minibatch iterations (minibatch mode). Default: 5
--kmeans_assign_chunk (int) Chunk used when assigning final labels in minibatch mode. Default: 200000
Note: The script normalizes some
NoneCLI values to safe defaults (e.g., chunk sizes default to200000insidemain()), so you may omit them if using standard settings.
kcfg— used forchunkedandfullpaths. Contains:mode, k, max_iter, tol, seed, chunk_size, batch_size, max_iter, init_iters, learning_rate, assign_chunk.kparams_for_minibatch— used forminibatchfunction call. Contains:k, batch_size, max_iter, init_iters, learning_rate, seed, assign_chunk.pcfg— PCA config:mode, sample_size, var_thresh, randomized_components, randomized_iters, eps, chunk_size.
The script currently hardcodes k = 3 cluster centers (water / veg / other). If you need a variable k or variable seed, add CLI flags and pass them to the kcfg / kparams_for_minibatch dicts.
For each detected scene <scene_id> the script writes to <outdir>/<scene_id>/:
<scene_id>_classified.tif— single-band uint8 GeoTIFF with class codes (0=water, 1=vegetation, 2=other).<scene_id>_classified_vis.png— RGB visualization (blue=water, green=veg, red=other).<scene_id>_ndvi_vs_<water_index>.png— diagnostic scatter of NDVI vs water-index (sampled points).<scene_id>_cluster_stats.csv— per-cluster statistics and heuristic scores used for mapping to classes.<scene_id>_areas.csv— pixel counts and computed area (m²) per class.outputs/summary_all_scenes.csv— aggregated table of scenes and their areas.
Default / recommended for big scenes (minibatch):
python3 batch_sentinel_classify_full_fixed.py \
--input_dir input_data \
--outdir outputs \
--pca_mode randomized \
--randomized_components 10 \
--randomized_iters 3 \
--kmeans_mode minibatchChunked KMeans (lower memory than full):
python3 batch_sentinel_classify_full_fixed.py \
--input_dir input_data \
--outdir outputs \
--pca_mode sample \
--pca_sample 100000 \
--kmeans_mode chunked \
--kmeans_chunk 200000Full KMeans (only when dataset is small enough to fit in memory):
python3 batch_sentinel_classify_full_fixed.py \
--input_dir input_data \
--outdir outputs \
--pca_mode sample \
--kmeans_mode fullIncremental PCA + chunked KMeans (exact PCA for very large data):
python3 batch_sentinel_classify_full_fixed.py \
--input_dir input_data \
--outdir outputs \
--pca_mode incremental \
--pca_chunk 200000 \
--kmeans_mode chunked \
--kmeans_chunk 200000- Memory tuning: increase
--kmeans_chunk,--kmeans_assign_chunk, and--pca_chunkto reduce I/O passes but at the cost of RAM; decrease them to lower memory footprint. - Minibatch sizing: increase
--kmeans_batchfor better center stability per iteration if you have RAM; decrease for lower memory usage. - PCA strategy: use
randomizedfor speed with large D/N and approximate components; useincrementalfor exact PCA when accuracy matters and you can accept the I/O cost;sampleis a practical default. - Reproducibility: the script uses a fixed RNG seed (
42) internally. Add--seedinto CLI and pass it into thekcfgand minibatch params if you need variable seeds or reproducible runs with a different seed. - Missing bands: If B11 is missing the script will use
NDWI (GREEN,NIR)as a fallback water index. Required bands are B02,B03,B04,B08.
- "No scenes discovered" — ensure
--input_dircontains.SAFEfolders, SAFE-like.zipfiles, or TIF/TIFFs. Check that zip names include.SAFEif using SAFE zips. - "No valid pixels" — check per-band diagnostics printed to logs. The script treats
0andNaNas invalid by default. If your sensor uses fill values not equal to0(or if0is a valid value), adjust the masking logic. - Long runtimes / OOM errors — reduce chunk sizes and minibatch sizes, or use
tmuxto keep sessions alive and monitor memory. - Unexpected class mapping — check
*_cluster_stats.csvto see per-cluster NDVI/water/NIR means and heuristic scores; tweak mapping heuristics inmap_clusters()if needed.
Common modifications you may want:
- Add CLI flags for
--k(cluster count) and--seed. - Add logging verbosity level to toggle debug prints.
- Add optional multi-threaded I/O or use rasterio windowed reads for micro-optimizations.
- Add a simple progress bar for per-chunk operations.