xclim-timber: Climate Data Processing Pipeline

A robust Python pipeline for processing climate raster data and calculating climate indices using the xclim library. This pipeline efficiently handles large climate datasets from external drives, supporting both GeoTIFF and NetCDF formats.

Features

Multi-format Support: Load climate data from GeoTIFF and NetCDF files
Parallel Processing: Leverages Dask for efficient processing of large datasets
Comprehensive Indices: Calculate 80 climate indices including:
- Temperature indices (frost days, tropical nights, growing degree days, temperature variability)
- Precipitation indices (consecutive dry/wet days, extreme precipitation)
- Agricultural indices (growing season length, PET, corn heat units)
- Drought indices (SPI at 5 time windows, comprehensive dry spell metrics)
- Extreme event indices (heat waves, cold spells, spell frequency)
Data Quality Control: Automatic outlier detection and missing value handling
CF Compliance: Outputs follow Climate and Forecast (CF) conventions
Flexible Configuration: YAML-based configuration for easy customization

Installation

Clone the repository:

git clone https://github.com/yourusername/xclim-timber.git
cd xclim-timber

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Quick Start

One-Time Setup

Generate baseline percentiles for extreme indices (required for temperature, precipitation, and multivariate pipelines):

python calculate_baseline_percentiles.py

This is a one-time operation (~20-30 minutes) that calculates day-of-year percentiles from the 1981-2000 baseline period for temperature extremes, precipitation extremes, and multivariate thresholds. The results are cached in data/baselines/baseline_percentiles_1981_2000.nc (10.7 GB) and reused by all future runs.

Running the Pipelines

Run temperature pipeline (35 indices - Phase 9):

python temperature_pipeline.py

Run precipitation pipeline (13 indices - Phase 6):

python precipitation_pipeline.py

Run humidity pipeline (8 indices):

python humidity_pipeline.py

Run human comfort pipeline (3 indices):

python human_comfort_pipeline.py

Run multivariate pipeline (4 indices):

python multivariate_pipeline.py

Run agricultural pipeline (5 indices - Phase 8):

python agricultural_pipeline.py

Run drought pipeline (12 indices - Phase 10 Final):

python drought_pipeline.py

All pipelines process the 1981-2024 period by default. Use --start-year and --end-year to select a different range:

python temperature_pipeline.py --start-year 2000 --end-year 2020
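
How the pipeline scripts parse these flags is not shown in this README. As a rough sketch only, assuming argparse and a hypothetical helper named parse_years, the two options could be handled like this:

```python
import argparse


def parse_years(argv=None):
    """Parse --start-year/--end-year (hypothetical helper, not the project's code).

    Defaults mirror the 1981-2024 period stated above.
    """
    parser = argparse.ArgumentParser(description="Sketch of a pipeline CLI")
    parser.add_argument("--start-year", type=int, default=1981)
    parser.add_argument("--end-year", type=int, default=2024)
    args = parser.parse_args(argv)
    if args.start_year > args.end_year:
        parser.error("--start-year must not be later than --end-year")
    return args.start_year, args.end_year


# Passing an explicit argument list keeps the example independent of sys.argv.
print(parse_years(["--start-year", "2000", "--end-year", "2020"]))  # (2000, 2020)
```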

Configuration

Edit the configuration file to customize:

Data paths: Location of input data on external drive
File patterns: Patterns to identify climate variable files
Processing options: Chunk sizes, Dask workers, resampling frequency
Climate indices: Select which indices to calculate
Output format: NetCDF or GeoTIFF

Example configuration snippet:

data:
  input_path: /media/external_drive/climate_data
  output_path: ./outputs
  file_patterns:
    temperature: ['*tas*.tif', '*temp*.nc']
    precipitation: ['*pr*.tif', '*precip*.nc']

processing:
  chunk_size:
    time: 365
    lat: 100
    lon: 100
  dask:
    n_workers: 4
    memory_limit: 4GB

indices:
  temperature:
    - tg_mean
    - frost_days
    - growing_degree_days
  precipitation:
    - prcptot
    - rx1day
    - cdd
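
To see what the file_patterns entries select, the following sketch applies the glob patterns from the snippet above to a few made-up file names. The classify_file helper and the file names are assumptions for illustration; the real loader may resolve patterns differently.

```python
from fnmatch import fnmatchcase

# Patterns copied from the example configuration above.
FILE_PATTERNS = {
    "temperature": ["*tas*.tif", "*temp*.nc"],
    "precipitation": ["*pr*.tif", "*precip*.nc"],
}


def classify_file(filename, patterns=FILE_PATTERNS):
    """Return the variables whose glob patterns match the file name."""
    return [
        variable
        for variable, globs in patterns.items()
        if any(fnmatchcase(filename, glob) for glob in globs)
    ]


print(classify_file("tasmax_1981.tif"))  # ['temperature']
print(classify_file("precip_1981.nc"))   # ['precipitation']
print(classify_file("elevation.tif"))    # []
```

Short patterns such as `*pr*.tif` are broad: they would also match a file called pressure.tif, so it is worth testing patterns against a real directory listing before a long run.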

Usage Examples

Basic Pipeline Usage

from src.pipeline import ClimateDataPipeline

# Initialize pipeline
pipeline = ClimateDataPipeline('config.yaml')

# Run complete pipeline
pipeline.run()

Loading Data Only

from src.config import Config
from src.data_loader import ClimateDataLoader

config = Config('config.yaml')
loader = ClimateDataLoader(config)

# Load temperature data
temp_data = loader.load_variable_data('temperature')
print(f"Loaded data shape: {dict(temp_data.dims)}")

Calculating Specific Indices

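from src.config import Config
from src.data_loader import ClimateDataLoader
from src.indices_calculator import ClimateIndicesCalculator

config = Config('config.yaml')
calculator = ClimateIndicesCalculator(config)

# Load the input dataset as in "Loading Data Only"
temp_dataset = ClimateDataLoader(config).load_variable_data('temperature')

# Calculate temperature indices
temp_indices = calculator.calculate_temperature_indices(temp_dataset)

# Save results
calculator.save_indices('outputs/indices.nc')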

Climate Indices

This pipeline currently implements 80 validated climate indices (35 temperature + 13 precipitation + 8 humidity + 3 human comfort + 4 multivariate + 5 agricultural + 12 drought) achieving 100% of the 80-index goal. All indices follow World Meteorological Organization (WMO) standards and CF (Climate and Forecast) conventions using the xclim library.

Underlying Climate Variables

The pipeline processes these core climate variables:

Temperature Data:

tas: Near-surface air temperature (daily mean)
tasmax: Daily maximum near-surface air temperature
tasmin: Daily minimum near-surface air temperature

Precipitation Data:

pr: Daily precipitation amount

Humidity Data:

hus: Specific humidity (kg/kg)
hurs: Relative humidity (%)

Variable Name Flexibility: The system supports multiple naming conventions:

Temperature: 'tas', 'temperature', 'temp', 'tmean', 'tasmax', 'tmax', 'tasmin', 'tmin'
Precipitation: 'pr', 'precipitation', 'precip', 'prcp'
Humidity: 'hus', 'huss', 'specific_humidity', 'hurs', 'relative_humidity', 'rh'

Temperature Indices (35 indices - Currently Implemented, Phase 9 Complete)

Basic Statistics (3):

tg_mean: Annual mean temperature
tx_max: Annual maximum temperature
tn_min: Annual minimum temperature

Temperature Range Metrics (2):

daily_temperature_range: Mean daily temperature range (tmax - tmin)
extreme_temperature_range: Annual max(tmax) - min(tmin) (annual extremes span)

Threshold-Based Counts (6):

tropical_nights: Number of nights with minimum temperature > 20°C
frost_days: Number of days with minimum temperature < 0°C
ice_days: Number of days with maximum temperature < 0°C
summer_days: Number of days with maximum temperature > 25°C
hot_days: Number of days with maximum temperature > 30°C
consecutive_frost_days: Maximum consecutive frost days
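
The threshold counts above reduce to counting days that satisfy a comparison. The pipeline computes them with xclim on gridded data; the plain-Python sketch below only illustrates the definitions on a made-up five-day series, and count_days is a hypothetical helper.

```python
def count_days(values, predicate):
    """Count the days whose value satisfies the predicate."""
    return sum(1 for value in values if predicate(value))


# Made-up daily minimum and maximum temperatures in deg C.
tasmin = [-3.0, -0.5, 1.2, 4.0, 21.5]
tasmax = [-1.0, 2.0, 8.5, 26.0, 31.0]

frost_days = count_days(tasmin, lambda t: t < 0.0)        # 2
ice_days = count_days(tasmax, lambda t: t < 0.0)          # 1
summer_days = count_days(tasmax, lambda t: t > 25.0)      # 2
hot_days = count_days(tasmax, lambda t: t > 30.0)         # 1
tropical_nights = count_days(tasmin, lambda t: t > 20.0)  # 1

print(frost_days, ice_days, summer_days, hot_days, tropical_nights)
```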

Frost Season Indices (4):

frost_season_length: Duration from first to last frost (agricultural planning)
frost_free_season_start: Julian day of last spring frost (planting date)
frost_free_season_end: Julian day of first fall frost (harvest planning)
frost_free_season_length: Days between last spring and first fall frost

Degree Day Metrics (4):

growing_degree_days: Accumulated temperature above 10°C threshold (crop development)
heating_degree_days: Accumulated temperature below 17°C threshold (energy demand)
cooling_degree_days: Accumulated temperature above 18°C threshold (cooling energy demand)
freezing_degree_days: Accumulated temperature below 0°C (winter severity)
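
Each degree-day metric accumulates the daily departure of mean temperature from a threshold, counting only days on the relevant side of it. A minimal sketch of that arithmetic, using the thresholds listed above and a made-up series (degree_days is a hypothetical helper, not the xclim function):

```python
def degree_days(tas, threshold, above=True):
    """Sum daily departures from the threshold on one side only (deg C days)."""
    if above:
        return sum(max(t - threshold, 0.0) for t in tas)
    return sum(max(threshold - t, 0.0) for t in tas)


# Made-up daily mean temperatures in deg C.
tas = [-4.0, 2.0, 12.0, 16.0, 21.0]

growing = degree_days(tas, 10.0)               # 2 + 6 + 11 = 19
heating = degree_days(tas, 17.0, above=False)  # 21 + 15 + 5 + 1 = 42
cooling = degree_days(tas, 18.0)               # 3
freezing = degree_days(tas, 0.0, above=False)  # 4

print(growing, heating, cooling, freezing)
```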

Extreme Percentile-Based Indices (6) - Uses 1981-2000 Baseline:

tx90p: Warm days (daily maximum temperature > 90th percentile)
tn90p: Warm nights (daily minimum temperature > 90th percentile)
tx10p: Cool days (daily maximum temperature < 10th percentile)
tn10p: Cool nights (daily minimum temperature < 10th percentile)
warm_spell_duration_index: Warm spell duration (≥6 consecutive warm days)
cold_spell_duration_index: Cold spell duration (≥6 consecutive cold days)
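
The percentile-based indices compare each day with a threshold derived from the same calendar day in the baseline period. The sketch below shows that idea for tx90p with plain Python. It is a simplification under stated assumptions: the helper names are hypothetical, the data are made up, and it does not pool neighbouring calendar days into a window as production implementations usually do.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0-100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def doy_thresholds(baseline, q):
    """baseline maps day-of-year to the values observed on that day in baseline years."""
    return {doy: percentile(values, q) for doy, values in baseline.items()}


def percent_exceeding(year_values, thresholds):
    """Percentage of days whose value is above that day's threshold."""
    hits = sum(1 for doy, value in year_values.items() if value > thresholds[doy])
    return 100.0 * hits / len(year_values)


# Two calendar days with eleven made-up baseline values each.
baseline = {
    1: [float(v) for v in range(0, 11)],
    2: [float(v) for v in range(10, 21)],
}
thresholds = doy_thresholds(baseline, 90)
tx90p = percent_exceeding({1: 9.5, 2: 18.0}, thresholds)
print(thresholds, tx90p)
```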

Advanced Temperature Extremes (8) - Phase 7:

growing_season_start: First day when temperature exceeds 5°C for 5+ consecutive days (ETCCDI standard)
growing_season_end: First day after July 1st when temperature drops below 5°C for 5+ consecutive days
cold_spell_frequency: Number of discrete cold spell events (temperature < -10°C for 5+ days)
hot_spell_frequency: Number of hot spell events (tasmax > 30°C for 3+ days)
heat_wave_frequency: Number of heat wave events (tasmin > 22°C AND tasmax > 30°C for 3+ days)
freezethaw_spell_frequency: Number of freeze-thaw cycles (tasmax > 0°C AND tasmin ≤ 0°C on same day)
last_spring_frost: Last day in spring when tasmin < 0°C (critical for agriculture)
daily_temperature_range_variability: Average day-to-day variation in daily temperature range (climate stability)

Temperature Variability (2) - Phase 9:

temperature_seasonality: Annual temperature coefficient of variation (standard deviation as percentage of mean) - ANUCLIM BIO4 variable
heat_wave_index: Total days that are part of a heat wave (5+ consecutive days with tasmax > 25°C)

Precipitation Indices (13 indices - Currently Implemented, Phase 6 Complete)

Basic Statistics (4):

prcptot: Total annual precipitation (wet days ≥ 1mm)
rx1day: Maximum 1-day precipitation amount
rx5day: Maximum 5-day precipitation amount
sdii: Simple daily intensity index (average precipitation on wet days)
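
As an illustration of these four definitions, the sketch below computes them for a short made-up series of daily totals in mm. The basic_precip_stats helper is hypothetical and works on a plain list; the pipeline itself relies on xclim.

```python
def basic_precip_stats(pr, wet_threshold=1.0, window=5):
    """Return prcptot, rx1day, rx5day and sdii for a series of daily totals (mm)."""
    wet = [p for p in pr if p >= wet_threshold]
    # Sums over every run of `window` consecutive days.
    rolling = [sum(pr[i:i + window]) for i in range(len(pr) - window + 1)]
    return {
        "prcptot": sum(wet),
        "rx1day": max(pr),
        "rx5day": max(rolling) if rolling else None,
        "sdii": sum(wet) / len(wet) if wet else None,
    }


pr = [0.0, 12.0, 0.4, 3.0, 25.0, 0.0, 1.0, 8.0]
stats = basic_precip_stats(pr)
print(stats)
```

For this series the wet days are 12, 3, 25, 1 and 8 mm, so prcptot is 49 mm and sdii is 9.8 mm per wet day.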

Consecutive Events (2):

cdd: Maximum consecutive dry days (< 1mm)
cwd: Maximum consecutive wet days (≥ 1mm)
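
Both indices are the length of the longest unbroken run of days meeting a condition. A minimal run-length sketch on a made-up series (longest_run is a hypothetical helper):

```python
def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best


# Made-up daily precipitation totals in mm.
pr = [0.0, 0.2, 5.0, 0.0, 0.0, 0.9, 2.0, 3.0]

cdd = longest_run(p < 1.0 for p in pr)   # 3 (days 4 to 6)
cwd = longest_run(p >= 1.0 for p in pr)  # 2 (days 7 to 8)
print(cdd, cwd)
```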

Extreme Percentile-Based Indices (2) - Uses 1981-2000 Baseline:

r95p: Very wet days (precipitation > 95th percentile of wet days)
r99p: Extremely wet days (precipitation > 99th percentile of wet days)

Fixed Threshold Indices (2):

r10mm: Heavy precipitation days (≥ 10mm)
r20mm: Very heavy precipitation days (≥ 20mm)

Enhanced Precipitation Analysis (3) - Phase 6:

dry_days: Total number of dry days (< 1mm)
wetdays: Total number of wet days (≥ 1mm)
wetdays_prop: Proportion of days that are wet

Humidity Indices (8 indices - Currently Implemented)

Dewpoint Statistics (4):

dewpoint_mean: Annual mean dewpoint temperature
dewpoint_min: Annual minimum dewpoint temperature
dewpoint_max: Annual maximum dewpoint temperature
humid_days: Days with dewpoint > 18°C (uncomfortable humidity)

Vapor Pressure Deficit (4):

vpdmax_mean: Annual mean maximum VPD
extreme_vpd_days: Days with VPD > 4 kPa (plant water stress)
vpdmin_mean: Annual mean minimum VPD
low_vpd_days: Days with VPD < 0.5 kPa (high moisture/fog potential)
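
This README does not say whether VPD is read directly from the input data or derived from other variables. If it had to be derived, one common approach is the Tetens approximation for saturation vapour pressure, sketched below with made-up values. The function names are hypothetical and this is not necessarily what the pipeline does.

```python
import math


def saturation_vapor_pressure_kpa(temp_c):
    """Tetens approximation of saturation vapour pressure over water (kPa)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_kpa(temp_c, dewpoint_c):
    """Vapour pressure deficit: saturation pressure at air temperature minus
    actual vapour pressure (saturation pressure at the dewpoint)."""
    deficit = saturation_vapor_pressure_kpa(temp_c) - saturation_vapor_pressure_kpa(dewpoint_c)
    return max(deficit, 0.0)


# Made-up (air temperature, dewpoint) pairs in deg C.
days = [(30.0, 15.0), (40.0, 10.0), (12.0, 11.0)]
vpd = [vpd_kpa(t, td) for t, td in days]

extreme_vpd_days = sum(1 for v in vpd if v > 4.0)
low_vpd_days = sum(1 for v in vpd if v < 0.5)
print([round(v, 2) for v in vpd], extreme_vpd_days, low_vpd_days)
```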

Human Comfort Indices (3 indices - Currently Implemented)

Heat Stress Assessment:

heat_index: Heat index combining temperature and humidity effects (apparent temperature)
humidex: Canadian humidex index for apparent temperature

Humidity Validation:

relative_humidity: Relative humidity calculated from dewpoint temperature (QC metric)
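
For orientation, the humidex is commonly computed from air temperature and dewpoint with the Masterton and Richardson formula. The sketch below implements that formula directly; it is an illustration of the index definition, and the exact method used by the pipeline's xclim call is not stated in this README.

```python
import math


def humidex(temp_c, dewpoint_c):
    """Humidex from air temperature and dewpoint, both in deg C."""
    dewpoint_k = dewpoint_c + 273.15
    # Vapour pressure in hPa derived from the dewpoint.
    vapor_pressure_hpa = 6.11 * math.exp(5417.7530 * (1.0 / 273.16 - 1.0 / dewpoint_k))
    return temp_c + (5.0 / 9.0) * (vapor_pressure_hpa - 10.0)


# A 30 deg C day with a 15 deg C dewpoint feels roughly 4 degrees warmer.
print(round(humidex(30.0, 15.0), 1))
```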

Multivariate Indices (4 indices - Currently Implemented)

Compound Climate Extremes - Uses 1981-2000 Baseline:

cold_and_dry_days: Days with temperature below 25th percentile AND precipitation below 25th percentile (compound drought conditions)
cold_and_wet_days: Days with temperature below 25th percentile AND precipitation above 75th percentile (flooding risk, winter storms)
warm_and_dry_days: Days with temperature above 75th percentile AND precipitation below 25th percentile (drought/fire weather)
warm_and_wet_days: Days with temperature above 75th percentile AND precipitation above 75th percentile (compound extreme events)

Scientific Context: These multivariate indices capture compound climate extremes that result from the interaction of multiple climate variables. They are increasingly important for climate change impact assessment, as compound events often have disproportionate impacts compared to single-variable extremes.
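
Each compound index counts the days on which both conditions hold at once. The sketch below shows the logic with fixed scalar thresholds and made-up data. That is a simplifying assumption: the pipeline uses percentiles from the 1981-2000 baseline, and count_compound_days is a hypothetical helper.

```python
def count_compound_days(tas, pr, tas_p25, tas_p75, pr_p25, pr_p75):
    """Count days falling in each temperature/precipitation quadrant."""
    counts = {
        "cold_and_dry_days": 0,
        "cold_and_wet_days": 0,
        "warm_and_dry_days": 0,
        "warm_and_wet_days": 0,
    }
    for t, p in zip(tas, pr):
        cold, warm = t < tas_p25, t > tas_p75
        dry, wet = p < pr_p25, p > pr_p75
        if cold and dry:
            counts["cold_and_dry_days"] += 1
        elif cold and wet:
            counts["cold_and_wet_days"] += 1
        elif warm and dry:
            counts["warm_and_dry_days"] += 1
        elif warm and wet:
            counts["warm_and_wet_days"] += 1
    return counts


# Made-up daily mean temperature (deg C) and precipitation (mm).
tas = [2.0, 3.0, 25.0, 26.0, 15.0]
pr = [0.1, 9.0, 0.0, 12.0, 2.0]
compound = count_compound_days(tas, pr, tas_p25=5.0, tas_p75=22.0, pr_p25=0.5, pr_p75=8.0)
print(compound)
```

The fifth day sits between both pairs of thresholds and is therefore not counted in any category.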

Agricultural Indices (5 indices - Currently Implemented, Phase 8 Complete)

Growing Season Analysis (1):

growing_season_length: Total days between first and last occurrence of 6+ consecutive days with temperature above 5°C (ETCCDI standard)

Water Balance (1):

potential_evapotranspiration: Annual potential evapotranspiration using Baier-Robertson 1965 method (temperature-only, suitable for regions without wind/radiation data)

Crop-Specific Indices (1):

corn_heat_units: Annual accumulated corn heat units for crop development and maturity prediction (USDA standard, widely used in North American agriculture)

Spring Thaw Monitoring (1):

thawing_degree_days: Sum of degree-days above 0°C (permafrost monitoring, spring melt timing, critical for northern latitudes)

Growing Season Water Availability (1):

growing_season_precipitation: Total precipitation during growing season (April-October, northern hemisphere)

Agricultural Value: These indices support agricultural decision-making including crop variety selection, planting timing, irrigation scheduling, and harvest planning. They are particularly valuable for adapting to climate change impacts on agriculture.

Drought Indices (12 indices - Currently Implemented, Phase 10 Final - 100% Complete)

Standardized Precipitation Index (5 windows):

spi_1month: 1-month SPI for short-term agricultural drought monitoring
spi_3month: 3-month SPI for seasonal agricultural drought (most common)
spi_6month: 6-month SPI for medium-term agricultural/hydrological drought
spi_12month: 12-month SPI for long-term hydrological drought monitoring
spi_24month: 24-month SPI for multi-year persistent drought conditions

Dry Spell Analysis (4 indices):

cdd: Maximum consecutive dry days (< 1mm precipitation) - ETCCDI standard
dry_spell_frequency: Number of distinct dry spell events (≥3 consecutive days with < 1mm precipitation)
dry_spell_total_length: Total days in all dry spells per year (cumulative dry spell duration)
dry_days: Total number of dry days per year (< 1mm precipitation threshold)
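
The spell metrics can be read as run-length statistics over dry days. The sketch below groups consecutive dry days and keeps runs of at least three days, following the wording above. It is an illustration with made-up data and a hypothetical helper name; the project's manual implementation may differ in detail.

```python
from itertools import groupby


def dry_spell_stats(pr, thresh=1.0, min_days=3):
    """Return (number of dry spells, total days in dry spells)."""
    run_lengths = [
        sum(1 for _ in run)
        for is_dry, run in groupby(pr, key=lambda p: p < thresh)
        if is_dry
    ]
    spells = [length for length in run_lengths if length >= min_days]
    return len(spells), sum(spells)


# Dry runs of 3, 2 and 4 days separated by single wet days.
pr = [0.0, 0.0, 0.5, 4.0, 0.0, 0.2, 6.0, 0.0, 0.0, 0.0, 0.0]
dry_spell_frequency, dry_spell_total_length = dry_spell_stats(pr)
print(dry_spell_frequency, dry_spell_total_length)  # 2 7
```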

Precipitation Intensity (3 indices):

sdii: Simple daily intensity index - average precipitation on wet days (ETCCDI standard)
max_7day_pr_intensity: Maximum precipitation over any 7-day rolling period (flood risk assessment)
fraction_heavy_precip: Fraction of annual precipitation from heavy events (> 75th percentile)

Drought Monitoring Value: SPI is a widely used standard for drought monitoring and follows the McKee et al. (1993) methodology. Multiple time windows enable detection of agricultural (1-6 months), hydrological (6-12 months), and long-term persistent (12-24 months) drought conditions. All SPI calculations use a 30-year calibration period (1981-2010) with gamma distribution fitting. The dry spell metrics characterize drought events by frequency, duration, and intensity.
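
The SPI idea can be sketched in a few steps: accumulate precipitation over the chosen window, fit a gamma distribution to the accumulated totals from the calibration period, and map each total's cumulative probability to a standard normal quantile. The code below is a simplified illustration using only the standard library, with several assumptions stated openly: it uses Thom's closed-form approximation for the gamma fit, assumes all totals are positive (no zero-precipitation handling), expects the totals passed to spi to belong to one season, and uses made-up numbers. All function names are hypothetical; the pipeline relies on xclim.

```python
import math
from statistics import NormalDist


def rolling_sums(monthly_pr, window):
    """Accumulate monthly totals over a trailing window (e.g. 3 for a 3-month SPI)."""
    return [sum(monthly_pr[i - window + 1:i + 1]) for i in range(window - 1, len(monthly_pr))]


def fit_gamma(values):
    """Thom's approximation to the maximum-likelihood gamma shape and scale.

    Requires positive values that are not all identical.
    """
    mean = sum(values) / len(values)
    a = math.log(mean) - sum(math.log(v) for v in values) / len(values)
    shape = (1.0 + math.sqrt(1.0 + 4.0 * a / 3.0)) / (4.0 * a)
    return shape, mean / shape


def gamma_cdf(x, shape, scale):
    """Regularized lower incomplete gamma function via its power series."""
    z = x / scale
    term = total = 1.0 / shape
    n = 1
    while abs(term) > 1e-12 * abs(total) and n < 1000:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))


def spi(values, calibration):
    """Standardize values against a gamma fit of the calibration totals."""
    shape, scale = fit_gamma(calibration)
    normal = NormalDist()
    scores = []
    for value in values:
        # Clamp to keep the inverse normal CDF finite at the extremes.
        p = min(max(gamma_cdf(value, shape, scale), 1e-6), 1.0 - 1e-6)
        scores.append(normal.inv_cdf(p))
    return scores


# Made-up 3-month totals (mm) for the same season in ten different years.
totals = [120.0, 95.0, 60.0, 150.0, 110.0, 80.0, 130.0, 45.0, 100.0, 105.0]
scores = spi(totals, totals)
print([round(s, 2) for s in scores])
print(rolling_sums([10.0, 20.0, 30.0, 40.0], 3))  # [60.0, 90.0]
```

Negative scores mark drier-than-normal periods and positive scores wetter ones, so the 45 mm season receives the lowest score and the 150 mm season the highest.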

🎉 80/80 Index Goal Achieved - 100% Complete!

All 80 planned climate indices have been successfully implemented across 7 comprehensive pipelines:

✅ 35 Temperature Indices (Phases 1-3, 7, 9)
✅ 13 Precipitation Indices (Phase 6)
✅ 8 Humidity Indices (Phase 2)
✅ 3 Human Comfort Indices (Phase 4)
✅ 4 Multivariate Indices (Phase 5)
✅ 5 Agricultural Indices (Phase 8)
✅ 12 Drought Indices (Phase 10)

Implementation Note: Three drought indices (dry_spell_frequency, dry_spell_total_length, max_7day_pr_intensity) were implemented using manual calculations to work around xclim unit compatibility issues, ensuring full coverage without compromising scientific accuracy.

Index Calculation Details

Processing Architecture:

Indices are calculated with the xclim library, apart from the three manually computed drought indices noted above
Annual frequency: Most indices use annual calculations (freq='YS')
Robust error handling: Each index calculation includes comprehensive error handling
CF-compliant metadata: All outputs follow Climate and Forecast conventions

Pipeline Architecture

xclim-timber/
├── src/                    # Core pipeline modules
│   ├── config.py           # Configuration management
│   ├── data_loader.py      # Data loading from various formats
│   ├── preprocessor.py     # Data cleaning and standardization
│   ├── indices_calculator.py # Climate indices calculation
│   └── pipeline.py         # Main orchestration
├── scripts/                # Processing and analysis scripts
│   ├── csv_formatter.py    # CSV format converter (long ↔ wide)
│   ├── efficient_extraction.py # Optimized point extraction
│   ├── fast_point_extraction.py # Alternative extraction method
│   └── visualize_temp.py   # Results visualization
├── data/                   # Data files
│   ├── test_data/          # Test datasets and coordinates
│   └── sample_data/        # Sample data for development
├── outputs/                # Processed results
├── logs/                   # Processing logs
├── docs/                   # Documentation
├── benchmarks/             # Performance benchmarks
└── requirements.txt        # Dependencies

Performance Optimization

Chunking: Data is automatically chunked for efficient memory usage
Parallel Processing: Dask enables parallel computation across multiple cores
Lazy Evaluation: Operations are queued and executed efficiently
Memory Management: Large datasets are processed without loading entirely into memory
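
To get a feel for what a chunk configuration implies, the sketch below tiles an array shape into chunk-sized blocks without any Dask dependency. The grid size is made up and chunk_slices is a hypothetical helper; Dask performs the equivalent partitioning internally.

```python
from itertools import product


def chunk_slices(shape, chunk_size):
    """Return tuples of slices that tile an array of the given shape."""
    ranges = [
        [slice(start, min(start + step, size)) for start in range(0, size, step)]
        for size, step in zip(shape, chunk_size)
    ]
    return list(product(*ranges))


# One year of daily data on a hypothetical 250 x 300 grid, with the
# (time, lat, lon) chunk sizes from the example configuration: 365, 100, 100.
chunks = chunk_slices((365, 250, 300), (365, 100, 100))
print(len(chunks))  # 1 * 3 * 3 = 9
print(chunks[-1])   # last block covers lat 200-250 and lon 200-300

# Approximate size of one full chunk stored as 4-byte floats.
bytes_per_chunk = 365 * 100 * 100 * 4
print(bytes_per_chunk)  # 14600000
```

A full chunk at these settings holds about 14.6 MB of float32 data, which is a useful number to compare against the per-worker memory limit when tuning.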

Command Line Interface

# Show help
python src/pipeline.py --help

# Run with verbose logging
python src/pipeline.py -c config.yaml --verbose

# Specify output directory
python src/pipeline.py -c config.yaml -o /path/to/output

# Process specific variables
python src/pipeline.py -c config.yaml -v temperature -v precipitation

Format CSV outputs (optional):

# Convert to both long and wide formats
python scripts/csv_formatter.py --input-dir outputs --output-dir outputs/formatted

Monitoring

Access the Dask dashboard during processing to monitor:

Worker activity
Memory usage
Task progress
Performance metrics

The dashboard is typically available at http://localhost:8787

Troubleshooting

Memory Issues

Reduce chunk sizes in configuration
Decrease number of Dask workers
Process variables separately

Missing Data

Check file patterns in configuration
Verify data path accessibility
Review logs for loading errors

Performance

Increase Dask workers for more parallelism
Optimize chunk sizes for your data
Consider temporal/spatial subsetting

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

License

This project is licensed under the MIT License.

Acknowledgments

Built on xclim for climate index calculations
Uses xarray for N-dimensional data handling
Powered by Dask for parallel computing
Supports rioxarray for geospatial operations

Citation

If you use this pipeline in your research, please cite:

xclim-timber: Climate Data Processing Pipeline
https://github.com/yourusername/xclim-timber

And the xclim library:

Bourgault et al., (2023). xclim: xarray-based climate data analytics. 
Journal of Open Source Software, 8(85), 5415, https://doi.org/10.21105/joss.05415
