
Data Retrieval

ETL Pipeline

Example use

To run the pipeline for Somalia (with iso3 code 'SOM'):

poetry install
poetry run python run_pipeline --country-iso3 SOM

Options:

  • -h, --help : show this help message and exit
  • --country-iso3 STR: Country ISO3 code to run the pipeline for. (required)
  • --run-id STR: Unique identifier for the pipeline run. (default: now)
  • --extract, --no-extract: Boolean to indicate if extract should be performed. (default: True)
  • --transform, --no-transform: Boolean to indicate if transform should be performed. (default: True)
  • --load, --no-load: Boolean to indicate if load should be performed. (default: True)
  • --debug, --no-debug: Boolean to indicate if logger level should be set to DEBUG instead of INFO. (default: False)

By default the pipeline runs the extract, transform and load steps. To turn off the extract step add the flag --no-extract; similarly, add --no-transform or --no-load to turn off the transform or load steps.
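
For example, to re-run only the transform step for Somalia when the data has already been extracted and you do not want to load yet:

poetry run python run_pipeline --country-iso3 SOM --no-extract --no-load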

Pipeline Configurations/Specification

One pipeline run is an ETL run for one country. A pipeline is constructed from the configuration in CONFIGS in retrievalpipelines/config.

Every country for which the pipeline can possibly run has a configuration file in retrievalpipelines/config (e.g., somalia.py). These files contain a class that implements the CountryConfig protocol defined in retrievalpipelines/config/base.py.

To add a configuration for a country, add a file for that country containing a CountryConfig class and register the class in _CONFIGS in retrievalpipelines/config/config.py.
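
As a purely illustrative sketch of adding a new country (the attribute names and the shape of _CONFIGS below are assumptions; check retrievalpipelines/config/base.py and config.py for the real interface):

# retrievalpipelines/config/kenya.py -- hypothetical example, attribute names are illustrative


class KenyaConfig:
    """Implements the CountryConfig protocol from retrievalpipelines/config/base.py (structurally)."""

    country_iso3 = "KEN"  # assumed attribute; see base.py for the fields the protocol actually requires
    # ... other attributes/methods required by CountryConfig


# retrievalpipelines/config/config.py -- register the new class in _CONFIGS (dict shape is an assumption)
# _CONFIGS = {
#     "SOM": SomaliaConfig,
#     "KEN": KenyaConfig,
# }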

Pipeline structure

The pipeline consists of Extractors, Transforms and Loaders.

Extractors/Transforms/Loaders are classes that adhere to the Extractor/Transformer/Loader Protocols.

Each extractor's extract method contains the logic to decide what to download and from where. It then uses a storage to download the file(s) and store them somewhere on the bronze data layer.

The transformer takes data from the bronze data layer, transforms it and saves intermediate results to the silver layer. Lastly, it transforms the silver data to the gold data layer.

The loader uploads the data from gold to where it can be used by other systems.
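
As an illustration of these roles (the method signatures below are assumptions; the actual Protocols in the code base may differ), they could look roughly like this:

from typing import Protocol


class Extractor(Protocol):
    def extract(self) -> None:
        """Decide what to download and from where, then store the files on the bronze layer."""
        ...


class Transformer(Protocol):
    def transform(self) -> None:
        """Read bronze data, save intermediate results to silver and write final results to gold."""
        ...


class Loader(Protocol):
    def load(self) -> None:
        """Upload gold data to where other systems can use it."""
        ...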

Storage

An extractor needs a storage that, given a URL and a path, downloads the file and saves it there. For example, the storage can be one that saves the file locally (LocalStorage) or on an Azure Blob Storage (AzureBlobStorage).
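
A rough sketch of the idea (the method name and signature are assumptions, not the project's actual interface):

from pathlib import Path
from typing import Protocol
import urllib.request


class Storage(Protocol):
    def download(self, url: str, path: str) -> None:
        """Download the file at url and save it at path."""
        ...


class LocalStorage:
    """Stores downloaded files on the local filesystem."""

    def download(self, url: str, path: str) -> None:
        Path(path).parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, path)  # an AzureBlobStorage would upload to a blob container instead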

Data extraction specifics

ECMWF

The ECMWF extractors require an account. Set the API key in the environment variable ECMWF_DATASTORES_KEY. When it is put in the .env file it is automatically picked up during tests.

Furthermore, for the ECMWF extractors you need to accept the Terms of use. This is done on the ECMWF website and requires you to be logged in with the account the API key belongs to. For example, for the extreme heat dataset see: https://cds.climate.copernicus.eu/datasets/derived-utci-historical?tab=download.
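
As a sketch of how the key could reach the code during tests (python-dotenv is an assumption here; the project may load .env differently):

import os

from dotenv import load_dotenv  # assumption: python-dotenv (or similar) is what loads the .env file

load_dotenv()  # picks up ECMWF_DATASTORES_KEY from a local .env file, if present
api_key = os.environ["ECMWF_DATASTORES_KEY"]  # raises KeyError when the variable is not set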

CHIRPS

TBA

WorldPop

TBA

IOM DTM

IOM DTM data requires an API key to access. See the documentation of the latest DTM API version (V3) for details. Set the API key in the environment variable DTM_API_KEY in the .env file.

IPC

For the IPC extractor an API key is needed. Set the API key: TBA

FEWS NET

TBA

Data transformation

After extraction, the transformations turn these datasets into district-level indicators suitable for multi-criteria prioritisation. The pipeline operates on three broad climate hazards (drought, flood and extreme heat) and produces baseline, recent and trend indicators for each, as well as other transformations.

Indicators

The list below summarises, for each climate hazard and period, what the indicator measures and the main input datasets and processing it uses.

  • Drought baseline: Quantifies long‑term drought vulnerability (1991–2020) by combining three components: variability of annual precipitation (coefficient of variation), frequency of years below average precipitation and the deficit relative to the mean. These are normalised and combined using a quadratic mean to produce a district‑level index. Inputs & processing: CHIRPS precipitation NetCDF for 1991–2020; data are reprojected to EPSG:3857, clipped to ADM2 boundaries and processed to compute CV, frequency and intensity before normalisation and combination.
  • Flood baseline: Measures long‑term flood exposure using the JRC Monthly Water History (1984–2021). The pipeline filters out permanent water, aggregates monthly flooded area to annual percentages and normalises the resulting values. Inputs & processing: JRC Monthly Water History v1.4 rasters; data are downloaded for the baseline period, reprojected, clipped to ADM2, non‑permanent water masked, aggregated and min–max normalised.
  • Extreme heat baseline: Captures the frequency, intensity and persistency of extreme heat during 1991–2020. Extreme heat is defined using a UTCI threshold (default 32 °C). The daily maximum UTCI and exceedances are computed, then normalised and combined. Inputs & processing: Hourly UTCI derived from ERA5 reanalysis (UTCI NetCDF); data are projected, clipped to ADM2, daily maxima calculated, exceedances counted and aggregated before normalisation and combination.
  • Drought recent: Assesses recent drought anomalies (typically the last 24 months) by computing the Standardised Precipitation Index (SPI) on CHIRPS monthly rainfall and deriving frequency, severity and persistency of severe drought (SPI < –1.5). Inputs & processing: CHIRPS monthly NetCDF; SPI is computed for the last two years, drought frequency/severity/persistency are derived and normalised before combining.
  • Flood recent: Quantifies flood exposure in the last two years using GFM products. Flooded area is aggregated monthly and normalised. Inputs & processing: Copernicus GFM Product GeoTIFFs; data are reprojected, clipped to ADM2, aggregated to annual flood area percentages and normalised.
  • Extreme heat recent: Measures recent (last two years) extreme heat frequency, intensity and persistency, using the same approach as the baseline but restricted to recent data. Inputs & processing: Recent UTCI (ERA5) hourly NetCDF; the number and degree of threshold exceedances and maximum consecutive exceedance duration are computed, normalised and combined.
  • Drought trend: Evaluates 30‑year trends in drought risk by fitting linear regressions to time series of drought frequency, intensity and persistency. Negative slopes are clipped to zero before normalisation. Inputs & processing: CHIRPS monthly NetCDF; SPI‑based metrics are computed for each year, trends estimated by linear regression and normalised before combination.
  • Flood trend: Tracks trends in flood extent over 1984–2021 by fitting a linear regression to annual flood percentages. Negative trends are set to zero and positive slopes normalised. Inputs & processing: JRC Monthly Water History v1.4 rasters; annual flooded area percentages are computed, trends estimated via linear regression and normalised.
  • Extreme heat trend: Assesses trends in extreme heat frequency, intensity and persistency over the last 30 years using UTCI data. Negative trends are clipped and remaining slopes normalised and combined. Inputs & processing: ERA5‑derived UTCI hourly NetCDF; annual time series of exceedance frequency/intensity/persistency are built, linear trends computed and normalised.
  • Displacement: TBA
  • Food insecurity: TBA
  • Population: TBA
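
To make the "normalised and combined using a quadratic mean" step of the drought baseline concrete, here is a small illustrative sketch (using numpy; variable names and values are assumptions, not the pipeline's actual implementation):

import numpy as np


def min_max_normalise(x: np.ndarray) -> np.ndarray:
    """Rescale an indicator component to the 0-1 range."""
    return (x - x.min()) / (x.max() - x.min())


# hypothetical per-district components of the drought baseline (values are illustrative)
cv = min_max_normalise(np.array([0.20, 0.35, 0.50]))         # variability of annual precipitation
frequency = min_max_normalise(np.array([0.40, 0.55, 0.30]))  # share of years below average precipitation
intensity = min_max_normalise(np.array([0.10, 0.60, 0.45]))  # deficit relative to the mean

# quadratic mean (root mean square) of the normalised components, one value per district
drought_baseline = np.sqrt((cv**2 + frequency**2 + intensity**2) / 3)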

To contribute

Dependencies

If needed, install poetry (or uv). Then install the dependencies with poetry install, or, if your poetry.lock is behind pyproject.toml, first resolve the dependencies with poetry lock.

Pre-commit

Activate pre-commit with poetry run pre-commit install.

When you make a commit it will first run some checks, which can be found in .pre-commit-config.yaml. This helps keep you from committing mistakes or messy code. It checks things like: valid JSONs, whether your type hints are correct, whether the code is formatted and whether the quality is ok. When the checks fail, the commit is stopped and you can fix the issues. If it can trivially fix things it will do this for you; you can then inspect the changes it made, add the file again and run the commit again. The rules the linter checks are specified in pyproject.toml. Explanations and reasons for these rules can be found in the Ruff rules doc.


Test

Run the tests in the tests folder with:

poetry run python -m pytest tests

The tests that need secrets load the environment variables from .env.

Run the tests that do not need secrets with:

poetry run python -m pytest tests -m "not needs_secrets"

To show the logs during testing run:

poetry run python -m pytest tests -o log_cli=true -s

CI

When you make a PR to main or dev, the CI pipeline will run (.github/workflows/ci.yaml).

This checks if the code is formatted and if the linter agrees with the code quality. Furthermore, it runs the tests in the folder tests that do not need secrets (i.e., do not have the mark 'needs_secrets') and checks if the coverage is above a certain threshold (to be set in ci.yaml).

About

Disaster exposure and vulnerability data retrieval
