🌊 This repository implements the state-of-the-art models that power Google FloodHub.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
The repository provides open-source replication of Google’s global flood-forecasting models. By open-sourcing these models, we aim to foster transparency, enable in-house integration in production systems, and accelerate academic research.
This repository is a fork of NeuralHydrology, which has been heavily modified and extended to support forecast sequences using the specific model architectures that are used operationally in the Google FloodHub.
This repository contains implementations of the core models used in Google's production forecasting systems.
The Mean Embedding Forecast LSTM is a forecasting model that uses separate embedding networks for hindcast and forecast inputs. It aggregates these inputs using masked means before passing them into the respective LSTMs for the hindcast and forecast periods.
- Status: Current production model (as of December 2025) for Google FloodHub.
- Reference: Gauch, Martin, et al. "How to deal with missing input data." Hydrology and Earth System Sciences (2025).
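The masked-mean idea can be illustrated in a few lines. This is a hypothetical sketch (the function name and data layout are made up for illustration, not the repository's actual API): each available input source contributes an embedding, and the mean is taken over only the sources marked present, so missing inputs simply drop out of the average.

```python
def masked_mean(embeddings, present):
    """Average only the embeddings whose mask entry is True.

    embeddings: list of equal-length vectors (one per input source).
    present: parallel list of booleans marking which sources exist.
    """
    kept = [e for e, p in zip(embeddings, present) if p]
    if not kept:
        # No source available: fall back to a zero vector.
        return [0.0] * len(embeddings[0])
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two of three sources present; the missing one is ignored, not zero-filled.
print(masked_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                  [True, True, False]))  # -> [2.0, 3.0]
```

Because absent sources are excluded rather than imputed, the average stays well defined whenever at least one input source is available.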
The State Handoff Forecast LSTM is a forecasting model that uses a state-handoff to transition from a hindcast sequence (LSTM) model to a forecast sequence (LSTM) model. The hindcast model runs from the past up to the present (the issue time of the forecast) and then passes the cell state and hidden state of the LSTM into a (nonlinear) handoff network, which is used to initialize a new LSTM that rolls out over the forecast period.
- Status: Former production model for Google FloodHub.
- Reference: Nearing, Grey, et al. "Global prediction of extreme floods in ungauged watersheds." Nature (2024).
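The handoff mechanism can be sketched as follows. This is a toy illustration with scalar states and made-up function names, not the repository's implementation: a hindcast cell runs up to the forecast issue time, its final state passes through a nonlinear handoff network, and the result initializes the forecast cell's rollout.

```python
import math

def toy_cell(state, x, w_s=0.9, w_x=0.1):
    # Minimal stand-in for a recurrent (LSTM-like) step:
    # blend the previous state with the current input.
    return math.tanh(w_s * state + w_x * x)

def handoff(state, weight=1.5, bias=0.1):
    # Nonlinear handoff network (a single tanh layer in this sketch).
    return math.tanh(weight * state + bias)

def forecast_rollout(hindcast_inputs, forecast_inputs):
    state = 0.0
    for x in hindcast_inputs:   # hindcast model runs up to "now"
        state = toy_cell(state, x)
    state = handoff(state)      # hand the state off
    outputs = []
    for x in forecast_inputs:   # forecast model rolls out from there
        state = toy_cell(state, x)
        outputs.append(state)
    return outputs

preds = forecast_rollout([0.2, 0.5, 0.8], [0.6, 0.4])
print(len(preds))  # -> 2, one prediction per forecast step
```

In the real model the handoff maps both the cell state and the hidden state of the hindcast LSTM into the initial states of the forecast LSTM; the sketch collapses this to a single scalar to keep the control flow visible.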
We recommend using Conda to manage dependencies like PyTorch and CUDA.
1. Create and activate the environment:

   ```bash
   # Create the environment from the file in the repo
   conda env create -f environments/conda.yml

   # Activate the environment (MANDATORY)
   conda activate googlehydrology
   ```

2. Install the package in editable mode so that changes to the source code are reflected immediately:

   ```bash
   # Run from the root of the repository
   pip install -e .
   ```
The most direct way to explore this repository is through our interactive tutorial: GoogleHydrology Tutorial Notebook.
What you will learn:
- Model Evaluation: Load pre-trained Google Hydrology models and calculate performance metrics (NSE, KGE) on real-world basin data.
- Fine-Tuning for Performance: Learn how to fine-tune the `static_attributes_fclayer`. This is a powerful technique for improving predictions on "outlier" basins (e.g., basins with unusual sizes or geology) without retraining the entire model.
- Visualizing Results: Compare model hydrographs against observed discharge data.
GoogleHydrology uses the Caravan dataset for streamflow observations and static catchment attributes.
A small sample is provided in tutorial/data/Caravan-nc. For full runs:
1. Visit the Zenodo repository.
2. Download the NetCDF version (Caravan-nc.tar.gz).
3. Unpack it locally:

   ```bash
   mkdir -p ~/data/
   tar -xvzf Caravan-nc.tar.gz -C ~/data/
   ```
The MultiMet forcing data extension is accessed directly from Google Cloud Storage. Ensure your configuration points to: gs://caravan-multimet/v1.1
The package installs the `run` command as the primary entry point.

Train a model:

```bash
run train --config-file /path/to/your/training_config_file.yml
```

Calculate performance metrics (NSE, KGE) on the test set:

```bash
run evaluate --run-dir /path/to/your/model_run/
```

Generate predictions (without skipping NaN observations):

```bash
run infer --run-dir /path/to/your/model_run/
```
Experiments are defined by YAML files. Update the following paths in your config (e.g., tutorial/training-config.yml):
- run_dir: Where weights and logs are saved.
- train_basin_file: Path to the list of basin IDs.
- targets_data_dir / statics_data_dir: Paths to your local Caravan NetCDF data.
- dynamics_data_dir: Path to forcing data (e.g., gs://caravan-multimet/v1.1).
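Putting those keys together, a minimal fragment of such a config might look like the following (all paths are placeholders for your local setup; every other setting in the example configs stays as it is):

```yaml
run_dir: /home/user/runs/my_experiment        # weights and logs are saved here
train_basin_file: /home/user/data/basins.txt  # list of basin IDs
targets_data_dir: /home/user/data/Caravan-nc  # local Caravan NetCDF data
statics_data_dir: /home/user/data/Caravan-nc
dynamics_data_dir: gs://caravan-multimet/v1.1 # MultiMet forcing data on GCS
```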
The ~/flood-forecasting/example-configs directory contains reference YAML files that define the experimental setups for different model architectures and datasets.
`floodhub-settings-config.yml`

- Model Architecture: `mean_embedding_forecast_lstm`
- Dataset: MultiMet (Global Caravan dataset)
- Description: Designed to replicate the training settings of the current (2025) operational FloodHub model as closely as possible within this open-source framework.

`handoff-forecast-lstm-config.yml`

- Model Architecture: `handoff_forecast_lstm`
- Dataset: MultiMet (Global Caravan dataset)
- Description: Provides the settings used for the former operational model. This configuration aligns with the methodology described in the Nature (2024) paper for global ungauged flood prediction.

`camels-multimet-mean-embedding-forecast-lstm-config.yml`

- Model Architecture: `mean_embedding_forecast_lstm`
- Dataset: CAMELS-US (531 basins)
- Description: A benchmarking configuration for the Mean Embedding model tailored to the CAMELS-US dataset, optimized for evaluating model stability and performance on a standard hydrological benchmark. Our team uses it as a reference point during model development and as the standard check that changes to the repository work as expected.

`camels-multimet-handoff-forecast-lstm-config.yml`

- Model Architecture: `handoff_forecast_lstm`
- Dataset: CAMELS-US (531 basins)
- Description: A benchmarking configuration for the State Handoff model tailored to the CAMELS-US dataset, used to compare the handoff approach against other architectures on US basin data.
If you encounter bugs, please use the GitHub Issue Tracker. Provide a clear description, steps to reproduce, and the expected behavior.