This Git project tracks the work related to the use of machine learning (ML) for tsunami onshore hazard prediction. The goal is to develop a surrogate model that can be linked with a regional offshore tsunami model, using offshore wave amplitude as a timeseries input.
The ML model is trained using simulation data provided by INGV and NGI for Eastern Sicily, with a focus on Catania and Siracusa. The dataset consists of 53550 events, and you can view the event details and data through the following HTML maps:
(Click the "Download Raw" button at the link to download the file; the HTML files were created with folium.)
The workflow for this project is as follows:
- Preprocessing and Data Analysis
  - Offshore statistics for all events and gauges
  - Onshore statistics for all events at both sites
  - Earthquake statistics for all events (already available)
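The offshore statistics above can be computed per gauge from the amplitude timeseries. A minimal sketch, assuming one time array and one amplitude array per gauge (the array names and the particular statistics chosen are illustrative assumptions, not the project's actual schema):

```python
import numpy as np

def offshore_gauge_stats(t, eta):
    """Summary statistics for one gauge's offshore amplitude timeseries.

    t   : 1-D array of times (s)
    eta : 1-D array of wave amplitudes (m), same length as t
    """
    i_max = int(np.argmax(eta))
    return {
        "eta_max": float(eta[i_max]),   # maximum amplitude
        "t_of_max": float(t[i_max]),    # arrival time of the maximum
        "eta_min": float(np.min(eta)),  # deepest trough
        "eta_std": float(np.std(eta)),  # overall variability
    }

# Synthetic example: a single damped wave packet
t = np.linspace(0, 3600, 721)
eta = np.exp(-t / 1800) * np.sin(2 * np.pi * t / 600)
stats = offshore_gauge_stats(t, eta)
```

The same pattern extends to onshore statistics (maximum depth per point, inundated area, etc.) by swapping in the onshore fields.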
- Selection of Events for Experiment
  - Events are selected by stratified sampling over source parameters (typically magnitude, displacement, depth, location, source type, etc.)
  - In this work we focus on the following characteristics and select training sets of different sizes:
    - Offshore wave amplitude at selected points (maximum, time of maximum, etc.)
    - Deformation characteristics (maximum, minimum, etc.)
    - Onshore inundation characteristics (maximum depth, area, etc.)
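Stratified sampling of this kind can be sketched with pandas: bin the events on one stratification parameter and draw the same fraction from each bin, so the selection preserves that parameter's distribution. The column names and bin edges below are illustrative assumptions, not the actual INGV/NGI catalogue schema:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy event catalogue (synthetic values)
events = pd.DataFrame({
    "event_id": np.arange(1000),
    "magnitude": rng.uniform(6.0, 9.0, 1000),
    "depth_km": rng.uniform(1.0, 40.0, 1000),
})

# Stratify by magnitude bin; sample 10% of each bin without replacement
events["mag_bin"] = pd.cut(events["magnitude"], bins=np.arange(6.0, 9.5, 0.5))
selected = events.groupby("mag_bin", observed=True).sample(frac=0.1, random_state=0)
```

In practice several parameters (depth, location, source type) would be combined into the strata, but the groupby/sample pattern stays the same.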
- Splitting the Event Selection
  - The selected events are divided into training and testing sets (75:25). In ensemble learning mode, a cross-validation approach with 4 shuffled folds is used across the training subset.
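The 75:25 split and the 4-fold partition can be sketched in a few lines of numpy (the event count of 2000 is a placeholder assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
event_ids = rng.permutation(2000)  # placeholder for the selected event IDs

# 75:25 split into training and testing events
n_train = int(0.75 * event_ids.size)
train_ids, test_ids = event_ids[:n_train], event_ids[n_train:]

# 4 shuffled folds over the training events (one ensemble member per fold)
folds = np.array_split(rng.permutation(train_ids), 4)
```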
- Training the ML Model (Pretraining and Fine-tuning of an Encoder-Decoder Model) and Prediction
  - The ML model is trained on the training set, with hyperparameter tuning guided by the test set.
  - Pretraining an offshore encoder (using a large dataset, not just the limited training set)
  - Pretraining a deformation encoder (using a large dataset, not just the limited training set)
  - Training an onshore decoder using the training set (as full simulation data is limited)
  - Fine-tuning the encoder, interface, and decoder using the training set (again limited by the available full simulation data)
  - For the single encoder-decoder model, only one set of predictions is made on the test set.
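The pretrain-then-decode idea can be illustrated with a deliberately linear stand-in: a PCA encoder fitted on a large unlabelled offshore dataset, and a least-squares decoder fitted on the small fully simulated set. This is only an analogue of the actual neural encoder-decoder; all sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) "Pretrain" the offshore encoder on a large unlabelled dataset:
#    here, a PCA projection fitted to many offshore timeseries.
X_large = rng.normal(size=(5000, 64))   # 5000 offshore timeseries, 64 samples each
mu = X_large.mean(axis=0)
_, _, Vt = np.linalg.svd(X_large - mu, full_matrices=False)
encode = lambda X: (X - mu) @ Vt[:8].T  # 8-dimensional latent code

# 2) Train the onshore decoder on the small fully simulated training set:
#    a least-squares map from latent code to onshore inundation depths.
X_train = rng.normal(size=(200, 64))    # only 200 events with full simulations
Y_train = rng.normal(size=(200, 10))    # onshore depths at 10 control points
Z_train = encode(X_train)
W, *_ = np.linalg.lstsq(Z_train, Y_train, rcond=None)

# Prediction for an event reuses the pretrained encoder
Y_pred = encode(X_train[:1]) @ W
```

The design point survives the simplification: the encoder is fitted where data is abundant, and only the (smaller) decoder must be learned from the limited full simulations.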
- Training the ML Model (Stochastic Version) and Prediction
  - Here 4 encoder-decoder models are trained, one on each fold subset of the training data.
  - For the stochastic version, each of the four fold models generates 100 realisations for every event in the test set.
  - The final predictions are presented as the mean with uncertainty bounds (±2σ).
- Model Performance Evaluation
  - The performance of the model is assessed using the held-out test dataset:
    - Evaluation at control points to check misfit and bias in the classification of flooding
    - Evaluation at all inundation locations (using a single goodness-of-fit metric) and for subsets of different types
    - Evaluation for events of specific magnitudes, sources, locations, and tsunami parameters, presented as maps and boxplots
    - Evaluation of results for different training approaches and training-set sizes
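The wet/dry classification check at control points can be expressed as a small confusion-matrix computation; the 0.1 m flooding threshold and the specific scores are illustrative assumptions, not the project's chosen metrics:

```python
import numpy as np

def flood_classification_scores(depth_sim, depth_pred, threshold=0.1):
    """Wet/dry agreement between simulated and emulated depths at control points.

    A point counts as flooded when depth exceeds `threshold` metres.
    """
    wet_sim = depth_sim > threshold
    wet_pred = depth_pred > threshold
    tp = np.sum(wet_sim & wet_pred)     # correctly predicted flooded
    tn = np.sum(~wet_sim & ~wet_pred)   # correctly predicted dry
    fp = np.sum(~wet_sim & wet_pred)    # false alarms
    fn = np.sum(wet_sim & ~wet_pred)    # missed flooding
    accuracy = (tp + tn) / depth_sim.size
    bias = (tp + fp) / max(tp + fn, 1)  # >1 means over-prediction of flooding
    return {"accuracy": float(accuracy), "bias": float(bias)}

# Toy example: 5 control points
depth_sim = np.array([0.0, 0.5, 1.2, 0.05, 0.3])
depth_pred = np.array([0.0, 0.4, 1.0, 0.2, 0.0])
scores = flood_classification_scores(depth_sim, depth_pred)
```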
- Model Application
  - The results are used to generate PTHA inundation maps for the regions of interest.
  - The results are compared with HPC-based results for the full PTHA event set, and for subsets of events that cause local deformation and events that do not.
  - The inundation hazard is used with OpenQuake to implement an event-based risk analysis. These results are used to benchmark the emulated hazard, for different training sizes, against HPC-based results.
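The step from per-event depth predictions to a PTHA hazard curve at one location can be sketched as summing annual event rates over the events whose predicted depth exceeds each threshold. Rates, depths, and thresholds below are synthetic stand-ins for the real event-set values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-event annual occurrence rates and emulated maximum depth at one location
rates = rng.uniform(1e-6, 1e-4, size=5000)               # events / year
depths = rng.lognormal(mean=-1.0, sigma=1.0, size=5000)  # metres

# Hazard curve: annual rate of exceeding each depth threshold
thresholds = np.array([0.1, 0.5, 1.0, 2.0, 5.0])
exceed_rate = np.array([rates[depths > d].sum() for d in thresholds])
```

Repeating this over a grid of locations yields the inundation hazard maps; the same curves computed from emulated and HPC depths give the benchmark comparison.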