Commit 4a37903

Clean notebooks and refine documentation
1 parent aea755e commit 4a37903

File tree: 6 files changed, +132 −971 lines changed

README.md

Lines changed: 12 additions & 2 deletions
```diff
@@ -1,7 +1,7 @@
 # Machine Learning Tropical Cyclones Detection
 
 ## Overview
-The repository provides a Machine Learning (ML) library to set up training and validation of a Tropical Cyclones (TCs) Detection model. ERA5 reanalysis and the International Best Track Archive for Climate Stewardship (IBTrACS) data are used as input and the target, respectively. Input-Output data pairs are provided as Zarr data stores.
+The repository provides a Machine Learning (ML) library to set up training and validation of a Tropical Cyclones (TCs) Detection model and run the tracking. ERA5 reanalysis and the International Best Track Archive for Climate Stewardship (IBTrACS) data are used as input and the target, respectively. Input-Output data pairs are provided as Zarr data stores.
 
 The model can use the following input drivers:
 - 10m wind gust [$\frac{m}{s}$]
```
```diff
@@ -32,7 +32,9 @@ The _train.py_ script takes advantage of the Command Line Interface (CLI) to pas
 - `--devices` argument defines the number of GPU devices per node to run the training on.
 - `--num_nodes` argument defines the total number of nodes that will be used.
 
-The total number of GPUs used during the training can be obtained by simply multiplying `devices * num_nodes`.
+The total number of GPUs used during the training can be obtained by simply multiplying `devices * num_nodes`.
+
+A bash script for the training, _train.sh_, is also provided under the same folder.
 
 With regards to the configuration file, it must be prepared in toml format. The configuration file is structured as follows:
```
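The devices-times-nodes arithmetic described above can be sketched with a minimal CLI stub. The argument names `--devices` and `--num_nodes` come from the README; everything else (defaults, helper names) is illustrative, not the actual _train.py_ code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Minimal sketch of the two distribution-related CLI arguments
    # documented above; the real train.py parser has more options.
    parser = argparse.ArgumentParser(description="TC detection training (sketch)")
    parser.add_argument("--devices", type=int, default=1,
                        help="number of GPU devices per node")
    parser.add_argument("--num_nodes", type=int, default=1,
                        help="total number of nodes")
    return parser

def total_gpus(devices: int, num_nodes: int) -> int:
    # Total GPUs used during training = devices per node * number of nodes.
    return devices * num_nodes

args = build_parser().parse_args(["--devices", "4", "--num_nodes", "2"])
print(total_gpus(args.devices, args.num_nodes))  # 8
```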

```diff
@@ -84,6 +86,10 @@ With regards to the configuration file, it must be prepared in toml format. The
 - drop_remainder: whether or not to drop the last batch if the number of dataset elements is not divisible by the batch size
 - accumulation_steps: number of gradient accumulation steps before calling backward propagation
 
+### Pre-processing workflow
+
+A workflow based on PyOphidia for preparing CMIP6 data for TC detection is provided under the `workflows` folder.
+
 ## How to
 
 ### Download IBTrACS
```
```diff
@@ -99,6 +105,10 @@ Since the TC Detection case study relies on IBTrACS dataset, it must be download
 
 To download ERA5 data you need a CDS account and the set of IBTrACS for which the related ERA5 data is gathered. The script `era5_gathering.py` under `src/dataset` can be used for this purpose.
 
+## Example notebooks
+
+Example notebooks for executing and evaluating a trained ML model are provided under the `notebooks` folder.
+
 ## Python3 Environment
 The code has been tested on Python 3.11.2 with the following dependencies:
 
```
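As a rough illustration of pairing IBTrACS positions with the ERA5 fields gathered around them, here is a hypothetical helper computing a bounding box (in CDS-style N/W/S/E order) around a storm centre. The actual logic lives in `era5_gathering.py` and may well differ; the helper name and the 5-degree half-width are assumptions:

```python
def storm_bbox(lat: float, lon: float, half_width: float = 5.0):
    # Hypothetical helper: a half_width-degree box centred on an
    # IBTrACS storm position, clipped to valid latitudes and with
    # longitudes wrapped into [-180, 180).
    north = min(lat + half_width, 90.0)
    south = max(lat - half_width, -90.0)
    west = (lon - half_width + 180.0) % 360.0 - 180.0
    east = (lon + half_width + 180.0) % 360.0 - 180.0
    return north, west, south, east  # CDS-style N/W/S/E ordering

print(storm_bbox(20.0, 175.0))  # (25.0, 170.0, 15.0, -180.0)
```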

notebook/README.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -1,4 +1,4 @@
-### Notebook 1 – inference_noteook.ipynb
+### Notebook 1 – inference_notebook.ipynb
 
 This notebook is designed to perform inference using the trained model for TCs detection and to apply a tracking algorithm to identify the trajectories of the detected systems. It can be used both on historical and projection data.
 
```

```diff
@@ -8,7 +8,7 @@ Workflow
 First, the user specifies:
 - `main_dir`: root directory of the project.
 - `dataset_dir`: path to the climate dataset to be analyzed (e.g., CMIP6, NICAM, ERA5).
-- `model_dir`: path to the pre-trained model to be used for inference.
+- `run_name`: name of the pre-trained model on MLflow to be used for inference.
 - `ibtracs_src`: path to the **IBTrACS** file used as ground truth for validation.
 - `year`: the year on which inference will be performed.
 - `device`: compute device (`cpu`, `cuda`, `mps`, etc.).
```
```diff
@@ -17,7 +17,7 @@ Workflow
 2. Model and dataset loading
 Standard cells are provided to:
 - Load the pre-trained model
-- Load the input dataset
+- Select and load the input dataset. For CMIP6 data the climate model and time period can be selected.
 - Prepare the data for inference
 
 3. Inference
```

notebook/inference_notebook.ipynb

Lines changed: 11 additions & 4 deletions
```diff
@@ -55,7 +55,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Select the model by specfying the run name from the MLFlow and download model, scaler and provenance document"
+"Select the model by specifying the `run_name` from MLflow and download the model, scaler and provenance document"
 ]
 },
 {
```
```diff
@@ -97,7 +97,7 @@
 "source": [
 "## Inference workflow on historical data\n",
 "\n",
-"Let's get the data on a given time frame (year and month) for the evaluation"
+"Let's select the historical data (ERA5) on a given time frame (year and month) for the evaluation"
 ]
 },
 {
```
```diff
@@ -146,7 +146,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can now detect and localize the TC centers with the ML model and load also the observed TCs"
+"We can now detect and localize the TC centers with the ML model and load the observed TCs from IBTrACS data"
 ]
 },
 {
```
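Comparing detected TC centers with the observed ones is typically done with a distance threshold. A hedged sketch of such a matching step (the helper names and the 300 km threshold are assumptions, not the notebook's actual code):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometres.
    R = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def match_center(detected, observed, max_km=300.0):
    # Hypothetical matching rule: a detected centre counts as a hit
    # if the nearest observed centre lies within max_km of it.
    best = min(observed, key=lambda o: haversine_km(*detected, *o))
    return best if haversine_km(*detected, *best) <= max_km else None

obs = [(18.2, 132.5), (25.0, -75.0)]
print(match_center((18.0, 132.0), obs))  # (18.2, 132.5)
```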
```diff
@@ -206,6 +206,13 @@
 "### Compare with observations"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Compute POD and FAR of the detected track"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
```
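POD (Probability of Detection) and FAR (False Alarm Ratio) are standard contingency-table scores. A minimal sketch from hit/miss/false-alarm counts; the notebook's own implementation may differ:

```python
def pod(hits: int, misses: int) -> float:
    # Probability of Detection: fraction of observed TC points
    # that the model actually detected.
    return hits / (hits + misses) if hits + misses else 0.0

def far(hits: int, false_alarms: int) -> float:
    # False Alarm Ratio: fraction of detections that do not
    # correspond to any observed TC point.
    return false_alarms / (hits + false_alarms) if hits + false_alarms else 0.0

print(pod(8, 2), far(8, 2))  # 0.8 0.2
```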
```diff
@@ -248,7 +255,7 @@
 "source": [
 "## Inference workflow on projection data\n",
 "\n",
-"Let's get CMIP6 data on a given time frame (year and month) for the evaluation"
+"Let's select CMIP6 data on a given time frame (year and month) for the evaluation"
 ]
 },
 {
```

notebook/inference_notebook_test.ipynb

Lines changed: 61 additions & 80 deletions
Large diffs are not rendered by default.
