digitalearthafrica · mickwelli · Apr 27, 2025 · Jan 22, 2025 · Jan 23, 2025 · Jan 23, 2025
diff --git a/Use_cases/Limpopo_River_Basin/01_dam_volume_data_preprocessing.ipynb b/Use_cases/Limpopo_River_Basin/01_dam_volume_data_preprocessing.ipynb
diff --git a/Use_cases/Limpopo_River_Basin/02_dam_volume_model_training.ipynb b/Use_cases/Limpopo_River_Basin/02_dam_volume_model_training.ipynb
@@ -0,0 +1,396 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Modelling dam volumes using DEA waterbodies\n",
+    "# Section 02  : *Model Training*\n",
+    "**Products used:** \n",
+    "[DE Africa Waterbodies](https://docs.digitalearthafrica.org/en/latest/data_specs/Waterbodies_specs.html), \n",
+    "[Department of Water Affairs and Sanitation, South Africa Dam Level and Volume Data](https://www.dws.gov.za/Hydrology/Verified/hymain.aspx)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Background\n",
+    "\n",
+    "### Digital Twin (DT)\n",
+    "The CGIAR Digital Twin initiative creates dynamic virtual models that combine real-time data, AI, and simulations to improve decision-making. Its prototype for the Limpopo River Basin focuses on enhancing water resource management and conservation."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Description\n",
+    "This notebook presents a workflow for predicting dam levels and volumes using water surface area data from DE Africa's Waterbodies product, integrating data preprocessing, feature extraction, and Gradient Boosting modeling. \n",
+    "\n",
+    "As part of the CGIAR Initiative on Digital Innovation, this work contributes to a prototype Digital Twin for the Limpopo River Basin, designed to support real-time decision-making in water management. The Digital Twin leverages AI-driven tools to visualize and simulate the impact of decisions on the basin's ecosystem. To enhance prediction reliability, the model includes a correction mechanism to address unrealistic large drops in dam volume estimates.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Getting started\n",
+    "\n",
+    "To run this analysis, run all the cells in the notebook, starting with the \"Load packages\" cell. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Load packages\n",
+    "Import Python packages that are used for the analysis."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%matplotlib inline\n",
+    "\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "import seaborn as sns\n",
+    "import datacube\n",
+    "import joblib\n",
+    "\n",
+    "from scipy import interpolate\n",
+    "from scipy.optimize import curve_fit\n",
+    "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score\n",
+    "from sklearn.preprocessing import StandardScaler, PolynomialFeatures\n",
+    "from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error\n",
+    "from sklearn.impute import SimpleImputer\n",
+    "from sklearn.ensemble import GradientBoostingRegressor\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import plotly.graph_objs as go\n",
+    "import plotly.express as px\n",
+    "import plotly.graph_objects as go\n",
+    "\n",
+    "from deafrica_tools.waterbodies import get_waterbody, get_time_series, display_time_series\n",
+    "from IPython.display import Image"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load Model Training Datasets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "### Sample Data Overview\n",
+    "\n",
+    "This dataset contains raw water levels data collected from DEA (Department of Environmental Affairs) in South Africa. The cell below reads this ancillary data necessary to conduct the volume prediction."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Preparation and Feature Selection\n",
+    "This section involves selecting the relevant features and handling missing values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "merged_data = pd.read_csv(\"data/preprocess_data.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "features = merged_data[['calculated_level', 'water_area_ha']]\n",
+    "target = merged_data['Dam_Level']\n",
+    "print(f\"Initial data shape: {features.shape}\")\n",
+    "print(f\"Missing values in features:\\n{features.isnull().sum()}\")\n",
+    "\n",
+    "imputer = SimpleImputer(strategy='mean')\n",
+    "features_imputed = imputer.fit_transform(features)\n",
+    "\n",
+    "features_imputed_df = pd.DataFrame(features_imputed, columns=features.columns)\n",
+    "print(f\"Data shape after imputing missing values: {features_imputed_df.shape}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Splitting Data for Training and Testing\n",
+    "Here, we split the data into training and test sets."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train, X_test, y_train, y_test = train_test_split(features_imputed_df, target, test_size=0.2, random_state=42)\n",
+    "print(f\"Training set size: {X_train.shape[0]} rows\")\n",
+    "print(f\"Test set size: {X_test.shape[0]} rows\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Performing Cross-Validation\n",
+    "This step involves evaluating the model using cross-validation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Performing cross-validation...\")\n",
+    "gradient_boosting = GradientBoostingRegressor(random_state=42)\n",
+    "cv_scores = cross_val_score(gradient_boosting, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
+    "cv_rmse = np.sqrt(-cv_scores).mean()\n",
+    "print(f\"Cross-validated RMSE: {cv_rmse:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Hyperparameter Tuning\n",
+    "We perform hyperparameter tuning using GridSearchCV to find the best combination of parameters for the Gradient Boosting Regressor."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "param_grid = {\n",
+    "    'n_estimators': [100, 200, 300],\n",
+    "    'learning_rate': [0.01, 0.1, 0.2],\n",
+    "    'max_depth': [3, 4, 5]\n",
+    "}\n",
+    "print(\"Performing hyperparameter tuning using GridSearchCV...\")\n",
+    "grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')\n",
+    "grid_search.fit(X_train, y_train)\n",
+    "best_model = grid_search.best_estimator_\n",
+    "print(f\"Best parameters: {grid_search.best_params_}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Training the Best Model\n",
+    "We now train the model with the optimal hyperparameters found in the previous step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"\\nTraining the best model with optimal hyperparameters...\")\n",
+    "best_model.fit(X_train, y_train)\n",
+    "print(\"Model training completed.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Model Evaluation\n",
+    "We evaluate the trained model on the test dataset using RMSE, MAPE, and R² Score as performance metrics."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_pred = best_model.predict(X_test)\n",
+    "mse = mean_squared_error(y_test, y_pred)\n",
+    "rmse = np.sqrt(mse)\n",
+    "r2 = r2_score(y_test, y_pred)\n",
+    "mape = mean_absolute_percentage_error(y_test, y_pred)\n",
+    "print(f\"\\nModel evaluation results:\\n - RMSE: {rmse:.4f}\\n - MAPE: {mape:.4f}\\n - R² Score: {r2:.4f}\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Feature Importances\n",
+    "We analyze the importance of each feature to understand how much each one contributed to the model's predictions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"\\nFeature Importances:\")\n",
+    "feature_importances = best_model.feature_importances_\n",
+    "for feature, importance in zip(features.columns, feature_importances):\n",
+    "    print(f\"Feature: {feature}, Importance: {importance:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Saving the Trained Model\n",
+    "We save the trained model for future use."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"\\nSaving the trained model...\")\n",
+    "os.makedirs(\"trained_models\", exist_ok=True)\n",
+    "model_path = \"trained_models/gradient_boosting_model.pkl\"\n",
+    "joblib.dump(best_model, model_path)\n",
+    "print(f\"Trained model saved successfully at: {model_path}\")\n",
+    "y_pred_full = best_model.predict(features_imputed_df)\n",
+    "prediction = pd.Series(y_pred_full)\n",
+    "\n",
+    "#saving test and train data\n",
+    "prediction.to_csv(\"data/prediction_data.csv\")\n",
+    "target.to_csv(\"data/test_data.csv\")\n",
+    "print(\"test and train data is saved....\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "------------\n",
+    "## Additional information\n",
+    "\n",
+    "<b> License </b> The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).\n",
+    "\n",
+    "Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.\n",
+    "\n",
+    "<b> Contact </b> If you need assistance, please post a question on the [DE Africa Slack channel](https://digitalearthafrica.slack.com/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).\n",
+    "\n",
+    "If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).\n",
+    "\n",
+    "<b> Compatible datacube version </b>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**References:**\n",
+    "- Garcia Andarcia, M., Dickens, C., Silva, P., Matheswaran, K., & Koo, J. (2024). Digital Twin for management of water resources in the Limpopo River Basin: a concept. Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 4p.\n",
+    "- Gurusinghe, T., Muthuwatta, L., Matheswaran, K., & Dickens, C. (2024). Developing a foundational hydrological model for the Limpopo River Basin using the Soil and Water Assessment Tool Plus (SWAT+). Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 14p.\n",
+    "- Leitão, P. C., Santos, F., Barreiros, D., Santos, H., Silva, P., Madushanka, T., Matheswaran, K., Mutuwatte, L., Vickneswaran, K., Retief, H., Dickens, C., Garcia Andarcia, M. (2024). Operational SWAT+ Model: Advancing Seasonal Forecasting in the Limpopo River Basin. Colombo, Sri Lanka:  International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation.\n",
+    "- Mallick, Archita & Ghosh, Surajit & De Sarkar, Kounik & Roy, Sudip. (2024). Reservoir Water Level Forecasting using Deep Learning Technique. Conference: 2024 IEEE India Geoscience and Remote Sensing Symposium"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Embed into project background: \n",
+    "The CGIAR Digital Innovation Initiative accelerates the transformation towards sustainable and inclusive agrifood systems by generating research-based evidence and innovative digital solutions. It is one of 32 initiatives of CGIAR, a global research partnership for a food-secure future, dedicated to transforming food, land, and water systems in a climate crisis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Contributors\n",
+    "\n",
+    "**Hugo Retief**  \n",
+    "*Researcher*  \n",
+    "Email: [hugo@award.org.za](mailto:hugo@award.org.za)  \n",
+    "\n",
+    "**Victoria Neema**  \n",
+    "*Earth Observation Scientist*  \n",
+    "Email: [victoria.neema@digitalearthafrica.org](mailto:victoria.neema@digitalearthafrica.org)  \n",
+    "\n",
+    "**Kayathri Vigneswaran**  \n",
+    "*Junior Data Scientist*  \n",
+    "Email: [v.kayathri@cgiar.org](mailto:v.kayathri@cgiar.org)  \n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(datacube.__version__)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Last Tested:**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "datetime.today().strftime('%Y-%m-%d')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}