Skip to content
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
859 changes: 859 additions & 0 deletions Use_cases/Limpopo_River_Basin/01_dam_volume_data_preprocessing.ipynb

Large diffs are not rendered by default.

396 changes: 396 additions & 0 deletions Use_cases/Limpopo_River_Basin/02_dam_volume_model_training.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,396 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modelling dam volumes using DEA waterbodies\n",
"# Section 02 : *Model Training*\n",
"**Products used:** \n",
"[DE Africa Waterbodies](https://docs.digitalearthafrica.org/en/latest/data_specs/Waterbodies_specs.html), \n",
"[Department of Water Affairs and Sanitation, South Africa Dam Level and Volume Data](https://www.dws.gov.za/Hydrology/Verified/hymain.aspx)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Background\n",
"\n",
"### Digital Twin (DT)\n",
"The CGIAR Digital Twin initiative creates dynamic virtual models that combine real-time data, AI, and simulations to improve decision-making. Its prototype for the Limpopo River Basin focuses on enhancing water resource management and conservation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Description\n",
"This notebook presents a workflow for predicting dam levels and volumes using water surface area data from DE Africa's Waterbodies product, integrating data preprocessing, feature extraction, and Gradient Boosting modeling. \n",
"\n",
"As part of the CGIAR Initiative on Digital Innovation, this work contributes to a prototype Digital Twin for the Limpopo River Basin, designed to support real-time decision-making in water management. The Digital Twin leverages AI-driven tools to visualize and simulate the impact of decisions on the basin's ecosystem. To enhance prediction reliability, the model includes a correction mechanism to address unrealistic large drops in dam volume estimates.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting started\n",
"\n",
"To run this analysis, run all the cells in the notebook, starting with the \"Load packages\" cell. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load packages\n",
"Import Python packages that are used for the analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"import xarray as xr\n",
"import seaborn as sns\n",
"import datacube\n",
"import joblib\n",
"\n",
"from scipy import interpolate\n",
"from scipy.optimize import curve_fit\n",
"from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score\n",
"from sklearn.preprocessing import StandardScaler, PolynomialFeatures\n",
"from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_percentage_error\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.ensemble import GradientBoostingRegressor\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import plotly.graph_objs as go\n",
"import plotly.express as px\n",
"import plotly.graph_objects as go\n",
"\n",
"from deafrica_tools.waterbodies import get_waterbody, get_time_series, display_time_series\n",
"from IPython.display import Image"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Model Training Datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### Sample Data Overview\n",
"\n",
"This dataset contains raw water levels data collected from DEA (Department of Environmental Affairs) in South Africa. The cell below reads this ancillary data necessary to conduct the volume prediction."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preparation and Feature Selection\n",
"This section involves selecting the relevant features and handling missing values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"merged_data = pd.read_csv(\"data/preprocess_data.csv\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = merged_data[['calculated_level', 'water_area_ha']]\n",
"target = merged_data['Dam_Level']\n",
"print(f\"Initial data shape: {features.shape}\")\n",
"print(f\"Missing values in features:\\n{features.isnull().sum()}\")\n",
"\n",
"imputer = SimpleImputer(strategy='mean')\n",
"features_imputed = imputer.fit_transform(features)\n",
"\n",
"features_imputed_df = pd.DataFrame(features_imputed, columns=features.columns)\n",
"print(f\"Data shape after imputing missing values: {features_imputed_df.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting Data for Training and Testing\n",
"Here, we split the data into training and test sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(features_imputed_df, target, test_size=0.2, random_state=42)\n",
"print(f\"Training set size: {X_train.shape[0]} rows\")\n",
"print(f\"Test set size: {X_test.shape[0]} rows\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Performing Cross-Validation\n",
"This step involves evaluating the model using cross-validation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Performing cross-validation...\")\n",
"gradient_boosting = GradientBoostingRegressor(random_state=42)\n",
"cv_scores = cross_val_score(gradient_boosting, X_train, y_train, cv=5, scoring='neg_mean_squared_error')\n",
"cv_rmse = np.sqrt(-cv_scores).mean()\n",
"print(f\"Cross-validated RMSE: {cv_rmse:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Hyperparameter Tuning\n",
"We perform hyperparameter tuning using GridSearchCV to find the best combination of parameters for the Gradient Boosting Regressor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {\n",
" 'n_estimators': [100, 200, 300],\n",
" 'learning_rate': [0.01, 0.1, 0.2],\n",
" 'max_depth': [3, 4, 5]\n",
"}\n",
"print(\"Performing hyperparameter tuning using GridSearchCV...\")\n",
"grid_search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring='neg_mean_squared_error')\n",
"grid_search.fit(X_train, y_train)\n",
"best_model = grid_search.best_estimator_\n",
"print(f\"Best parameters: {grid_search.best_params_}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training the Best Model\n",
"We now train the model with the optimal hyperparameters found in the previous step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"\\nTraining the best model with optimal hyperparameters...\")\n",
"best_model.fit(X_train, y_train)\n",
"print(\"Model training completed.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Evaluation\n",
"We evaluate the trained model on the test dataset using RMSE, MAPE, and R² Score as performance metrics."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred = best_model.predict(X_test)\n",
"mse = mean_squared_error(y_test, y_pred)\n",
"rmse = np.sqrt(mse)\n",
"r2 = r2_score(y_test, y_pred)\n",
"mape = mean_absolute_percentage_error(y_test, y_pred)\n",
"print(f\"\\nModel evaluation results:\\n - RMSE: {rmse:.4f}\\n - MAPE: {mape:.4f}\\n - R² Score: {r2:.4f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Feature Importances\n",
"We analyze the importance of each feature to understand how much each one contributed to the model's predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"\\nFeature Importances:\")\n",
"feature_importances = best_model.feature_importances_\n",
"for feature, importance in zip(features.columns, feature_importances):\n",
" print(f\"Feature: {feature}, Importance: {importance:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving the Trained Model\n",
"We save the trained model for future use."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"\\nSaving the trained model...\")\n",
"os.makedirs(\"trained_models\", exist_ok=True)\n",
"model_path = \"trained_models/gradient_boosting_model.pkl\"\n",
"joblib.dump(best_model, model_path)\n",
"print(f\"Trained model saved successfully at: {model_path}\")\n",
"y_pred_full = best_model.predict(features_imputed_df)\n",
"prediction = pd.Series(y_pred_full)\n",
"\n",
"#saving test and train data\n",
"prediction.to_csv(\"data/prediction_data.csv\")\n",
"target.to_csv(\"data/test_data.csv\")\n",
"print(\"test and train data is saved....\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"------------\n",
"## Additional information\n",
"\n",
"<b> License </b> The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).\n",
"\n",
"Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.\n",
"\n",
"<b> Contact </b> If you need assistance, please post a question on the [DE Africa Slack channel](https://digitalearthafrica.slack.com/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).\n",
"\n",
"If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).\n",
"\n",
"<b> Compatible datacube version </b>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**References:**\n",
"- Garcia Andarcia, M., Dickens, C., Silva, P., Matheswaran, K., & Koo, J. (2024). Digital Twin for management of water resources in the Limpopo River Basin: a concept. Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 4p.\n",
"- Gurusinghe, T., Muthuwatta, L., Matheswaran, K., & Dickens, C. (2024). Developing a foundational hydrological model for the Limpopo River Basin using the Soil and Water Assessment Tool Plus (SWAT+). Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation. 14p.\n",
"- Leitão, P. C., Santos, F., Barreiros, D., Santos, H., Silva, P., Madushanka, T., Matheswaran, K., Mutuwatte, L., Vickneswaran, K., Retief, H., Dickens, C., Garcia Andarcia, M. (2024). Operational SWAT+ Model: Advancing Seasonal Forecasting in the Limpopo River Basin. Colombo, Sri Lanka: International Water Management Institute (IWMI). CGIAR Initiative on Digital Innovation.\n",
"- Mallick, Archita & Ghosh, Surajit & De Sarkar, Kounik & Roy, Sudip. (2024). Reservoir Water Level Forecasting using Deep Learning Technique. Conference: 2024 IEEE India Geoscience and Remote Sensing Symposium"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Embed into project background: \n",
"The CGIAR Digital Innovation Initiative accelerates the transformation towards sustainable and inclusive agrifood systems by generating research-based evidence and innovative digital solutions. It is one of 32 initiatives of CGIAR, a global research partnership for a food-secure future, dedicated to transforming food, land, and water systems in a climate crisis."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Contributors\n",
"\n",
"**Hugo Retief** \n",
"*Researcher* \n",
"Email: [hugo@award.org.za](mailto:hugo@award.org.za) \n",
"\n",
"**Victoria Neema** \n",
"*Earth Observation Scientist* \n",
"Email: [victoria.neema@digitalearthafrica.org](mailto:victoria.neema@digitalearthafrica.org) \n",
"\n",
"**Kayathri Vigneswaran** \n",
"*Junior Data Scientist* \n",
"Email: [v.kayathri@cgiar.org](mailto:v.kayathri@cgiar.org) \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(datacube.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Last Tested:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"datetime.today().strftime('%Y-%m-%d')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Loading