CPT to soil type is a machine learning tool to predict the granular soil type from cone penetration test (CPT) data. The following soil types are supported:
- 1: gravel
- 2: fine grained organic soils
- 3: coarse grained organic soils
- 4: sand to gravel
- 5: sand
- 6: silt to fine sand
- 7: clay to silt
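As a small illustration, the class IDs above can be kept in a lookup table so that numeric predictions are turned back into readable labels. This is a minimal sketch: `SOIL_TYPES` and `soil_type_name` are hypothetical names and are not taken from the project's codebase.

```python
# Class IDs and labels copied from the list above.
SOIL_TYPES = {
    1: "gravel",
    2: "fine grained organic soils",
    3: "coarse grained organic soils",
    4: "sand to gravel",
    5: "sand",
    6: "silt to fine sand",
    7: "clay to silt",
}


def soil_type_name(class_id: int) -> str:
    """Return the label for a soil class ID (hypothetical helper)."""
    try:
        return SOIL_TYPES[class_id]
    except KeyError:
        # Fail loudly on IDs outside the supported range 1..7.
        raise ValueError(f"Unknown soil class ID: {class_id}") from None


# Example: decode a list of predicted class IDs.
predicted_ids = [5, 7, 1]
predicted_labels = [soil_type_name(i) for i in predicted_ids]
```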
The developed model uses an XGBoost classifier, trained on the Oberhollenzer dataset: https://doi.org/10.1016/j.dib.2020.106618. The dataset can be downloaded from here.
The code is developed for use in an applied machine learning course at the Norwegian Geotechnical Institute (NGI): https://www.ngi.no/. For more information about geotechnical data and applied machine learning, check out this NGI course: Introduction to Applied Machine Learning - Using Geotechnical Data
The implementation of this end-to-end machine learning project focuses on reproducibility, trustworthiness, and readability. The project structure aims to be a good starting point for developing machine learning models for geotechnical engineering applications.
Some of the key features of the project are:
- The project is developed as a classic software project, with functionality structured as a Python package in the `src` directory and entry points in the `scripts` directory.
- Use of the uv package manager for dependency management and Python versions.
- For several topics, the code is first demonstrated in Jupyter notebooks for educational purposes and later refactored into the main codebase.
- Use of the Hydra configuration framework for easy configuration of the model and training parameters.
- Use of MLflow for tracking experiments and model parameters.
- Use of ydata-profiling for Exploratory Data Analysis (EDA).
- Use of the XGBoost library (optionally with GPU support) for training the machine learning model.
- Use of the scikit-learn library for data preprocessing, ML algorithms (other than XGBoost), and evaluation of the model.
- Use of PyOD for outlier detection.
- Use of imblearn for handling imbalanced datasets.
- Use of pydantic for data validation.
- Use of Optuna for hyperparameter optimization.
- Use of the Streamlit library for developing interactive web applications.
- Use of Ruff and isort for code formatting.
- Clone the repository:
  git clone <repo url>
  cd CPT-to-soiltype
- Install uv: Follow the instructions at https://uv.dev/ to install uv.
- Install dependencies:
  uv sync
- Download the Oberhollenzer dataset from here and place the file `CPT_PremstallerGeotechnik_revised.csv` in the `data/raw/` directory.
- Preprocess the data:
  uv run python scripts/preprocess.py
- Optimise hyperparameters:
  uv run python scripts/optimise_hyperparameters.py
- Select features:
  uv run python scripts/select_features.py
- Train the model:
  uv run python scripts/train.py

Use Hydra configuration options with train.py to specify the model and training parameters. For example, to train a model with a KNN classifier, run:
uv run python scripts/train.py model=knn
See options with:
uv run python scripts/train.py --help
Other entry-point scripts are:
uv run python scripts/preprocess.py
# hyperparameter optimization is only implemented for the XGBoost model
uv run python scripts/optimise_hyperparameters.py

Start the MLflow UI from the experiments directory:
cd experiments
uv run mlflow ui
Open the web interface on your local machine.
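The entry-point scripts listed above can also be chained from a single helper. The following is a minimal sketch, not part of the project: `build_commands` and `run_pipeline` are hypothetical names, the step order simply follows the order in which the scripts appear in this README, and Hydra-style overrides such as `model=knn` are assumed to apply to `train.py` only.

```python
import subprocess

OPTIMISE_SCRIPT = "scripts/optimise_hyperparameters.py"

# Script paths as listed in this README; the order is an assumption.
PIPELINE_SCRIPTS = [
    "scripts/preprocess.py",
    OPTIMISE_SCRIPT,
    "scripts/select_features.py",
    "scripts/train.py",
]


def build_commands(train_overrides=None, optimise=True):
    """Return one `uv run python <script>` command per pipeline step."""
    scripts = [s for s in PIPELINE_SCRIPTS if optimise or s != OPTIMISE_SCRIPT]
    commands = [["uv", "run", "python", script] for script in scripts]
    # Overrides (e.g. "model=knn") are appended to the training step only.
    commands[-1].extend(train_overrides or [])
    return commands


def run_pipeline(train_overrides=None, optimise=True, dry_run=False):
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands(train_overrides, optimise):
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)
```

For example, `run_pipeline(["model=knn"], optimise=False, dry_run=True)` prints the commands without executing them. The optimisation step is skipped here because hyperparameter optimization is only implemented for the XGBoost model.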
Before starting the Streamlit app (Main.py), make sure you have completed the following steps:
- Install dependencies:
  uv sync
- Preprocess the data:
  uv run python scripts/preprocess.py
  This will generate model-ready data in `data/model_ready/`.
- Train the model:
  uv run python scripts/train.py
  This will save the trained model to `models/xgb_model.json`.
  Optionally, optimize hyperparameters:
  uv run python scripts/optimise_hyperparameters.py

Then you can run the Streamlit app:
- Run the Streamlit app:
  uv run streamlit run Main.py --server.port 8503
These steps ensure the app has the necessary data and model files to function correctly.
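A quick way to confirm these prerequisites before launching the app is a small check on the two paths named above. This is a sketch under stated assumptions: `missing_prerequisites` is a hypothetical helper, not part of the project, and it only checks that the model-ready directory is non-empty and that the model file exists, not that their contents are valid.

```python
from pathlib import Path

# Paths taken from the steps above, relative to the repository root.
MODEL_READY_DIR = Path("data/model_ready")
MODEL_FILE = Path("models/xgb_model.json")


def missing_prerequisites(root="."):
    """Return descriptions of the files the app still needs (empty if none)."""
    root = Path(root)
    problems = []
    data_dir = root / MODEL_READY_DIR
    # The directory must exist and contain at least one entry.
    if not data_dir.is_dir() or not any(data_dir.iterdir()):
        problems.append(
            f"no model-ready data in {MODEL_READY_DIR.as_posix()}/ "
            "(run scripts/preprocess.py)"
        )
    if not (root / MODEL_FILE).is_file():
        problems.append(
            f"missing {MODEL_FILE.as_posix()} (run scripts/train.py)"
        )
    return problems


# Report anything that is missing relative to the current directory.
for problem in missing_prerequisites():
    print("Not ready:", problem)
```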
Streamlit Community Cloud expects a requirements.txt at the repository root. You can export one from your uv-managed project with pinned (locked) versions and without development-only dependencies:
uv export --no-dev --no-hashes -o requirements.txt
Notes:
- `--no-dev` excludes tools like ruff, isort, mypy from the deployment image.
- `--no-hashes` avoids requiring `pip --require-hashes` on the Streamlit platform.
- If you want to enforce the exact versions from `uv.lock`, you can add `--frozen`.
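To sanity-check the exported file before deploying, one option is to scan it for requirements that are not pinned with `==`. The sketch below assumes the export is a standard requirements.txt (one requirement per line, optional `;` environment markers, `#` comments); `unpinned_requirements` is a hypothetical helper, not a uv or Streamlit feature.

```python
def unpinned_requirements(text):
    """Return requirement lines that are not pinned with `==`."""
    unpinned = []
    for raw_line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split(" #", 1)[0].strip()
        if not line or line.startswith("#"):
            continue
        # Skip pip options such as `-e .` or `--index-url ...`.
        if line.startswith("-"):
            continue
        # Ignore environment markers after ';' when looking for the pin.
        requirement = line.split(";", 1)[0].strip()
        if "==" not in requirement:
            unpinned.append(requirement)
    return unpinned


# Made-up sample content, used only to illustrate the check.
sample = """\
# exported requirements
numpy==1.26.4 ; python_version >= '3.10'
    # via xgboost
pandas>=2.0
"""
loose = unpinned_requirements(sample)
```

In practice the text would be read from the exported file, for example with `Path("requirements.txt").read_text()`.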
Follow the instructions at https://docs.streamlit.io/streamlit-community-cloud/get-started/deploy-an-app to deploy the app.
The project is developed by Tom F. Hansen and Sjur Beyer.
For any questions or suggestions, please open an issue or contact us at [email protected].