@@ -4,7 +4,7 @@ Repository for code and results of NGI's contribution to the 1st International U

 ## Overview

-This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes comprehensive data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.
+This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.

 ## Project Structure

@@ -14,14 +14,18 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 │   ├── optimize_hyperparameters.py       # Single model hyperparameter optimization
 │   ├── train.py                          # Model training with best hyperparameters
 │   ├── preprocess.py                     # Data preprocessing pipeline
-│   ├── select_features.py                # Feature selection utilities
+│   ├── undersampling_cost_analysis.py    # Analyze undersampling fraction vs prediction cost
+│   ├── generate_cv_results_table.py      # Generate results table from YAML files
+│   ├── analyze_collapse_sections.py      # Analyze collapse section distributions
+│   ├── permutation_feature_importance.py # Feature importance analysis
 │   ├── EDA.py                            # Exploratory data analysis
 │   └── config/                           # Hydra configuration files
 │       └── main.yaml                     # Main configuration (CV folds, sampling, MLflow)
 ├── src/tbm_ml/                           # Core ML library
-│   ├── hyperparameter_optimization.py    # Optuna-based hyperparameter search
+│   ├── hyperparameter_optimization.py    # Optuna-based hyperparameter search with StratifiedGroupKFold
 │   ├── train_eval_funcs.py               # Training pipelines and evaluation
 │   ├── preprocess_funcs.py               # Data preprocessing functions
+│   ├── collapse_section_split.py         # Collapse section-aware train/test splitting
 │   ├── plotting.py                       # Visualization utilities
 │   └── schema_config.py                  # Pydantic configuration schemas
 ├── data/                                 # Dataset storage
@@ -31,21 +35,24 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 ├── experiments/                          # Experiment outputs
 │   ├── mlruns/                           # MLflow tracking data
 │   └── hyperparameters/                  # Best hyperparameter YAML files
+├── analyses/                             # Analysis outputs
+│   ├── undersampling_analysis/           # Cost vs undersampling fraction analysis
+│   └── profile_report/                   # Data profiling reports
 └── docs/                                 # Documentation
-
 ```

 ## Key Features

 ### Data Preprocessing

 - **Outlier Detection**: Isolation Forest-based outlier removal
+- **Collapse Section Tracking**: Identifies and indexes continuous collapse sections for group-aware CV (see the sketch after this list)
 - **Class Imbalance Handling**:
   - RandomUnderSampler for majority class reduction (ratio-based or absolute)
-  - SMOTE for minority class oversampling
+  - SMOTE for minority class oversampling (not used in the final models)
   - Configurable undersampling ratio (e.g., 1.0 = 1:1 majority:minority ratio)
 - **Feature Scaling**: StandardScaler normalization
-- **Train/Test Split**: Stratified split preserving class distributions
+- **Train/Test Split**: Section-aware split (sections 1-15 train, 16-18 test) preserving collapse section integrity
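+
+How the imbalance handling and section indexing fit together, as a minimal sketch (assumes imbalanced-learn and scikit-learn; the function names and the ratio convention are illustrative, not the repo's actual API):
+
+```python
+import pandas as pd
+from imblearn.under_sampling import RandomUnderSampler
+from sklearn.preprocessing import StandardScaler
+
+def index_collapse_sections(collapse: pd.Series) -> pd.Series:
+    """Give each continuous run of collapse rows its own section id (0 = regular)."""
+    starts = (collapse == 1) & (collapse.shift(fill_value=0) == 0)
+    return starts.cumsum().where(collapse == 1, 0)
+
+def undersample_and_scale(X, y, ratio=1.0):
+    """Undersample the majority class to ratio:1, then standardize the features."""
+    # imblearn's float sampling_strategy is minority/majority, the inverse of majority:minority
+    rus = RandomUnderSampler(sampling_strategy=1.0 / ratio, random_state=42)
+    X_res, y_res = rus.fit_resample(X, y)
+    scaler = StandardScaler().fit(X_res)
+    return scaler.transform(X_res), y_res, scaler
+```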

 ### Model Support

@@ -59,8 +66,9 @@ The framework supports 11 machine learning models:
 ### Hyperparameter Optimization

 - **Framework**: Optuna with TPE (Tree-structured Parzen Estimator) sampler
-- **Cross-Validation**: Stratified K-Fold (configurable, default 5-fold)
-- **Metrics**: Balanced accuracy (primary), precision, recall, F1-score
+- **Cross-Validation**: StratifiedGroupKFold (5-fold) to keep collapse sections together during CV
+- **Metrics**: Cost-based optimization (minimizing expected prediction cost), plus balanced accuracy, precision, recall, and F1-score (see the sketch after this list)
+- **Cost Matrix**: Configurable costs for TN, FP, FN, and TP, with a time-per-regular-advance multiplier
 - **Trials**: 100 trials per model (configurable)
 - **Logging**: MLflow integration for experiment tracking and visualization

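+In outline, the optimization loop ties these pieces together as follows. This is a sketch only: it assumes preprocessed NumPy arrays `X`, `y`, and a per-row `groups` array of collapse section indices, and the hyperparameter names and ranges are illustrative rather than the repo's actual search space:
+
+```python
+import numpy as np
+import optuna
+from sklearn.metrics import confusion_matrix
+from sklearn.model_selection import StratifiedGroupKFold
+from xgboost import XGBClassifier
+
+COSTS = {"tn": 1.0, "fp": 10.0, "fn": 240.0, "tp": 10.0}  # mirrors the cost_matrix config below
+
+def prediction_cost(y_true, y_pred):
+    """Total cost (in hours) of a set of predictions under the cost matrix."""
+    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
+    return tn * COSTS["tn"] + fp * COSTS["fp"] + fn * COSTS["fn"] + tp * COSTS["tp"]
+
+def objective(trial):
+    params = {
+        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
+        "max_depth": trial.suggest_int("max_depth", 2, 10),
+        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
+    }
+    cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
+    costs = []
+    # Passing groups keeps every collapse section inside a single fold
+    for tr, va in cv.split(X, y, groups=groups):
+        model = XGBClassifier(**params).fit(X[tr], y[tr])
+        costs.append(prediction_cost(y[va], model.predict(X[va])))
+    return float(np.mean(costs))
+
+study = optuna.create_study(direction="minimize")  # TPE is Optuna's default sampler
+study.optimize(objective, n_trials=100)
+```
+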
@@ -95,6 +103,14 @@ uv sync

 ## Usage

+### 0. Run the preliminary tests script
+
+Note: this step only generates the merged CSV file used in the subsequent steps.
+
+```bash
+python scripts/preliminary_tests.py
+```
+
 ### 1. Data Preprocessing

 ```bash
@@ -127,7 +143,20 @@ python scripts/train.py

 Trains models using the best hyperparameters found during optimization.

-### 4. MLflow UI
+### 4. Undersampling Cost Analysis
+
+```bash
+python scripts/undersampling_cost_analysis.py
+```
+
+Analyzes how different undersampling fractions affect prediction cost. Generates:
+
+- Cost vs undersampling fraction plots (CV and test set)
+- Accuracy metrics plots (CV and test set)
+- Confusion matrices for different undersampling ratios
+- Results CSV with optimal undersampling fraction
+
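+Conceptually, the sweep loops over candidate fractions and compares costs (illustrative only: it reuses the `undersample_and_scale` and `prediction_cost` helpers sketched earlier, and `best_model` plus the train/validation arrays are assumed to exist):
+
+```python
+import numpy as np
+import pandas as pd
+
+rows = []
+for ratio in np.linspace(1.0, 10.0, 10):  # candidate majority:minority ratios
+    # Resample only the training data; cost is measured on untouched validation data
+    X_res, y_res, scaler = undersample_and_scale(X_train, y_train, ratio=ratio)
+    model = best_model.fit(X_res, y_res)  # hypothetical tuned estimator
+    cost = prediction_cost(y_val, model.predict(scaler.transform(X_val)))
+    rows.append({"ratio": ratio, "cost": cost})
+
+results = pd.DataFrame(rows)
+print(results.loc[results["cost"].idxmin()])  # lowest-cost undersampling fraction
+```
+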
+### 5. MLflow UI

 ```bash
 python -m mlflow ui --backend-store-uri ./experiments/mlruns --port 5000
@@ -142,41 +171,51 @@ Main configuration file: `scripts/config/main.yaml`
 Key parameters:

 ```yaml
-data:
-  train_path: "data/model_ready/dataset_train.csv"
-  test_path: "data/model_ready/dataset_test.csv"
-
-preprocessing:
+experiment:
   undersample_level: null  # Absolute number (overrides ratio if set), or null to use ratio
   undersample_ratio: 1.0   # Majority:minority ratio (1.0 = 1:1, 2.0 = 2:1, etc.)
   oversample_level: 0      # Minority class SMOTE (0 = disabled)
-  outlier_removal: true
-
-optimization:
+
+  # Cost matrix for prediction cost calculation (in hours per regular advance)
+  cost_matrix:
+    tn_cost: 1.0                   # True Regular (correct prediction of regular)
+    fp_cost: 10.0                  # False Collapse (false alarm)
+    fn_cost: 240.0                 # False Regular (missed collapse, the most costly outcome)
+    tp_cost: 10.0                  # True Collapse (correct prediction of collapse)
+    time_per_regular_advance: 1.0  # Time unit multiplier
+
+optuna:
   n_trials: 100  # Optuna trials per model
-  cv_folds: 5    # Cross-validation folds
-  metric: "balanced_accuracy"
+  cv_folds: 5    # Cross-validation folds (uses StratifiedGroupKFold)
+  path_results: experiments/hyperparameters

 mlflow:
-  path: "./experiments/mlruns"
+  path: ./experiments/mlruns
   experiment_name: null  # Set programmatically by scripts
 ```
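+
+To make the cost matrix concrete, here is a worked example of how a confusion matrix presumably maps to a total prediction cost; the counts are invented, and the repo's exact formula (e.g., any normalization) may differ:
+
+```python
+# Hypothetical test-set confusion matrix counts
+tn, fp, fn, tp = 900, 50, 5, 45
+
+cost = (tn * 1.0      # true regulars: normal advance time
+        + fp * 10.0   # false alarms: unnecessary mitigation
+        + fn * 240.0  # missed collapses: dominate the total
+        + tp * 10.0)  # caught collapses: mitigation time
+print(cost * 1.0)     # time_per_regular_advance multiplier -> 3050.0 hours
+```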

 ## Models Configuration

-Edit `scripts/batch_optimize.py` to select models:
+The models are configured in `scripts/batch_optimize.py`:

 ```python
-MODELS = [
+# Models that undergo hyperparameter optimization
+MODELS_TO_OPTIMIZE = [
     "xgboost",
     "random_forest",
     "extra_trees",
     "hist_gradient_boosting",
     "catboost",
     "lightgbm",
-    "logistic_regression",
     "svm",
     "knn",
+    "gaussian_process",
+    "logistic_regression",
+]
+
+# Baseline models (evaluated without optimization)
+BASELINE_MODELS = [
+    "dummy",  # Always predicts the majority class
 ]
 ```

@@ -190,6 +229,17 @@ Best hyperparameters saved to `experiments/hyperparameters/`:
 best_hyperparameters_{model_name}_{timestamp}.yaml
 ```

+Each file contains:
+
+- Optimized hyperparameters
+- CV performance metrics (cost-based)
+- Test set metrics (accuracy, balanced accuracy, recall, precision, F1, cost)
+- Optimization metadata (trial number, duration, timestamp)
+
+### Analysis Outputs
+
+- `analyses/undersampling_analysis/{timestamp}/` - Undersampling analysis plots and data
+
 ### MLflow Artifacts

 - Optimization history plots