
Commit 9eef54c: Final code updates
1 parent 8738735
14 files changed (+874, -163 lines)

README.md

Lines changed: 73 additions & 23 deletions
@@ -4,7 +4,7 @@ Repository for code and results of NGI's contribution to the 1st International U
 
 ## Overview
 
-This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes comprehensive data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.
+This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.
 
 ## Project Structure
 
@@ -14,14 +14,18 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 │ ├── optimize_hyperparameters.py # Single model hyperparameter optimization
 │ ├── train.py # Model training with best hyperparameters
 │ ├── preprocess.py # Data preprocessing pipeline
-│ ├── select_features.py # Feature selection utilities
+│ ├── undersampling_cost_analysis.py # Analyze undersampling fraction vs prediction cost
+│ ├── generate_cv_results_table.py # Generate results table from YAML files
+│ ├── analyze_collapse_sections.py # Analyze collapse section distributions
+│ ├── permutation_feature_importance.py # Feature importance analysis
 │ ├── EDA.py # Exploratory data analysis
 │ └── config/ # Hydra configuration files
 │ └── main.yaml # Main configuration (CV folds, sampling, MLflow)
 ├── src/tbm_ml/ # Core ML library
-│ ├── hyperparameter_optimization.py # Optuna-based hyperparameter search
+│ ├── hyperparameter_optimization.py # Optuna-based hyperparameter search with StratifiedGroupKFold
 │ ├── train_eval_funcs.py # Training pipelines and evaluation
 │ ├── preprocess_funcs.py # Data preprocessing functions
+│ ├── collapse_section_split.py # Collapse section-aware train/test splitting
 │ ├── plotting.py # Visualization utilities
 │ └── schema_config.py # Pydantic configuration schemas
 ├── data/ # Dataset storage
@@ -31,21 +35,24 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 ├── experiments/ # Experiment outputs
 │ ├── mlruns/ # MLflow tracking data
 │ └── hyperparameters/ # Best hyperparameter YAML files
+├── analyses/ # Analysis outputs
+│ ├── undersampling_analysis/ # Cost vs undersampling fraction analysis
+│ └── profile_report/ # Data profiling reports
 └── docs/ # Documentation
-
 ```
 
 ## Key Features
 
 ### Data Preprocessing
 
 - **Outlier Detection**: Isolation Forest-based outlier removal
+- **Collapse Section Tracking**: Identifies and indexes continuous collapse sections for group-aware CV
 - **Class Imbalance Handling**:
   - RandomUnderSampler for majority class reduction (ratio-based or absolute)
-  - SMOTE for minority class oversampling
+  - SMOTE for minority class oversampling (not used in final models)
   - Configurable undersampling ratio (e.g., 1.0 = 1:1 majority:minority ratio)
 - **Feature Scaling**: StandardScaler normalization
-- **Train/Test Split**: Stratified split preserving class distributions
+- **Train/Test Split**: Section-aware split (sections 1-15 train, 16-18 test) preserving collapse section integrity
 
 ### Model Support
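The ratio-based undersampling described in the bullets above can be pictured with a minimal sketch using imbalanced-learn's RandomUnderSampler. This is illustrative only and not code from this commit; the helper name and the ratio conversion are assumptions.

```python
# Sketch (not from the repository): ratio-based majority-class undersampling.
# `undersample_ratio` follows the README convention (majority:minority,
# 1.0 = 1:1, 2.0 = 2:1); RandomUnderSampler's sampling_strategy expects the
# inverse, i.e. the desired minority:majority ratio after resampling.
from imblearn.under_sampling import RandomUnderSampler


def undersample_majority(X, y, undersample_ratio: float = 1.0, seed: int = 42):
    sampler = RandomUnderSampler(
        sampling_strategy=1.0 / undersample_ratio,  # minority / majority
        random_state=seed,
    )
    return sampler.fit_resample(X, y)
```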

@@ -59,8 +66,9 @@ The framework supports 11 machine learning models:
 ### Hyperparameter Optimization
 
 - **Framework**: Optuna with TPE (Tree-structured Parzen Estimator) sampler
-- **Cross-Validation**: Stratified K-Fold (configurable, default 5-fold)
-- **Metrics**: Balanced accuracy (primary), precision, recall, F1-score
+- **Cross-Validation**: StratifiedGroupKFold (5-fold) to keep collapse sections together during CV
+- **Metrics**: Cost-based optimization (minimizing expected prediction cost), balanced accuracy, precision, recall, F1-score
+- **Cost Matrix**: Configurable costs for TN, FP, FN, TP with time per regular advance multiplier
 - **Trials**: 100 trials per model (configurable)
 - **Logging**: MLflow integration for experiment tracking and visualization
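A minimal sketch of how a group-aware, cost-based CV objective of this kind can be computed, assuming the cost values shown in the configuration section further down and a collapse-section index used as the CV grouping variable. The function names are illustrative, not taken from the repository.

```python
# Sketch (not from the repository): StratifiedGroupKFold CV scored by expected
# prediction cost. `groups` is the collapse-section index so that a collapse
# section never straddles the train/validation boundary of a fold.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedGroupKFold

COSTS = {"tn": 1.0, "fp": 10.0, "fn": 240.0, "tp": 10.0}  # hours, as in cost_matrix


def expected_cost(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tn * COSTS["tn"] + fp * COSTS["fp"] + fn * COSTS["fn"] + tp * COSTS["tp"]
    return total / len(y_true)  # mean cost per advance


def cv_cost(model, X, y, groups, n_splits: int = 5) -> float:
    cv = StratifiedGroupKFold(n_splits=n_splits)
    fold_costs = []
    for train_idx, val_idx in cv.split(X, y, groups):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        fold_costs.append(expected_cost(y[val_idx], fitted.predict(X[val_idx])))
    return float(np.mean(fold_costs))  # the quantity Optuna would minimize
```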

@@ -95,6 +103,14 @@ uv sync
 
 ## Usage
 
+### 0. Run the preliminary tests script
+
+Note: This is just to generate the merged csv file used in subsequent steps.
+
+```bash
+python scripts/preliminary_tests.py
+```
+
 ### 1. Data Preprocessing
 
 ```bash
@@ -127,7 +143,20 @@ python scripts/train.py
 
 Trains models using the best hyperparameters found during optimization.
 
-### 4. MLflow UI
+### 4. Undersampling Cost Analysis
+
+```bash
+python scripts/undersampling_cost_analysis.py
+```
+
+Analyzes how different undersampling fractions affect prediction cost. Generates:
+
+- Cost vs undersampling fraction plots (CV and test set)
+- Accuracy metrics plots (CV and test set)
+- Confusion matrices for different undersampling ratios
+- Results CSV with optimal undersampling fraction
+
+### 5. MLflow UI
 
 ```bash
 python -m mlflow ui --backend-store-uri ./experiments/mlruns --port 5000
@@ -142,41 +171,51 @@ Main configuration file: `scripts/config/main.yaml`
 Key parameters:
 
 ```yaml
-data:
-  train_path: "data/model_ready/dataset_train.csv"
-  test_path: "data/model_ready/dataset_test.csv"
-
-preprocessing:
+experiment:
   undersample_level: null # Absolute number (overrides ratio if set), or null to use ratio
   undersample_ratio: 1.0 # Majority:minority ratio (1.0 = 1:1, 2.0 = 2:1, etc.)
   oversample_level: 0 # Minority class SMOTE (0=disabled)
-  outlier_removal: true
-
-optimization:
+
+# Cost matrix for prediction cost calculation (in hours per regular advance)
+cost_matrix:
+  tn_cost: 1.0 # True Regular (correct prediction of regular)
+  fp_cost: 10.0 # False Collapse (false alarm)
+  fn_cost: 240.0 # False Regular (missed collapse - most costly)
+  tp_cost: 10.0 # True Collapse (correct prediction of collapse)
+  time_per_regular_advance: 1.0 # Time unit multiplier
+
+optuna:
   n_trials: 100 # Optuna trials per model
-  cv_folds: 5 # Cross-validation folds
-  metric: "balanced_accuracy"
+  cv_folds: 5 # Cross-validation folds (uses StratifiedGroupKFold)
+  path_results: experiments/hyperparameters
 
 mlflow:
-  path: "./experiments/mlruns"
+  path: ./experiments/mlruns
   experiment_name: null # Set programmatically by scripts
 ```
 
 ## Models Configuration
 
-Edit `scripts/batch_optimize.py` to select models:
+The models are configured in `scripts/batch_optimize.py`:
 
 ```python
-MODELS = [
+# Models that undergo hyperparameter optimization
+MODELS_TO_OPTIMIZE = [
     "xgboost",
     "random_forest",
     "extra_trees",
     "hist_gradient_boosting",
     "catboost",
     "lightgbm",
-    "logistic_regression",
     "svm",
     "knn",
+    "gaussian_process",
+    "logistic_regression",
+]
+
+# Baseline models (evaluated without optimization)
+BASELINE_MODELS = [
+    "dummy",  # Always predicts majority class
 ]
 ```
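For reference, a minimal sketch of what the "dummy" baseline listed above typically corresponds to in scikit-learn. This is illustrative; how the repository wires BASELINE_MODELS into its pipeline is not shown in this diff.

```python
# Sketch (not from the repository): a majority-class baseline. With strong
# class imbalance it reaches high plain accuracy while never predicting a
# collapse, which is why cost-based and balanced metrics are reported instead.
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")
# Usage sketch: baseline.fit(X_train, y_train); baseline.predict(X_test)
# returns the majority class for every sample.
```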

@@ -190,6 +229,17 @@ Best hyperparameters saved to `experiments/hyperparameters/`:
 best_hyperparameters_{model_name}_{timestamp}.yaml
 ```
 
+Each file contains:
+
+- Optimized hyperparameters
+- CV performance metrics (cost-based)
+- Test set metrics (accuracy, balanced accuracy, recall, precision, F1, cost)
+- Optimization metadata (trial number, duration, timestamp)
+
+### Analysis Outputs
+
+- `analyses/undersampling_analysis/{timestamp}/` - Undersampling analysis plots and data
+
 ### MLflow Artifacts
 
 - Optimization history plots
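A minimal sketch of reusing one of these YAML files, assuming the optimized hyperparameters sit under a top-level key. The key name "hyperparameters", the example filename, and the model choice are hypothetical placeholders, not the documented file format.

```python
# Sketch (not from the repository): load a saved best-hyperparameters file and
# rebuild a model from it. The "hyperparameters" key is an assumed layout.
from pathlib import Path

import yaml
from xgboost import XGBClassifier


def load_best_params(path: str) -> dict:
    content = yaml.safe_load(Path(path).read_text())
    return content.get("hyperparameters", content)  # assumed top-level key

# Usage sketch (hypothetical filename):
# params = load_best_params("experiments/hyperparameters/best_hyperparameters_xgboost_<timestamp>.yaml")
# model = XGBClassifier(**params)
```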

analyze_collapse_sections.py

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+import pandas as pd
+import numpy as np
+
+df = pd.read_csv('data/raw/combined_data.csv')
+
+# Sort by Chainage or Tunnellength to ensure proper ordering
+if 'Chainage' in df.columns:
+    df = df.sort_values('Chainage', ascending=False).reset_index(drop=True)
+elif 'Tunnellength [m]' in df.columns:
+    df = df.sort_values('Tunnellength [m]').reset_index(drop=True)
+
+print('Total samples:', len(df))
+print('Collapse samples:', df['collapse'].sum())
+print('Non-collapse samples:', (df['collapse'] == 0).sum())
+
+# Identify continuous collapse sections
+# A new section starts when current row is collapse (1) and previous row is not collapse (0)
+df['is_collapse'] = df['collapse'] == 1
+df['collapse_section'] = (df['is_collapse'] & (~df['is_collapse'].shift(1, fill_value=False))).cumsum()
+
+# Set section to 0 for non-collapse rows
+df.loc[~df['is_collapse'], 'collapse_section'] = 0
+
+print('\nCollapse sections found:', df[df['collapse_section'] > 0]['collapse_section'].nunique())
+print('\nCollapse section sizes:')
+section_sizes = df[df['collapse_section'] > 0].groupby('collapse_section').size()
+print(section_sizes)
+print('\nStatistics:')
+print('Mean section size:', section_sizes.mean())
+print('Median section size:', section_sizes.median())
+print('Min section size:', section_sizes.min())
+print('Max section size:', section_sizes.max())
+
+# Show some examples
+print('\n\nExample rows from first collapse section:')
+first_section = df[df['collapse_section'] == 1][['Chainage', 'Tunnellength [m]', 'collapse', 'collapse_section']].head(10)
+print(first_section)
+
+print('\n\nExample rows from second collapse section:')
+second_section = df[df['collapse_section'] == 2][['Chainage', 'Tunnellength [m]', 'collapse', 'collapse_section']].head(10)
+print(second_section)
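A tiny worked example (made-up data, not part of the commit) of the section-indexing trick in the script above, where a new section id is assigned at each 0 to 1 transition and non-collapse rows are reset to 0:

```python
# Toy illustration (made-up data) of the run-labeling trick used above.
import pandas as pd

toy = pd.DataFrame({"collapse": [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]})
is_c = toy["collapse"] == 1
toy["collapse_section"] = (is_c & ~is_c.shift(1, fill_value=False)).cumsum()
toy.loc[~is_c, "collapse_section"] = 0
print(toy["collapse_section"].tolist())  # [0, 1, 1, 0, 0, 2, 0, 3, 3, 3]
```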

scripts/EDA.py

Lines changed: 15 additions & 15 deletions
@@ -122,24 +122,24 @@ def main(cfg: DictConfig) -> None:
     console.print(table)
     console.print()
 
-    # Generate profile report
-    console.print("[bold green]Generating profile report...[/bold green]")
-    from pathlib import Path
-    try:
-        from ydata_profiling import ProfileReport
+    # # Generate profile report
+    # console.print("[bold green]Generating profile report...[/bold green]")
+    # from pathlib import Path
+    # try:
+    #     from ydata_profiling import ProfileReport
 
-        report_dir = Path("analyses/profile_report")
-        report_dir.mkdir(parents=True, exist_ok=True)
+    #     report_dir = Path("analyses/profile_report")
+    #     report_dir.mkdir(parents=True, exist_ok=True)
 
-        profile = ProfileReport(df, title="TBM Tunnel Data Profile Report", explorative=True)
-        report_path = report_dir / "tbm_data_profile_report.html"
-        profile.to_file(report_path)
+    #     profile = ProfileReport(df, title="TBM Tunnel Data Profile Report", explorative=True)
+    #     report_path = report_dir / "tbm_data_profile_report.html"
+    #     profile.to_file(report_path)
 
-        console.print(f"[bold green]Profile report saved to: {report_path}[/bold green]")
-    except ImportError as e:
-        console.print(f"[bold yellow]Warning: Could not generate profile report. Error: {e}[/bold yellow]")
-        console.print("[bold yellow]Try installing setuptools: uv add setuptools[/bold yellow]")
-        console.print()
+    #     console.print(f"[bold green]Profile report saved to: {report_path}[/bold green]")
+    # except ImportError as e:
+    #     console.print(f"[bold yellow]Warning: Could not generate profile report. Error: {e}[/bold yellow]")
+    #     console.print("[bold yellow]Try installing setuptools: uv add setuptools[/bold yellow]")
+    #     console.print()
 
 
 if __name__ == "__main__":
scripts/analyze_collapse_sections.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+#!/usr/bin/env python3
+"""
+Analyze collapse sections in the TBM dataset.
+
+This script provides detailed analysis of continuous collapse sections,
+including distribution, sizes, and recommendations for train/test splitting.
+
+Usage:
+    python scripts/analyze_collapse_sections.py
+"""
+
+import pandas as pd
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def analyze_collapse_sections(data_path: str = "data/model_ready/dataset_total.csv"):
+    """Analyze collapse sections in the dataset."""
+
+    # Load data
+    df = pd.read_csv(data_path)
+
+    print("=" * 80)
+    print("COLLAPSE SECTION ANALYSIS")
+    print("=" * 80)
+
+    # Basic statistics
+    total_samples = len(df)
+    collapse_samples = df['collapse'].sum()
+    non_collapse_samples = (df['collapse'] == 0).sum()
+
+    print(f"\nDataset Overview:")
+    print(f"  Total samples: {total_samples:,}")
+    print(f"  Collapse samples: {collapse_samples:,} ({collapse_samples/total_samples*100:.2f}%)")
+    print(f"  Non-collapse samples: {non_collapse_samples:,} ({non_collapse_samples/total_samples*100:.2f}%)")
+
+    # Collapse section analysis
+    collapse_df = df[df['collapse_section'] > 0]
+    n_sections = collapse_df['collapse_section'].nunique()
+
+    print(f"\nCollapse Sections:")
+    print(f"  Total continuous collapse sections: {n_sections}")
+
+    # Section sizes
+    section_sizes = collapse_df.groupby('collapse_section').size().sort_values(ascending=False)
+
+    print(f"\nSection Size Statistics:")
+    print(f"  Mean: {section_sizes.mean():.1f} samples")
+    print(f"  Median: {section_sizes.median():.1f} samples")
+    print(f"  Min: {section_sizes.min()} samples (Section {section_sizes.idxmin()})")
+    print(f"  Max: {section_sizes.max()} samples (Section {section_sizes.idxmax()})")
+    print(f"  Std Dev: {section_sizes.std():.1f} samples")
+
+    print(f"\nAll Collapse Sections (sorted by size):")
+    print("-" * 40)
+    for section_id, size in section_sizes.items():
+        print(f"  Section {section_id:2d}: {size:3d} samples")
+
+    # Visualization
+    print("\n" + "=" * 80)
+    print("Generating visualization...")
+
+    fig, axes = plt.subplots(2, 1, figsize=(12, 8))
+
+    # Plot 1: Bar chart of section sizes
+    ax1 = axes[0]
+    section_sizes_sorted = section_sizes.sort_index()
+    ax1.bar(section_sizes_sorted.index, section_sizes_sorted.values,
+            color='#EF4927', alpha=0.7, edgecolor='black')
+    ax1.set_xlabel('Collapse Section ID', fontsize=11)
+    ax1.set_ylabel('Number of Samples', fontsize=11)
+    ax1.set_title('Collapse Section Sizes', fontsize=13, fontweight='bold')
+    ax1.grid(True, alpha=0.3, axis='y')
+    ax1.set_xticks(range(1, n_sections + 1))
+
+    # Add mean line
+    ax1.axhline(y=section_sizes.mean(), color='blue', linestyle='--',
+                linewidth=2, label=f'Mean: {section_sizes.mean():.1f}')
+    ax1.legend(fontsize=10)
+
+    # Plot 2: Cumulative distribution
+    ax2 = axes[1]
+    cumsum_sorted = section_sizes_sorted.cumsum()
+    ax2.plot(cumsum_sorted.index, cumsum_sorted.values,
+             marker='o', linewidth=2, markersize=6, color='#1B9AD7')
+    ax2.fill_between(cumsum_sorted.index, 0, cumsum_sorted.values, alpha=0.3, color='#1B9AD7')
+    ax2.set_xlabel('Collapse Section ID', fontsize=11)
+    ax2.set_ylabel('Cumulative Collapse Samples', fontsize=11)
+    ax2.set_title('Cumulative Distribution of Collapse Samples', fontsize=13, fontweight='bold')
+    ax2.grid(True, alpha=0.3)
+    ax2.set_xticks(range(1, n_sections + 1))
+
+    # Add 75% line for train/test split reference
+    target_75 = collapse_samples * 0.75
+    ax2.axhline(y=target_75, color='red', linestyle='--',
+                linewidth=2, label=f'75% split: {target_75:.0f} samples')
+    ax2.legend(fontsize=10)
+
+    plt.tight_layout()
+
+    # Save plot
+    output_path = Path("analyses") / "collapse_section_analysis.png"
+    output_path.parent.mkdir(exist_ok=True, parents=True)
+    plt.savefig(output_path, dpi=300, bbox_inches='tight')
+    print(f"Visualization saved to: {output_path}")
+
+    # Save detailed report
+    report_path = Path("analyses") / "collapse_section_report.csv"
+    section_report = pd.DataFrame({
+        'collapse_section': section_sizes.index,
+        'n_samples': section_sizes.values,
+        'cumulative_samples': section_sizes.sort_index().cumsum().values,
+        'percentage_of_total_collapses': (section_sizes.values / collapse_samples * 100).round(2)
+    })
+    section_report.to_csv(report_path, index=False)
+    print(f"Detailed report saved to: {report_path}")
+
+    print("\n" + "=" * 80)
+    print("Analysis complete!")
+    print("=" * 80)
+
+
+if __name__ == "__main__":
+    analyze_collapse_sections()
