
Commit 9eef54c: Final code updates
1 parent 8738735
14 files changed (+874, -163 lines)

README.md

Lines changed: 73 additions & 23 deletions
@@ -4,7 +4,7 @@ Repository for code and results of NGI's contribution to the 1st International U
 
 ## Overview
 
-This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes comprehensive data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.
+This project develops machine learning models for predicting Tunnel Boring Machine (TBM) collapse events using operational data from the Yinsong Water Diversion Project. The codebase includes data preprocessing, feature engineering, hyperparameter optimization, and model evaluation pipelines with MLflow experiment tracking.
 
 ## Project Structure
 
@@ -14,14 +14,18 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 │ ├── optimize_hyperparameters.py # Single model hyperparameter optimization
 │ ├── train.py # Model training with best hyperparameters
 │ ├── preprocess.py # Data preprocessing pipeline
-│ ├── select_features.py # Feature selection utilities
+│ ├── undersampling_cost_analysis.py # Analyze undersampling fraction vs prediction cost
+│ ├── generate_cv_results_table.py # Generate results table from YAML files
+│ ├── analyze_collapse_sections.py # Analyze collapse section distributions
+│ ├── permutation_feature_importance.py # Feature importance analysis
 │ ├── EDA.py # Exploratory data analysis
 │ └── config/ # Hydra configuration files
 │ └── main.yaml # Main configuration (CV folds, sampling, MLflow)
 ├── src/tbm_ml/ # Core ML library
-│ ├── hyperparameter_optimization.py # Optuna-based hyperparameter search
+│ ├── hyperparameter_optimization.py # Optuna-based hyperparameter search with StratifiedGroupKFold
 │ ├── train_eval_funcs.py # Training pipelines and evaluation
 │ ├── preprocess_funcs.py # Data preprocessing functions
+│ ├── collapse_section_split.py # Collapse section-aware train/test splitting
 │ ├── plotting.py # Visualization utilities
 │ └── schema_config.py # Pydantic configuration schemas
 ├── data/ # Dataset storage
@@ -31,21 +35,24 @@ This project develops machine learning models for predicting Tunnel Boring Machi
 ├── experiments/ # Experiment outputs
 │ ├── mlruns/ # MLflow tracking data
 │ └── hyperparameters/ # Best hyperparameter YAML files
+├── analyses/ # Analysis outputs
+│ ├── undersampling_analysis/ # Cost vs undersampling fraction analysis
+│ └── profile_report/ # Data profiling reports
 └── docs/ # Documentation
-
 ```
 
 ## Key Features
 
 ### Data Preprocessing
 
 - **Outlier Detection**: Isolation Forest-based outlier removal
+- **Collapse Section Tracking**: Identifies and indexes continuous collapse sections for group-aware CV
 - **Class Imbalance Handling**:
   - RandomUnderSampler for majority class reduction (ratio-based or absolute)
-  - SMOTE for minority class oversampling
+  - SMOTE for minority class oversampling (not used in final models)
   - Configurable undersampling ratio (e.g., 1.0 = 1:1 majority:minority ratio)
 - **Feature Scaling**: StandardScaler normalization
-- **Train/Test Split**: Stratified split preserving class distributions
+- **Train/Test Split**: Section-aware split (sections 1-15 train, 16-18 test) preserving collapse section integrity
 
 ### Model Support
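The ratio-based undersampling described in the bullets above can be pictured with a minimal sketch using imbalanced-learn's RandomUnderSampler. This is illustrative only and not code from this commit; the helper name and the ratio conversion are assumptions.

```python
# Sketch (not from the repository): ratio-based majority-class undersampling.
# `undersample_ratio` follows the README convention (majority:minority,
# 1.0 = 1:1, 2.0 = 2:1); RandomUnderSampler's sampling_strategy expects the
# inverse, i.e. the desired minority:majority ratio after resampling.
from imblearn.under_sampling import RandomUnderSampler


def undersample_majority(X, y, undersample_ratio: float = 1.0, seed: int = 42):
    sampler = RandomUnderSampler(
        sampling_strategy=1.0 / undersample_ratio,  # minority / majority
        random_state=seed,
    )
    return sampler.fit_resample(X, y)
```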

@@ -59,8 +66,9 @@ The framework supports 11 machine learning models:
 ### Hyperparameter Optimization
 
 - **Framework**: Optuna with TPE (Tree-structured Parzen Estimator) sampler
-- **Cross-Validation**: Stratified K-Fold (configurable, default 5-fold)
-- **Metrics**: Balanced accuracy (primary), precision, recall, F1-score
+- **Cross-Validation**: StratifiedGroupKFold (5-fold) to keep collapse sections together during CV
+- **Metrics**: Cost-based optimization (minimizing expected prediction cost), balanced accuracy, precision, recall, F1-score
+- **Cost Matrix**: Configurable costs for TN, FP, FN, TP with time per regular advance multiplier
 - **Trials**: 100 trials per model (configurable)
 - **Logging**: MLflow integration for experiment tracking and visualization
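A minimal sketch of how a group-aware, cost-based CV objective of this kind can be computed, assuming the cost values shown in the configuration section further down and a collapse-section index used as the CV grouping variable. The function names are illustrative, not taken from the repository.

```python
# Sketch (not from the repository): StratifiedGroupKFold CV scored by expected
# prediction cost. `groups` is the collapse-section index so that a collapse
# section never straddles the train/validation boundary of a fold.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedGroupKFold

COSTS = {"tn": 1.0, "fp": 10.0, "fn": 240.0, "tp": 10.0}  # hours, as in cost_matrix


def expected_cost(y_true, y_pred) -> float:
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = tn * COSTS["tn"] + fp * COSTS["fp"] + fn * COSTS["fn"] + tp * COSTS["tp"]
    return total / len(y_true)  # mean cost per advance


def cv_cost(model, X, y, groups, n_splits: int = 5) -> float:
    cv = StratifiedGroupKFold(n_splits=n_splits)
    fold_costs = []
    for train_idx, val_idx in cv.split(X, y, groups):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        fold_costs.append(expected_cost(y[val_idx], fitted.predict(X[val_idx])))
    return float(np.mean(fold_costs))  # the quantity Optuna would minimize
```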

@@ -95,6 +103,14 @@ uv sync
 
 ## Usage
 
+### 0. Run the preliminary tests script
+
+Note: This is just to generate the merged csv file used in subsequent steps.
+
+```bash
+python scripts/preliminary_tests.py
+```
+
 ### 1. Data Preprocessing
 
 ```bash
@@ -127,7 +143,20 @@ python scripts/train.py
 
 Trains models using the best hyperparameters found during optimization.
 
-### 4. MLflow UI
+### 4. Undersampling Cost Analysis
+
+```bash
+python scripts/undersampling_cost_analysis.py
+```
+
+Analyzes how different undersampling fractions affect prediction cost. Generates:
+
+- Cost vs undersampling fraction plots (CV and test set)
+- Accuracy metrics plots (CV and test set)
+- Confusion matrices for different undersampling ratios
+- Results CSV with optimal undersampling fraction
+
+### 5. MLflow UI
 
 ```bash
 python -m mlflow ui --backend-store-uri ./experiments/mlruns --port 5000
@@ -142,41 +171,51 @@ Main configuration file: `scripts/config/main.yaml`
 Key parameters:
 
 ```yaml
-data:
-  train_path: "data/model_ready/dataset_train.csv"
-  test_path: "data/model_ready/dataset_test.csv"
-
-preprocessing:
+experiment:
   undersample_level: null # Absolute number (overrides ratio if set), or null to use ratio
   undersample_ratio: 1.0 # Majority:minority ratio (1.0 = 1:1, 2.0 = 2:1, etc.)
   oversample_level: 0 # Minority class SMOTE (0=disabled)
-  outlier_removal: true
-
-optimization:
+
+# Cost matrix for prediction cost calculation (in hours per regular advance)
+cost_matrix:
+  tn_cost: 1.0 # True Regular (correct prediction of regular)
+  fp_cost: 10.0 # False Collapse (false alarm)
+  fn_cost: 240.0 # False Regular (missed collapse - most costly)
+  tp_cost: 10.0 # True Collapse (correct prediction of collapse)
+  time_per_regular_advance: 1.0 # Time unit multiplier
+
+optuna:
   n_trials: 100 # Optuna trials per model
-  cv_folds: 5 # Cross-validation folds
-  metric: "balanced_accuracy"
+  cv_folds: 5 # Cross-validation folds (uses StratifiedGroupKFold)
+  path_results: experiments/hyperparameters
 
 mlflow:
-  path: "./experiments/mlruns"
+  path: ./experiments/mlruns
   experiment_name: null # Set programmatically by scripts
 ```
 
 ## Models Configuration
 
-Edit `scripts/batch_optimize.py` to select models:
+The models are configured in `scripts/batch_optimize.py`:
 
 ```python
-MODELS = [
+# Models that undergo hyperparameter optimization
+MODELS_TO_OPTIMIZE = [
     "xgboost",
     "random_forest",
     "extra_trees",
     "hist_gradient_boosting",
     "catboost",
     "lightgbm",
-    "logistic_regression",
     "svm",
     "knn",
+    "gaussian_process",
+    "logistic_regression",
+]
+
+# Baseline models (evaluated without optimization)
+BASELINE_MODELS = [
+    "dummy",  # Always predicts majority class
 ]
 ```
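For reference, a minimal sketch of what the "dummy" baseline listed above typically corresponds to in scikit-learn. This is illustrative; how the repository wires BASELINE_MODELS into its pipeline is not shown in this diff.

```python
# Sketch (not from the repository): a majority-class baseline. With strong
# class imbalance it reaches high plain accuracy while never predicting a
# collapse, which is why cost-based and balanced metrics are reported instead.
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy="most_frequent")
# Usage sketch: baseline.fit(X_train, y_train); baseline.predict(X_test)
# returns the majority class for every sample.
```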

@@ -190,6 +229,17 @@ Best hyperparameters saved to `experiments/hyperparameters/`:
 best_hyperparameters_{model_name}_{timestamp}.yaml
 ```
 
+Each file contains:
+
+- Optimized hyperparameters
+- CV performance metrics (cost-based)
+- Test set metrics (accuracy, balanced accuracy, recall, precision, F1, cost)
+- Optimization metadata (trial number, duration, timestamp)
+
+### Analysis Outputs
+
+- `analyses/undersampling_analysis/{timestamp}/` - Undersampling analysis plots and data
+
 ### MLflow Artifacts
 
 - Optimization history plots
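A minimal sketch of reusing one of these YAML files, assuming the optimized hyperparameters sit under a top-level key. The key name "hyperparameters", the example filename, and the model choice are hypothetical placeholders, not the documented file format.

```python
# Sketch (not from the repository): load a saved best-hyperparameters file and
# rebuild a model from it. The "hyperparameters" key is an assumed layout.
from pathlib import Path

import yaml
from xgboost import XGBClassifier


def load_best_params(path: str) -> dict:
    content = yaml.safe_load(Path(path).read_text())
    return content.get("hyperparameters", content)  # assumed top-level key

# Usage sketch (hypothetical filename):
# params = load_best_params("experiments/hyperparameters/best_hyperparameters_xgboost_<timestamp>.yaml")
# model = XGBClassifier(**params)
```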

analyze_collapse_sections.py

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+import pandas as pd
+import numpy as np
+
+df = pd.read_csv('data/raw/combined_data.csv')
+
+# Sort by Chainage or Tunnellength to ensure proper ordering
+if 'Chainage' in df.columns:
+    df = df.sort_values('Chainage', ascending=False).reset_index(drop=True)
+elif 'Tunnellength [m]' in df.columns:
+    df = df.sort_values('Tunnellength [m]').reset_index(drop=True)
+
+print('Total samples:', len(df))
+print('Collapse samples:', df['collapse'].sum())
+print('Non-collapse samples:', (df['collapse'] == 0).sum())
+
+# Identify continuous collapse sections
+# A new section starts when current row is collapse (1) and previous row is not collapse (0)
+df['is_collapse'] = df['collapse'] == 1
+df['collapse_section'] = (df['is_collapse'] & (~df['is_collapse'].shift(1, fill_value=False))).cumsum()
+
+# Set section to 0 for non-collapse rows
+df.loc[~df['is_collapse'], 'collapse_section'] = 0
+
+print('\nCollapse sections found:', df[df['collapse_section'] > 0]['collapse_section'].nunique())
+print('\nCollapse section sizes:')
+section_sizes = df[df['collapse_section'] > 0].groupby('collapse_section').size()
+print(section_sizes)
+print('\nStatistics:')
+print('Mean section size:', section_sizes.mean())
+print('Median section size:', section_sizes.median())
+print('Min section size:', section_sizes.min())
+print('Max section size:', section_sizes.max())
+
+# Show some examples
+print('\n\nExample rows from first collapse section:')
+first_section = df[df['collapse_section'] == 1][['Chainage', 'Tunnellength [m]', 'collapse', 'collapse_section']].head(10)
+print(first_section)
+
+print('\n\nExample rows from second collapse section:')
+second_section = df[df['collapse_section'] == 2][['Chainage', 'Tunnellength [m]', 'collapse', 'collapse_section']].head(10)
+print(second_section)
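A tiny worked example (made-up data, not part of the commit) of the section-indexing trick in the script above, where a new section id is assigned at each 0 to 1 transition and non-collapse rows are reset to 0:

```python
# Toy illustration (made-up data) of the run-labeling trick used above.
import pandas as pd

toy = pd.DataFrame({"collapse": [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]})
is_c = toy["collapse"] == 1
toy["collapse_section"] = (is_c & ~is_c.shift(1, fill_value=False)).cumsum()
toy.loc[~is_c, "collapse_section"] = 0
print(toy["collapse_section"].tolist())  # [0, 1, 1, 0, 0, 2, 0, 3, 3, 3]
```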

scripts/EDA.py

Lines changed: 15 additions & 15 deletions
@@ -122,24 +122,24 @@ def main(cfg: DictConfig) -> None:
     console.print(table)
     console.print()
 
-    # Generate profile report
-    console.print("[bold green]Generating profile report...[/bold green]")
-    from pathlib import Path
-    try:
-        from ydata_profiling import ProfileReport
+    # # Generate profile report
+    # console.print("[bold green]Generating profile report...[/bold green]")
+    # from pathlib import Path
+    # try:
+    #     from ydata_profiling import ProfileReport
 
-        report_dir = Path("analyses/profile_report")
-        report_dir.mkdir(parents=True, exist_ok=True)
+    #     report_dir = Path("analyses/profile_report")
+    #     report_dir.mkdir(parents=True, exist_ok=True)
 
-        profile = ProfileReport(df, title="TBM Tunnel Data Profile Report", explorative=True)
-        report_path = report_dir / "tbm_data_profile_report.html"
-        profile.to_file(report_path)
+    #     profile = ProfileReport(df, title="TBM Tunnel Data Profile Report", explorative=True)
+    #     report_path = report_dir / "tbm_data_profile_report.html"
+    #     profile.to_file(report_path)
 
-        console.print(f"[bold green]Profile report saved to: {report_path}[/bold green]")
-    except ImportError as e:
-        console.print(f"[bold yellow]Warning: Could not generate profile report. Error: {e}[/bold yellow]")
-        console.print("[bold yellow]Try installing setuptools: uv add setuptools[/bold yellow]")
-        console.print()
+    #     console.print(f"[bold green]Profile report saved to: {report_path}[/bold green]")
+    # except ImportError as e:
+    #     console.print(f"[bold yellow]Warning: Could not generate profile report. Error: {e}[/bold yellow]")
+    #     console.print("[bold yellow]Try installing setuptools: uv add setuptools[/bold yellow]")
+    #     console.print()
 
 
 if __name__ == "__main__":
scripts/analyze_collapse_sections.py

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+#!/usr/bin/env python3
+"""
+Analyze collapse sections in the TBM dataset.
+
+This script provides detailed analysis of continuous collapse sections,
+including distribution, sizes, and recommendations for train/test splitting.
+
+Usage:
+    python scripts/analyze_collapse_sections.py
+"""
+
+import pandas as pd
+import matplotlib.pyplot as plt
+from pathlib import Path
+
+
+def analyze_collapse_sections(data_path: str = "data/model_ready/dataset_total.csv"):
+    """Analyze collapse sections in the dataset."""
+
+    # Load data
+    df = pd.read_csv(data_path)
+
+    print("=" * 80)
+    print("COLLAPSE SECTION ANALYSIS")
+    print("=" * 80)
+
+    # Basic statistics
+    total_samples = len(df)
+    collapse_samples = df['collapse'].sum()
+    non_collapse_samples = (df['collapse'] == 0).sum()
+
+    print(f"\nDataset Overview:")
+    print(f"  Total samples: {total_samples:,}")
+    print(f"  Collapse samples: {collapse_samples:,} ({collapse_samples/total_samples*100:.2f}%)")
+    print(f"  Non-collapse samples: {non_collapse_samples:,} ({non_collapse_samples/total_samples*100:.2f}%)")
+
+    # Collapse section analysis
+    collapse_df = df[df['collapse_section'] > 0]
+    n_sections = collapse_df['collapse_section'].nunique()
+
+    print(f"\nCollapse Sections:")
+    print(f"  Total continuous collapse sections: {n_sections}")
+
+    # Section sizes
+    section_sizes = collapse_df.groupby('collapse_section').size().sort_values(ascending=False)
+
+    print(f"\nSection Size Statistics:")
+    print(f"  Mean: {section_sizes.mean():.1f} samples")
+    print(f"  Median: {section_sizes.median():.1f} samples")
+    print(f"  Min: {section_sizes.min()} samples (Section {section_sizes.idxmin()})")
+    print(f"  Max: {section_sizes.max()} samples (Section {section_sizes.idxmax()})")
+    print(f"  Std Dev: {section_sizes.std():.1f} samples")
+
+    print(f"\nAll Collapse Sections (sorted by size):")
+    print("-" * 40)
+    for section_id, size in section_sizes.items():
+        print(f"  Section {section_id:2d}: {size:3d} samples")
+
+    # Visualization
+    print("\n" + "=" * 80)
+    print("Generating visualization...")
+
+    fig, axes = plt.subplots(2, 1, figsize=(12, 8))
+
+    # Plot 1: Bar chart of section sizes
+    ax1 = axes[0]
+    section_sizes_sorted = section_sizes.sort_index()
+    ax1.bar(section_sizes_sorted.index, section_sizes_sorted.values,
+            color='#EF4927', alpha=0.7, edgecolor='black')
+    ax1.set_xlabel('Collapse Section ID', fontsize=11)
+    ax1.set_ylabel('Number of Samples', fontsize=11)
+    ax1.set_title('Collapse Section Sizes', fontsize=13, fontweight='bold')
+    ax1.grid(True, alpha=0.3, axis='y')
+    ax1.set_xticks(range(1, n_sections + 1))
+
+    # Add mean line
+    ax1.axhline(y=section_sizes.mean(), color='blue', linestyle='--',
+                linewidth=2, label=f'Mean: {section_sizes.mean():.1f}')
+    ax1.legend(fontsize=10)
+
+    # Plot 2: Cumulative distribution
+    ax2 = axes[1]
+    cumsum_sorted = section_sizes_sorted.cumsum()
+    ax2.plot(cumsum_sorted.index, cumsum_sorted.values,
+             marker='o', linewidth=2, markersize=6, color='#1B9AD7')
+    ax2.fill_between(cumsum_sorted.index, 0, cumsum_sorted.values, alpha=0.3, color='#1B9AD7')
+    ax2.set_xlabel('Collapse Section ID', fontsize=11)
+    ax2.set_ylabel('Cumulative Collapse Samples', fontsize=11)
+    ax2.set_title('Cumulative Distribution of Collapse Samples', fontsize=13, fontweight='bold')
+    ax2.grid(True, alpha=0.3)
+    ax2.set_xticks(range(1, n_sections + 1))
+
+    # Add 75% line for train/test split reference
+    target_75 = collapse_samples * 0.75
+    ax2.axhline(y=target_75, color='red', linestyle='--',
+                linewidth=2, label=f'75% split: {target_75:.0f} samples')
+    ax2.legend(fontsize=10)
+
+    plt.tight_layout()
+
+    # Save plot
+    output_path = Path("analyses") / "collapse_section_analysis.png"
+    output_path.parent.mkdir(exist_ok=True, parents=True)
+    plt.savefig(output_path, dpi=300, bbox_inches='tight')
+    print(f"Visualization saved to: {output_path}")
+
+    # Save detailed report
+    report_path = Path("analyses") / "collapse_section_report.csv"
+    section_report = pd.DataFrame({
+        'collapse_section': section_sizes.index,
+        'n_samples': section_sizes.values,
+        'cumulative_samples': section_sizes.sort_index().cumsum().values,
+        'percentage_of_total_collapses': (section_sizes.values / collapse_samples * 100).round(2)
+    })
+    section_report.to_csv(report_path, index=False)
+    print(f"Detailed report saved to: {report_path}")
+
+    print("\n" + "=" * 80)
+    print("Analysis complete!")
+    print("=" * 80)
+
+
+if __name__ == "__main__":
+    analyze_collapse_sections()
