IlyasFardaouix/Sepsis-Detection

Sepsis Early Detection Project

A comprehensive machine learning pipeline for early detection of sepsis using Gradient Boosting algorithms with advanced time-series feature engineering and imbalanced dataset handling.

🎯 Project Overview

This project implements a complete ML pipeline for sepsis early detection with the following key features:

  • Gradient Boosting Models: Supports XGBoost, LightGBM, and CatBoost
  • Time-Series Feature Engineering: Advanced temporal feature extraction from clinical time-series data
  • Imbalanced Dataset Handling: Multiple techniques including SMOTE, ADASYN, and class weighting
  • Evaluation for Imbalanced Data: ROC-AUC, average precision, precision, recall, F1, and confusion matrices

📁 Project Structure

SEPSIS DETECTION PROJECT/
│
├── data/
│   ├── raw/                    # Raw input data
│   └── processed/              # Processed data
│
├── models/                     # Saved models and feature selectors
│
├── src/
│   ├── features/               # Feature engineering modules
│   │   ├── time_series_features.py
│   │   └── feature_selector.py
│   ├── models/                 # ML model implementations
│   │   ├── gradient_boosting_pipeline.py
│   │   └── imbalanced_handler.py
│   ├── utils/                  # Utility functions
│   │   ├── config_loader.py
│   │   └── logger.py
│   └── data_loader.py          # Data loading utilities
│
├── notebooks/                  # Jupyter notebooks for exploration
│
├── configs/                    # Configuration files
│   └── config.yaml
│
├── results/                    # Evaluation results and plots
│
├── logs/                       # Log files
│
├── train.py                    # Training script
├── evaluate.py                 # Evaluation script
├── inference.py                # Inference script
├── generate_sample_data.py     # Sample data generator
├── requirements.txt            # Python dependencies
└── README.md                   # This file

🚀 Quick Start

1. Installation

# Clone or navigate to the project directory
cd "SEPSIS DETECTION PROJECT"

# Install dependencies
pip install -r requirements.txt

2. Generate Sample Data

If you don't have your own data, you can generate sample data for testing:

python generate_sample_data.py

This will create a sample dataset in data/raw/sepsis_data.csv with:

  • 1000 patients
  • ~15% sepsis rate (imbalanced dataset)
  • Time-series vital signs and lab values
  • Realistic clinical patterns

3. Train the Model

python train.py

The training script will:

  • Load and preprocess the data
  • Perform time-series feature engineering
  • Handle imbalanced data using SMOTE (or other methods)
  • Train a Gradient Boosting model (XGBoost by default)
  • Evaluate on test set
  • Save the model and feature importance plots
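
The steps above do not say how the test set is formed. For per-patient time series, one common choice is to split by patient_id so that no patient contributes rows to both sides. The sketch below illustrates that idea in plain Python; split_by_patient and its parameters are hypothetical names, not part of train.py.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.2, seed=42):
    """Return (train_row_indices, test_row_indices) with no shared patients.

    `patient_ids` holds one entry per data row, so a patient with many
    hourly measurements appears many times.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)

    # Reserve a fraction of patients (not rows) for testing.
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])

    train_rows = [i for i, pid in enumerate(patient_ids) if pid not in test_ids]
    test_rows = [i for i, pid in enumerate(patient_ids) if pid in test_ids]
    return train_rows, test_rows


# Ten hypothetical patients with three hourly rows each.
patient_ids = [pid for pid in range(10) for _ in range(3)]
train_rows, test_rows = split_by_patient(patient_ids)
```

Splitting by row instead would let near-identical consecutive measurements from one patient land in both sets, which tends to inflate test metrics.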

4. Evaluate the Model

python evaluate.py

This will:

  • Load the trained model
  • Evaluate on the full dataset (this includes the training rows, so expect more optimistic numbers than the test-set metrics)
  • Generate ROC curves, PR curves, and confusion matrices
  • Save evaluation metrics

5. Make Predictions

python inference.py --data data/raw/sepsis_data.csv --output results/predictions.csv
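
The command above takes a --data and an --output flag. A minimal sketch of how such a command-line interface could be declared with argparse is shown below. Only the two flag names come from this README; whether they are required, their defaults, and the build_parser helper are assumptions for illustration.

```python
import argparse


def build_parser():
    """Declare the two flags used in the example command."""
    parser = argparse.ArgumentParser(
        description="Score a CSV of patient measurements with a trained model."
    )
    # Assumption: the input path is mandatory.
    parser.add_argument("--data", required=True, help="Path to the input CSV")
    # Assumption: predictions default to the results/ directory.
    parser.add_argument(
        "--output",
        default="results/predictions.csv",
        help="Where to write the predictions CSV",
    )
    return parser


# Parse the same arguments as the example command.
args = build_parser().parse_args(
    ["--data", "data/raw/sepsis_data.csv", "--output", "results/predictions.csv"]
)
```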

⚙️ Configuration

All configuration is managed through configs/config.yaml. Key settings include:

Model Configuration

  • Algorithm: Choose between xgboost, lightgbm, or catboost
  • Hyperparameters: Learning rate, max depth, n_estimators, etc.
  • Early stopping: Configure early stopping rounds
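
To make the early stopping setting concrete, here is a small library-independent sketch of the rule: training stops once the validation loss has not improved for a given number of consecutive rounds, and the best round so far is kept. The function name and the loss values are made up for illustration; the boosting libraries implement this internally.

```python
def best_round_with_early_stopping(validation_losses, patience):
    """Return (best_round, best_loss) under an early-stopping rule.

    Iteration stops once `patience` consecutive rounds pass without
    the validation loss improving on the best value seen so far.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round, best_loss


# Hypothetical per-round validation losses.
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.60, 0.61, 0.55]
best_round, best_loss = best_round_with_early_stopping(losses, patience=3)
```

With patience=3 the loop stops before it reaches the final value of 0.55, which shows the trade-off: a small patience saves time but can miss a late improvement.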

Feature Engineering

  • Time window: Window size for rolling features (default: 6 hours)
  • Lookback hours: How far back to look for features (default: 24 hours)
  • Feature selection: Enable/disable and set max features

Imbalanced Learning

  • Method: Choose from smote, adasyn, smoteenn, class_weight, or none
  • Sampling strategy: Control the resampling ratio (for example, the desired minority-to-majority ratio after resampling)

Example Configuration

model:
  algorithm: "xgboost"
  learning_rate: 0.01
  max_depth: 6
  n_estimators: 1000

features:
  time_window_hours: 6
  lookback_hours: 24
  feature_selection: true
  max_features: 100

imbalanced_learning:
  method: "smote"
  k_neighbors: 5
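
Once the YAML above has been parsed into a dictionary (for example with PyYAML's yaml.safe_load), a few sanity checks can catch typos before a long training run. The sketch below is one possible approach, not the behavior of src/utils/config_loader.py. The allowed values are taken from the options listed in this section, and validate_config is a hypothetical helper.

```python
ALLOWED_ALGORITHMS = {"xgboost", "lightgbm", "catboost"}
ALLOWED_METHODS = {"smote", "adasyn", "smoteenn", "class_weight", "none"}

# (section, key) pairs that are expected to hold positive numbers.
POSITIVE_NUMBERS = [
    ("model", "learning_rate"),
    ("model", "max_depth"),
    ("model", "n_estimators"),
    ("features", "time_window_hours"),
    ("features", "lookback_hours"),
]


def validate_config(config):
    """Return a list of human-readable problems (empty if none)."""
    errors = []
    if config.get("model", {}).get("algorithm") not in ALLOWED_ALGORITHMS:
        errors.append(f"model.algorithm must be one of {sorted(ALLOWED_ALGORITHMS)}")
    if config.get("imbalanced_learning", {}).get("method") not in ALLOWED_METHODS:
        errors.append(f"imbalanced_learning.method must be one of {sorted(ALLOWED_METHODS)}")
    for section, key in POSITIVE_NUMBERS:
        value = config.get(section, {}).get(key)
        if not isinstance(value, (int, float)) or value <= 0:
            errors.append(f"{section}.{key} must be a positive number")
    return errors


# The example configuration, written as the dictionary a YAML parser would return.
config = {
    "model": {"algorithm": "xgboost", "learning_rate": 0.01,
              "max_depth": 6, "n_estimators": 1000},
    "features": {"time_window_hours": 6, "lookback_hours": 24,
                 "feature_selection": True, "max_features": 100},
    "imbalanced_learning": {"method": "smote", "k_neighbors": 5},
}
errors = validate_config(config)
```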

📊 Features

Time-Series Feature Engineering

The pipeline extracts comprehensive temporal features:

  1. Rolling Statistics

    • Mean, std, min, max, median
    • Percentiles (25th, 75th)
    • Coefficient of variation
  2. Trend Features

    • Linear trend slope
    • R-squared of trend
  3. Change Features

    • Absolute changes (1h, 3h, 6h)
    • Percentage changes
    • Rate of change
  4. Statistical Features

    • Recent statistics over lookback window
    • Skewness and kurtosis
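
As a rough illustration of the feature families listed above, the following sketch computes rolling statistics, a linear trend, and change features for a single patient's hourly series. It uses only the standard library to stay dependency-free and is not the code in src/features/time_series_features.py. All function names and the sample readings are hypothetical.

```python
from statistics import mean, median, pstdev


def rolling_stats(values, window):
    """Summary statistics over the last `window` observations."""
    recent = values[-window:]
    avg = mean(recent)
    std = pstdev(recent)
    return {
        "mean": avg,
        "std": std,
        "min": min(recent),
        "max": max(recent),
        "median": median(recent),
        # Coefficient of variation; undefined when the mean is zero.
        "cv": std / avg if avg != 0 else float("nan"),
    }


def linear_trend(values):
    """Least-squares slope per step and R-squared (needs at least 2 points)."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    syy = sum((y - y_bar) ** 2 for y in values)
    slope = sxy / sxx
    # A flat series has no variance to explain; report R-squared as 0.
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


def change_features(values, lag):
    """Absolute change, percentage change, and rate versus `lag` steps earlier."""
    current, previous = values[-1], values[-1 - lag]
    absolute = current - previous
    percent = 100.0 * absolute / previous if previous != 0 else float("nan")
    return {"abs_change": absolute, "pct_change": percent, "rate": absolute / lag}


# Hourly heart-rate readings for one hypothetical patient (oldest first).
heart_rate = [72, 74, 75, 79, 84, 90, 97]
stats = rolling_stats(heart_rate, window=6)
slope, r2 = linear_trend(heart_rate[-6:])
delta_3h = change_features(heart_rate, lag=3)
```

In a real pipeline these would be computed per patient and per timestamp, using only measurements available up to that timestamp so that no future information leaks into the features.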

Imbalanced Dataset Handling

Multiple techniques are supported:

  • SMOTE: Synthetic Minority Oversampling Technique
  • ADASYN: Adaptive Synthetic Sampling
  • SMOTEENN: SMOTE + Edited Nearest Neighbours
  • Class Weights: Automatic class weight calculation
  • None: Train without resampling
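
For the class-weight option, a widely used heuristic weights each class by n_samples / (n_classes * class_count), which is also what scikit-learn's "balanced" mode computes. The sketch below shows that arithmetic for a 15% positive rate. It is an illustration of the heuristic, not necessarily the exact calculation in src/models/imbalanced_handler.py, and the function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def negative_to_positive_ratio(labels):
    """Ratio of negatives to positives, a common starting value for the
    scale_pos_weight parameter in XGBoost and LightGBM."""
    counts = Counter(labels)
    return counts[0] / counts[1]


# 100 hypothetical labels with a 15% sepsis rate.
labels = [0] * 85 + [1] * 15
weights = balanced_class_weights(labels)
scale = negative_to_positive_ratio(labels)
```

Here the sepsis class receives a weight of about 3.33 and the non-sepsis class about 0.59, so errors on the rare class cost more during training.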

Model Algorithms

Three gradient boosting libraries are supported:

  1. XGBoost: Extreme Gradient Boosting
  2. LightGBM: Light Gradient Boosting Machine
  3. CatBoost: Categorical Boosting

📈 Evaluation Metrics

The pipeline evaluates models using:

  • ROC-AUC: Area under the ROC curve
  • Average Precision: Summary of the precision-recall curve (approximates the area under it)
  • Precision: Positive predictive value
  • Recall: Sensitivity
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed classification breakdown
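
The definitions above can be made concrete with a few lines of plain Python. This sketch computes the confusion-matrix counts, precision, recall, F1, and ROC-AUC (as the probability that a random positive is scored above a random negative) on a tiny made-up example. The project itself presumably relies on scikit-learn for these, so the helper names are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tn, fp, fn, tp


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (0.0 when undefined)."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the number of samples,
    so suitable for small examples only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.05, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```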

📝 Data Format

Your input data should be a CSV file with the following columns:

Required Columns

  • patient_id: Unique patient identifier
  • time: Timestamp for each measurement
  • sepsis_label: Binary target (0 = no sepsis, 1 = sepsis)

Feature Columns

All other numeric columns are used as features, so categorical fields such as gender must be numerically encoded. Common clinical features include:

  • Vital Signs: temperature, heart_rate, respiratory_rate, systolic_bp, oxygen_saturation
  • Lab Values: wbc_count, lactate, creatinine, bilirubin
  • Demographics: age, gender
  • Clinical: icu_admission, hours_since_admission

Example Data Format

patient_id,time,temperature,heart_rate,respiratory_rate,systolic_bp,oxygen_saturation,wbc_count,lactate,creatinine,bilirubin,age,gender,icu_admission,sepsis_label
0,2023-01-01 00:00:00,36.8,75,16,120,98,7.0,1.0,0.9,0.8,45,1,0,0
0,2023-01-01 01:00:00,37.2,78,17,118,97,7.2,1.1,0.9,0.8,45,1,0,0
...
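
Before training on your own file, it can help to check it against the format described above. The sketch below validates the required columns, the binary label, and that feature cells are numeric, using only the csv module. Treating blank cells as acceptable missing values is an assumption, and validate_csv is a hypothetical helper rather than part of src/data_loader.py.

```python
import csv
import io

REQUIRED_COLUMNS = {"patient_id", "time", "sepsis_label"}
NON_FEATURE_COLUMNS = {"patient_id", "time"}


def validate_csv(csv_text):
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing required columns: {sorted(missing)}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["sepsis_label"] not in {"0", "1"}:
            problems.append(f"line {line_no}: sepsis_label must be 0 or 1")
        for name, value in row.items():
            if name in NON_FEATURE_COLUMNS or value in ("", None):
                continue  # assumption: blank cells are missing values
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {name} is not numeric: {value!r}")
    return problems


# A shortened sample with one deliberately bad heart_rate cell.
sample = """patient_id,time,temperature,heart_rate,sepsis_label
0,2023-01-01 00:00:00,36.8,75,0
0,2023-01-01 01:00:00,37.2,high,0
"""
problems = validate_csv(sample)
```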

🔧 Advanced Usage

Custom Feature Engineering

You can extend the feature engineering by modifying src/features/time_series_features.py, or call the feature engineer directly from your own code:

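import pandas as pd

from src.features import TimeSeriesFeatureEngineer

# Load the raw measurements (one row per patient per timestamp)
df = pd.read_csv("data/raw/sepsis_data.csv", parse_dates=["time"])

engineer = TimeSeriesFeatureEngineer(
    time_window_hours=6,   # window size for rolling features
    lookback_hours=24      # how far back to look for features
)

features = engineer.create_features(df)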

Custom Model Training

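from src.models import GradientBoostingPipeline

# X_train, y_train, and X_test come from your own train/test split
pipeline = GradientBoostingPipeline(
    algorithm='xgboost',
    model_params={
        # Deeper trees and more rounds than the example configuration
        'max_depth': 8,
        'learning_rate': 0.01,
        'n_estimators': 2000
    }
)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)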

Handling Imbalanced Data

from src.models import ImbalancedDataHandler

handler = ImbalancedDataHandler(
    method='smote',
    k_neighbors=5
)

# Resample the training split only; leave validation and test data untouched
X_resampled, y_resampled = handler.fit_resample(X_train, y_train)
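
For intuition about what the smote method does, the core idea is to create each synthetic minority sample by interpolating between a real minority sample and one of its k nearest minority neighbours. The sketch below shows only that interpolation step in plain Python. It is a simplified illustration, not the imbalanced-learn implementation, and smote_sketch is a hypothetical name.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k_neighbors=5, seed=0):
    """Generate synthetic points between minority samples and their neighbours.

    `minority` is a list of equal-length numeric feature vectors and must
    contain at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest other minority samples by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k_neighbors]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Four hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
synthetic = smote_sketch(minority, n_synthetic=6, k_neighbors=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.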

📊 Results

After training and evaluation, results are saved in the results/ directory:

  • test_metrics.csv: Performance metrics on test set
  • evaluation_metrics.csv: Full dataset evaluation metrics
  • feature_importance.png: Top 20 most important features
  • confusion_matrix.png: Confusion matrix visualization
  • roc_curve.png: ROC curve plot
  • pr_curve.png: Precision-Recall curve plot

🧪 Testing with Sample Data

The included generate_sample_data.py creates realistic synthetic data:

python generate_sample_data.py

Parameters can be adjusted:

  • n_patients: Number of patients (default: 1000)
  • sepsis_rate: Proportion with sepsis (default: 0.15)
  • hours_per_patient: Average hours of data (default: 48)
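
To show how these three parameters could shape a synthetic dataset, here is a heavily simplified generator. It is not the code in generate_sample_data.py: it produces a single vital sign, uses an hour index instead of timestamps, and labels every row of a septic patient as positive, all of which are assumptions made to keep the sketch short.

```python
import random


def generate_patients(n_patients=1000, sepsis_rate=0.15, hours_per_patient=48, seed=0):
    """Return a list of row dictionaries, one per patient per hour."""
    rng = random.Random(seed)
    rows = []
    for patient_id in range(n_patients):
        septic = rng.random() < sepsis_rate
        # Stay length varies around the requested average, minimum one hour.
        n_hours = max(1, round(rng.gauss(hours_per_patient, hours_per_patient * 0.25)))
        for hour in range(n_hours):
            # Assumption: septic patients drift towards a higher heart rate.
            drift = 25.0 * hour / n_hours if septic else 0.0
            rows.append({
                "patient_id": patient_id,
                "hour": hour,
                "heart_rate": round(rng.gauss(75 + drift, 5), 1),
                "sepsis_label": int(septic),
            })
    return rows


rows = generate_patients()
septic_patients = {r["patient_id"] for r in rows if r["sepsis_label"] == 1}
observed_rate = len(septic_patients) / 1000
```

Fixing the seed makes the output reproducible, which is useful when comparing model changes against the same sample data.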

📚 Dependencies

Key dependencies:

  • numpy: Numerical computing
  • pandas: Data manipulation
  • scikit-learn: Machine learning utilities
  • xgboost: XGBoost algorithm
  • lightgbm: LightGBM algorithm
  • catboost: CatBoost algorithm
  • imbalanced-learn: Imbalanced dataset handling
  • matplotlib/seaborn: Visualization
  • scipy: Scientific computing

See requirements.txt for the complete list.

🎓 Model Performance

The pipeline is optimized for imbalanced clinical datasets. Typical performance on imbalanced data (15% positive class):

  • ROC-AUC: > 0.85
  • Average Precision: > 0.70
  • Recall: > 0.80 (high sensitivity for early detection)
  • F1-Score: > 0.75

Note: Actual performance depends on data quality and characteristics.
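
Recall and precision depend on the probability threshold used to turn scores into alerts, so a recall target such as 0.80 is usually met by choosing the threshold rather than by the model alone. The sketch below picks the highest threshold that reaches a target recall and reports the precision paid for it. The function name and the data are hypothetical, and the README does not state how the project sets its threshold.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall is at least `target_recall`.

    A sample is predicted positive when its score is >= threshold.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true contains no positive samples")
    # Try each observed score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        recall = tp / n_pos
        if recall >= target_recall:
            fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
            return threshold, recall, tp / (tp + fp)
    raise ValueError("target_recall must be at most 1.0")


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.05, 0.9]
threshold, recall, precision = threshold_for_recall(y_true, scores, target_recall=0.8)
```

In this toy example, reaching the recall target means lowering the threshold to 0.4, which admits one false alarm and drops precision to 0.75.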

🔍 Feature Importance

The pipeline automatically generates feature importance plots showing which features are most predictive of sepsis. This helps with:

  • Model interpretability
  • Feature selection
  • Clinical understanding

🚨 Important Notes

  1. Data Privacy: This is a research/educational project. Ensure compliance with healthcare data regulations (HIPAA, GDPR, etc.) when using real patient data.

  2. Medical Disclaimer: This model is for research purposes only and should not be used for actual clinical decision-making without proper validation and regulatory approval.

  3. Hyperparameter Tuning: The default hyperparameters are a good starting point, but you may want to tune them for your specific dataset.

  4. Data Quality: Model performance heavily depends on data quality. Ensure your data is clean and properly formatted.

🤝 Contributing

This is a complete project template. You can extend it by:

  • Adding more feature engineering techniques
  • Implementing additional models
  • Adding hyperparameter tuning
  • Creating visualization dashboards
  • Adding unit tests

📄 License

This project is provided as-is for educational and research purposes.

📧 Contact

For questions or issues, please refer to the project documentation or create an issue in the repository.


Built with ❤️ for early sepsis detection
