A comprehensive machine learning pipeline for early detection of sepsis using Gradient Boosting algorithms with advanced time-series feature engineering and imbalanced dataset handling.
This project implements a complete ML pipeline for early sepsis detection with the following key features:
- Gradient Boosting Models: Supports XGBoost, LightGBM, and CatBoost
- Time-Series Feature Engineering: Advanced temporal feature extraction from clinical time-series data
- Imbalanced Dataset Handling: Multiple techniques including SMOTE, ADASYN, and class weighting
- High Performance: Optimized for imbalanced clinical datasets with comprehensive evaluation metrics
SEPSIS DETECTION PROJECT/
│
├── data/
│ ├── raw/ # Raw input data
│ └── processed/ # Processed data
│
├── models/ # Saved models and feature selectors
│
├── src/
│ ├── features/ # Feature engineering modules
│ │ ├── time_series_features.py
│ │ └── feature_selector.py
│ ├── models/ # ML model implementations
│ │ ├── gradient_boosting_pipeline.py
│ │ └── imbalanced_handler.py
│ ├── utils/ # Utility functions
│ │ ├── config_loader.py
│ │ └── logger.py
│ └── data_loader.py # Data loading utilities
│
├── notebooks/ # Jupyter notebooks for exploration
│
├── configs/ # Configuration files
│ └── config.yaml
│
├── results/ # Evaluation results and plots
│
├── logs/ # Log files
│
├── train.py # Training script
├── evaluate.py # Evaluation script
├── inference.py # Inference script
├── generate_sample_data.py # Sample data generator
├── requirements.txt # Python dependencies
└── README.md # This file
# Clone or navigate to the project directory
cd "SEPSIS DETECTION PROJECT"
# Install dependencies
pip install -r requirements.txt
If you don't have your own data, you can generate sample data for testing:
python generate_sample_data.py
This will create a sample dataset in data/raw/sepsis_data.csv with:
- 1000 patients
- ~15% sepsis rate (imbalanced dataset)
- Time-series vital signs and lab values
- Realistic clinical patterns
python train.py
The training script will:
- Load and preprocess the data
- Perform time-series feature engineering
- Handle imbalanced data using SMOTE (or other methods)
- Train a Gradient Boosting model (XGBoost by default)
- Evaluate on test set
- Save the model and feature importance plots
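The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions, not the contents of train.py: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, class weighting stands in for SMOTE, and the data comes from `make_classification` rather than the project's loader. A row-level split is used for brevity; with several rows per patient, splitting by patient would avoid leakage between train and test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix (about 15% positives).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Stratify so the test set keeps the same class balance as the full data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Class weighting (stand-in for SMOTE): up-weight the minority class.
n_pos = int(y_train.sum())
n_neg = len(y_train) - n_pos
sample_weight = np.where(y_train == 1, n_neg / n_pos, 1.0)

# Early stopping: hold out 10% of the training data internally and stop
# when the validation score has not improved for 20 rounds.
model = GradientBoostingClassifier(
    learning_rate=0.05, max_depth=3, n_estimators=300,
    validation_fraction=0.1, n_iter_no_change=20, random_state=0,
)
model.fit(X_train, y_train, sample_weight=sample_weight)

test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"trees used: {model.n_estimators_}, test ROC-AUC: {test_auc:.3f}")
```

The same outline should carry over to the other boosting libraries, although their early-stopping and weighting parameters are named differently.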
python evaluate.py
This will:
- Load the trained model
- Evaluate on the full dataset
- Generate ROC curves, PR curves, and confusion matrices
- Save evaluation metrics
python inference.py --data data/raw/sepsis_data.csv --output results/predictions.csv
All configuration is managed through configs/config.yaml. Key settings include:
- Algorithm: Choose between xgboost, lightgbm, or catboost
- Hyperparameters: Learning rate, max depth, n_estimators, etc.
- Early stopping: Configure early stopping rounds
- Time window: Window size for rolling features (default: 6 hours)
- Lookback hours: How far back to look for features (default: 24 hours)
- Feature selection: Enable/disable and set max features
- Method: Choose from smote, adasyn, smoteenn, class_weight, or none
- Sampling strategy: Control the resampling ratio
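Because the algorithm and resampling method are free-text settings, it can help to check them before training starts. The sketch below validates an already-parsed configuration dictionary. `validate_config` is a hypothetical helper, not part of src/utils/config_loader.py, and the rule that the lookback must be at least as long as the rolling window is an assumption.

```python
ALLOWED_ALGORITHMS = {"xgboost", "lightgbm", "catboost"}
ALLOWED_METHODS = {"smote", "adasyn", "smoteenn", "class_weight", "none"}


def validate_config(config: dict) -> dict:
    """Return the key settings with defaults filled in, or raise ValueError."""
    model = config.get("model", {})
    features = config.get("features", {})
    imbalance = config.get("imbalanced_learning", {})

    algorithm = model.get("algorithm", "xgboost")
    if algorithm not in ALLOWED_ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")

    method = imbalance.get("method", "smote")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown imbalance method: {method!r}")

    window = features.get("time_window_hours", 6)
    lookback = features.get("lookback_hours", 24)
    # Assumed rule: the lookback must cover at least one rolling window.
    if window <= 0 or lookback < window:
        raise ValueError("need 0 < time_window_hours <= lookback_hours")

    return {
        "algorithm": algorithm,
        "method": method,
        "time_window_hours": window,
        "lookback_hours": lookback,
    }


settings = validate_config({"model": {"algorithm": "lightgbm"}})
print(settings)
```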
model:
  algorithm: "xgboost"
  learning_rate: 0.01
  max_depth: 6
  n_estimators: 1000
features:
  time_window_hours: 6
  lookback_hours: 24
  feature_selection: true
  max_features: 100
imbalanced_learning:
  method: "smote"
  k_neighbors: 5

The pipeline extracts comprehensive temporal features:
- Rolling Statistics
  - Mean, std, min, max, median
  - Percentiles (25th, 75th)
  - Coefficient of variation
- Trend Features
  - Linear trend slope
  - R-squared of trend
- Change Features
  - Absolute changes (1h, 3h, 6h)
  - Percentage changes
  - Rate of change
- Statistical Features
  - Recent statistics over lookback window
  - Skewness and kurtosis
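A few of the features listed above can be sketched with pandas. This is a minimal illustration, not the code in src/features/time_series_features.py: `add_temporal_features` and `window_slope` are hypothetical names, the output column names are made up, and the sketch assumes one row per patient per hour so that a window of N rows spans N hours.

```python
import numpy as np
import pandas as pd


def window_slope(values: np.ndarray) -> float:
    """Least-squares slope of the values against their row position."""
    mask = ~np.isnan(values)
    if mask.sum() < 2:
        return np.nan
    x = np.arange(len(values), dtype=float)[mask]
    return float(np.polyfit(x, values[mask], 1)[0])


def add_temporal_features(df: pd.DataFrame, col: str, window: int = 6) -> pd.DataFrame:
    """Add rolling, change, and trend features for one column, per patient."""
    out = df.sort_values(["patient_id", "time"]).copy()
    # Group by patient so no window or shift crosses patient boundaries.
    grouped = out[col].groupby(out["patient_id"])

    # Rolling statistics over the trailing window (current row included).
    for stat in ("mean", "std", "min", "max", "median"):
        out[f"{col}_roll_{stat}"] = grouped.transform(
            lambda s, stat=stat: getattr(s.rolling(window, min_periods=1), stat)()
        )

    # Change features: absolute change over 1 and 3 rows, percent change over 1.
    previous = grouped.shift(1)
    out[f"{col}_delta_1h"] = out[col] - previous
    out[f"{col}_delta_3h"] = out[col] - grouped.shift(3)
    out[f"{col}_pct_1h"] = (out[col] - previous) / previous

    # Trend feature: slope of a straight line fitted within each window.
    out[f"{col}_slope"] = grouped.transform(
        lambda s: s.rolling(window, min_periods=2).apply(window_slope, raw=True)
    )
    return out


hours = pd.Timestamp("2023-01-01") + pd.to_timedelta(np.arange(6), unit="h")
demo = pd.DataFrame({
    "patient_id": [0] * 6 + [1] * 6,
    "time": list(hours) * 2,
    "heart_rate": [70.0, 72.0, 74.0, 76.0, 78.0, 80.0] + [90.0] * 6,
})
feats = add_temporal_features(demo, "heart_rate", window=3)
print(feats.filter(like="heart_rate").tail(3))
```

Grouping by `patient_id` before every rolling or shift operation is the important design choice: without it, the first rows of one patient would borrow values from the previous patient.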
Multiple techniques are supported:
- SMOTE: Synthetic Minority Oversampling Technique
- ADASYN: Adaptive Synthetic Sampling
- SMOTEENN: SMOTE + Edited Nearest Neighbours
- Class Weights: Automatic class weight calculation
- None: Train without resampling
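Of the options above, class weighting is the simplest to show without extra dependencies. The sketch below computes "balanced" weights with NumPy, using the usual n_samples / (n_classes * class_count) formula. `balanced_class_weights` is a hypothetical helper, and whether src/models/imbalanced_handler.py computes weights this way is an assumption.

```python
import numpy as np


def balanced_class_weights(y: np.ndarray) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# 15% positive class, matching the sample data's sepsis rate.
y = np.array([0] * 85 + [1] * 15)

weights = balanced_class_weights(y)
# Per-row weights, usable as the sample_weight argument of a fit() call.
sample_weight = np.array([weights[int(label)] for label in y])
# Ratio of negatives to positives, the value commonly passed as
# XGBoost's scale_pos_weight parameter.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()

print(weights, round(float(scale_pos_weight), 2))
```

With these weights each class contributes the same total weight to the loss, so the minority class is not drowned out and no synthetic rows are created.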
Three gradient boosting libraries are supported:
- XGBoost: Extreme Gradient Boosting
- LightGBM: Light Gradient Boosting Machine
- CatBoost: Categorical Boosting
The pipeline evaluates models using:
- ROC-AUC: Area under the ROC curve
- Average Precision: Area under the PR curve
- Precision: Positive predictive value
- Recall: Sensitivity
- F1-Score: Harmonic mean of precision and recall
- Confusion Matrix: Detailed classification breakdown
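All of the metrics above are available in scikit-learn. The sketch below collects them in one dictionary. `evaluate_predictions` is a hypothetical helper and the 0.5 decision threshold is an assumption; evaluate.py may choose a different one.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_predictions(y_true, y_prob, threshold: float = 0.5) -> dict:
    """Compute ranking metrics from probabilities and the rest from hard labels."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        # Threshold-free metrics use the predicted probabilities.
        "roc_auc": roc_auc_score(y_true, y_prob),
        "average_precision": average_precision_score(y_true, y_prob),
        # Threshold-dependent metrics use the hard predictions.
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
    }


y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.4])
metrics = evaluate_predictions(y_true, y_prob)
print(metrics)
```

On imbalanced data, average precision is usually the more informative of the two ranking metrics, because ROC-AUC can look high even when most flagged patients are false alarms.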
Your input data should be a CSV file with the following columns:
- patient_id: Unique patient identifier
- time: Timestamp for each measurement
- sepsis_label: Binary target (0 = no sepsis, 1 = sepsis)
All other numeric columns are used as features. Common clinical features include:
- Vital Signs: temperature, heart_rate, respiratory_rate, systolic_bp, oxygen_saturation
- Lab Values: wbc_count, lactate, creatinine, bilirubin
- Demographics: age, gender
- Clinical: icu_admission, hours_since_admission
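A loader can check these requirements before any feature engineering runs. The sketch below is one way to do that with pandas. `load_sepsis_csv` is a hypothetical function, not the API of src/data_loader.py, and treating every remaining numeric column as a feature follows the description above.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["patient_id", "time", "sepsis_label"]


def load_sepsis_csv(source):
    """Read the CSV, validate required columns, and list the feature columns."""
    df = pd.read_csv(source)

    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    df["time"] = pd.to_datetime(df["time"])
    if not df["sepsis_label"].isin([0, 1]).all():
        raise ValueError("sepsis_label must contain only 0 and 1")

    # Order rows chronologically within each patient.
    df = df.sort_values(["patient_id", "time"]).reset_index(drop=True)

    # Every remaining numeric column is treated as a feature.
    feature_cols = [
        c for c in df.select_dtypes("number").columns if c not in REQUIRED_COLUMNS
    ]
    return df, feature_cols


csv_text = (
    "patient_id,time,temperature,heart_rate,sepsis_label\n"
    "0,2023-01-01 01:00:00,37.2,78,0\n"
    "0,2023-01-01 00:00:00,36.8,75,0\n"
)
data, feature_cols = load_sepsis_csv(io.StringIO(csv_text))
print(feature_cols)
```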
patient_id,time,temperature,heart_rate,respiratory_rate,systolic_bp,oxygen_saturation,wbc_count,lactate,creatinine,bilirubin,age,gender,icu_admission,sepsis_label
0,2023-01-01 00:00:00,36.8,75,16,120,98,7.0,1.0,0.9,0.8,45,1,0,0
0,2023-01-01 01:00:00,37.2,78,17,118,97,7.2,1.1,0.9,0.8,45,1,0,0
...
You can extend the feature engineering by modifying src/features/time_series_features.py:
from src.features import TimeSeriesFeatureEngineer

# df: DataFrame in the input format above (patient_id, time, numeric columns)
engineer = TimeSeriesFeatureEngineer(
    time_window_hours=6,
    lookback_hours=24
)
features = engineer.create_features(df)

from src.models import GradientBoostingPipeline
pipeline = GradientBoostingPipeline(
    algorithm='xgboost',
    model_params={
        'max_depth': 8,
        'learning_rate': 0.01,
        'n_estimators': 2000
    }
)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

from src.models import ImbalancedDataHandler
handler = ImbalancedDataHandler(
    method='smote',
    k_neighbors=5
)
# Resample the training split only, so the test set keeps its original class balance
X_resampled, y_resampled = handler.fit_resample(X_train, y_train)

After training and evaluation, results are saved in the results/ directory:
- test_metrics.csv: Performance metrics on test set
- evaluation_metrics.csv: Full dataset evaluation metrics
- feature_importance.png: Top 20 most important features
- confusion_matrix.png: Confusion matrix visualization
- roc_curve.png: ROC curve plot
- pr_curve.png: Precision-Recall curve plot
The included generate_sample_data.py creates realistic synthetic data:
python generate_sample_data.py
Parameters can be adjusted:
- n_patients: Number of patients (default: 1000)
- sepsis_rate: Proportion with sepsis (default: 0.15)
- hours_per_patient: Average hours of data (default: 48)
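To show how these three parameters shape the output, here is a much-simplified generator. It is a sketch, not the code in generate_sample_data.py: `make_sample_data` is a hypothetical function, only two measurements are simulated, the upward drift for septic patients is an invented pattern, and the label is assumed constant per patient.

```python
import numpy as np
import pandas as pd


def make_sample_data(n_patients=1000, sepsis_rate=0.15, hours_per_patient=48, seed=0):
    """Generate hourly synthetic rows for each patient."""
    rng = np.random.default_rng(seed)
    frames = []
    for pid in range(n_patients):
        septic = rng.random() < sepsis_rate
        # Stay length varies around the requested average.
        n_hours = max(2, int(rng.poisson(hours_per_patient)))
        hours = np.arange(n_hours)
        # Invented pattern: septic patients drift upward over their stay.
        drift = hours / n_hours if septic else np.zeros(n_hours)
        frames.append(pd.DataFrame({
            "patient_id": pid,
            "time": pd.Timestamp("2023-01-01") + pd.to_timedelta(hours, unit="h"),
            "heart_rate": rng.normal(80, 5, n_hours) + 25 * drift,
            "lactate": rng.normal(1.0, 0.2, n_hours) + 2.0 * drift,
            "sepsis_label": int(septic),
        }))
    return pd.concat(frames, ignore_index=True)


sample = make_sample_data(n_patients=50, seed=1)
print(sample.groupby("patient_id")["sepsis_label"].first().mean())
```

Fixing the seed makes the dataset reproducible, which helps when comparing runs with different resampling methods.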
Key dependencies:
- numpy: Numerical computing
- pandas: Data manipulation
- scikit-learn: Machine learning utilities
- xgboost: XGBoost algorithm
- lightgbm: LightGBM algorithm
- catboost: CatBoost algorithm
- imbalanced-learn: Imbalanced dataset handling
- matplotlib/seaborn: Visualization
- scipy: Scientific computing
See requirements.txt for the complete list.
The pipeline is optimized for imbalanced clinical datasets. Typical performance on imbalanced data (15% positive class):
- ROC-AUC: > 0.85
- Average Precision: > 0.70
- Recall: > 0.80 (high sensitivity for early detection)
- F1-Score: > 0.75
Note: Actual performance depends on data quality and characteristics.
The pipeline automatically generates feature importance plots showing which features are most predictive of sepsis. This helps with:
- Model interpretability
- Feature selection
- Clinical understanding
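A ranking like the one behind those plots can be read from a fitted model's `feature_importances_` attribute. The sketch below uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for the project's pipeline. The feature names are placeholders, and how the project's own plotting code extracts importances is not shown here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# With shuffle=False the first three columns are the informative ones.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances sum to 1; sort them to get a ranking.
importance = pd.Series(model.feature_importances_, index=names)
importance = importance.sort_values(ascending=False)
top_features = importance.head(3)
print(top_features)
```

Tree-based importances describe what the model relies on, not clinical causation, so they are best read alongside domain knowledge.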
- Data Privacy: This is a research/educational project. Ensure compliance with healthcare data regulations (HIPAA, GDPR, etc.) when using real patient data.
- Medical Disclaimer: This model is for research purposes only and should not be used for actual clinical decision-making without proper validation and regulatory approval.
- Hyperparameter Tuning: The default hyperparameters are a good starting point, but you may want to tune them for your specific dataset.
- Data Quality: Model performance heavily depends on data quality. Ensure your data is clean and properly formatted.
This is a complete project template. You can extend it by:
- Adding more feature engineering techniques
- Implementing additional models
- Adding hyperparameter tuning
- Creating visualization dashboards
- Adding unit tests
This project is provided as-is for educational and research purposes.
For questions or issues, please refer to the project documentation or create an issue in the repository.
Built with ❤️ for early sepsis detection