Survival analysis of heart failure patients to identify key factors that distinguish survival from death. Students will learn visualization, statistical analysis, and machine learning techniques to predict patient outcomes.
Key Finding: Two features, serum creatinine and ejection fraction, are sufficient to distinguish survival from death across several different classifiers.
Below is a high-level overview of the main components of this project.
Dataset {heart_failure_clinical_records_dataset.csv}
Heart failure clinical records from 299 patients with 13 features including age, ejection fraction, serum creatinine, and follow-up time. Binary outcome: survival or death.
Week 1: Data Exploration {Week1.ipynb}
Introduction to the dataset, exploratory data analysis, and visualization techniques using Pandas, Seaborn, and Matplotlib.
Week 2: Statistical Analysis {Week2.ipynb}
Hypothesis testing (t-test, Mann-Whitney U), correlation analysis, multiple testing correction (FDR), feature variance analysis, and Variance Inflation Factor (VIF) for detecting multicollinearity.
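As a rough illustration of the FDR step, here is a minimal sketch of the Benjamini-Hochberg adjustment written with NumPy only. The function name `benjamini_hochberg` and the p-values are made up for this example; the notebook may well use a library routine instead.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices that sort p ascending
    ranked = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity, walking from the largest rank downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)   # restore original order
    return adjusted

# Hypothetical p-values from five per-feature tests (not real results)
p_vals = [0.001, 0.008, 0.039, 0.041, 0.60]
q_vals = benjamini_hochberg(p_vals)
significant = q_vals < 0.05
```

Features whose adjusted value falls below the chosen threshold (0.05 here, an assumption) would be kept as significant after correction.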
Week 3: Unsupervised Learning {Week3.ipynb}
Data normalization (Z-score), dimensionality reduction with PCA, K-Means clustering, hierarchical (agglomerative) clustering, confusion matrices, silhouette scores, and the elbow method.
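The Week 3 pipeline could look roughly like the sketch below. It runs on synthetic data rather than the project CSV, and the group sizes, feature count, and number of clusters are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two groups of 50 "patients" with 5 numeric features
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(3.0, 1.0, (50, 5))])

X_scaled = StandardScaler().fit_transform(X)       # Z-score each feature
X_pca = PCA(n_components=2).fit_transform(X_scaled)  # project to 2 components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
score = silhouette_score(X_pca, labels)            # in [-1, 1], higher is better
```

Repeating the last two lines for several values of `n_clusters` and comparing the scores is one common way to choose the number of clusters.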
Week 4: Supervised Learning {Week4.ipynb}
Train/test splitting with stratification, feature normalization, and classification algorithms including Logistic Regression, Random Forest, Support Vector Machines (SVM), and K-Nearest Neighbors (KNN). Model evaluation using accuracy, precision, recall, and F1-score metrics.
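A minimal sketch of the stratified split and evaluation flow, assuming an imbalanced synthetic dataset in place of the real records. The sample count, class weights, and choice of Logistic Regression are illustrative, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (assumed shape and class weights)
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)

# stratify=y keeps the class ratio similar in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# The scaler is fit on the training data only, inside the pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
    "f1": f1_score(y_test, y_pred, zero_division=0),
}
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into training.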
Week 5: Hyperparameter Optimization {Week5.ipynb}
Advanced techniques for tuning machine learning models using GridSearchCV, Random Search, and Bayesian Optimization with Optuna. Learn to efficiently search hyperparameter space and find optimal model configurations.
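For the grid search part, a small sketch with scikit-learn's `GridSearchCV` might look like this. The KNN pipeline, the parameter grid, and the F1 scoring choice are assumptions for illustration; Optuna is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the clinical records
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step-name prefixes ("knn__") route each parameter to the right pipeline step
param_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X, y)
best = search.best_params_   # best combination found by 5-fold cross-validation
```

The grid above has 4 x 2 = 8 candidates, each evaluated with 5-fold cross-validation; random and Bayesian search trade this exhaustive coverage for fewer evaluations.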
Week 6: Ensemble Methods & Boosting {Week6.ipynb}
From Random Forest to Gradient Boosting to LightGBM. Learn how ensemble methods combine weak learners into strong predictors. Includes hyperparameter tuning with Optuna, evaluation with 4 metrics, and hands-on exercises.
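To compare a bagging-style ensemble with a boosting-style one, something like the following sketch could be used. It substitutes scikit-learn's `GradientBoostingClassifier` for LightGBM to avoid an extra dependency, and the data and settings are synthetic assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the clinical records
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

models = {
    # Random Forest: many independent trees, predictions averaged
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Gradient Boosting: trees added one at a time to correct earlier errors
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
```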
Week 7: Feature Selection Methods {Week7.ipynb}
Why fewer features often outperform more: Lasso (L1 penalty), Elastic Net (L1+L2), and the mRMR (minimum Redundancy Maximum Relevance) filter method. Learn to identify which clinical features truly drive heart failure survival prediction.
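One way to see how an L1 penalty selects features is the sketch below, which fits an L1-penalized logistic regression and lists the features with non-zero coefficients. The synthetic data, the regularization strength `C=0.05`, and the use of logistic regression rather than a linear Lasso are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: with shuffle=False the 2 informative features are columns 0-1
X, y = make_classification(n_samples=300, n_features=12, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)

# Scale first so the penalty treats all features comparably
X_scaled = StandardScaler().fit_transform(X)

# Smaller C means a stronger L1 penalty and more coefficients driven to zero
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso_clf.fit(X_scaled, y)

coefs = lasso_clf.coef_.ravel()
selected = np.flatnonzero(coefs != 0)   # indices of features the model kept
```

Sweeping `C` over a range and watching which features enter `selected` first gives a rough ranking of feature importance under this model.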
Week 8: Deep Learning for Medical Data {Week8.ipynb}
Introduction to deep learning with Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks. Learn neural network fundamentals, backpropagation, activation functions, and how to apply deep learning to clinical prediction tasks.
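To make backpropagation and activation functions concrete, here is a small NumPy sketch of a dense network with one hidden layer, trained by full-batch gradient descent. It is deliberately not a CNN or LSTM, and the data, layer sizes, and learning rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                # synthetic inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)     # binary target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer of 4 tanh units, one sigmoid output unit
W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros((1, 1))
lr = 0.5
losses = []

for _ in range(300):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    losses.append(loss)

    # Backward pass (chain rule, binary cross-entropy with sigmoid output)
    dz2 = (p - y) / len(X)
    dW2 = h.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)      # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = float(np.mean((p > 0.5) == (y == 1)))   # training accuracy, last pass
```

Frameworks such as TensorFlow and Keras compute these gradients automatically; writing them out once by hand shows what the framework is doing.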
Week 9: AI Model Interpretation (PLS-DA & SHAP) {Week9.ipynb}
Interpretability and explainability in machine learning. Partial Least Squares Discriminant Analysis (PLS-DA) for supervised dimensionality reduction, and SHAP (SHapley Additive exPlanations) for understanding model predictions and feature importance in black-box models.
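The idea behind SHAP can be sketched without the `shap` library by computing exact Shapley values through brute-force enumeration of feature subsets. The toy risk function, its coefficients, and the patient and baseline values below are hypothetical, and replacing "missing" features with a single baseline value is a simplification of what SHAP does.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_risk(z):
    """Hypothetical risk score, not a fitted model."""
    creatinine, ejection_fraction, age = z
    return (0.8 * creatinine - 0.05 * ejection_fraction
            + 0.01 * age + 0.02 * creatinine * age)

patient = [2.0, 30.0, 70.0]     # made-up patient
reference = [1.0, 40.0, 60.0]   # made-up baseline
phi = shapley_values(toy_risk, patient, reference)
```

The contributions in `phi` sum to the difference between the patient's score and the baseline score, which is the additivity property that SHAP plots rely on. Enumeration grows exponentially with the number of features, so it is only practical for small examples like this one.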
Tutorials {Git_Tutorial.ipynb, Venv_Tutorial.ipynb}
Interactive Jupyter notebooks for Git basics and Python virtual environments. Also available as markdown guides in the tutorials/ folder.
| Week | Topic | Links |
|---|---|---|
| 1 | Data Exploration | Notebook, Seaborn Docs, Pandas Docs |
| 2 | Statistical Analysis | Notebook, Slides, Scipy Stats, Statsmodels VIF |
| 3 | Unsupervised Learning | Notebook, Slides, PCA Guide, Clustering |
| 4 | Supervised Learning | Notebook, Slides, Scikit-learn Classifiers, Model Evaluation |
| 5 | Hyperparameter Optimization | Notebook, Slides, Optuna Docs, GridSearchCV |
| 6 | Ensemble Methods & Boosting | Notebook, Slides, LightGBM Docs, Chicco & Jurman (2020) |
| 7 | Feature Selection Methods | Notebook, Scikit-learn Feature Selection |
| 8 | Deep Learning for Medical Data | Notebook, TensorFlow Docs, Keras API, Understanding Neural Networks |
| 9 | AI Model Interpretation (PLS-DA & SHAP) | Notebook, Slides, SHAP Docs, PLS-DA Guide |
This project is based on the paper by Chicco & Jurman (2020):
Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16.
Key Findings from the Paper:
- Applied several ML classifiers (Random Forest, Gradient Boosting, SVM, etc.) to predict survival
- Discovered that serum creatinine and ejection fraction alone achieve strong predictive performance
- Random Forest achieved the best results, with a Matthews Correlation Coefficient (MCC) of 0.418
- Feature ranking analysis revealed time, serum creatinine, and ejection fraction as top predictors
- Demonstrated that complex models with all 13 features do not significantly outperform simpler 2-feature models
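For readers unfamiliar with the MCC mentioned in the findings above, the sketch below computes it from the four confusion-matrix counts. The function name and the counts are made up for illustration and are not results from the paper.

```python
from math import sqrt

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for a binary survival classifier
mcc = mcc_from_counts(tp=20, tn=50, fp=10, fn=10)
perfect = mcc_from_counts(tp=30, tn=60, fp=0, fn=0)
```

MCC ranges from -1 to +1 and uses all four cells of the confusion matrix, which makes it more informative than accuracy on imbalanced datasets.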
Clinical Relevance:
- Serum creatinine indicates kidney function, often impaired in heart failure patients
- Ejection fraction measures heart pumping efficiency, a direct indicator of cardiac health
- These two biomarkers are routinely measured and can guide clinical decision-making
- Git Tutorial - Installation, setup, commands, workflows, branching
- Virtual Environment Tutorial - Python venv, package management, troubleshooting
- Scikit-learn - Machine learning
- Pandas - Data manipulation
- Seaborn - Data visualization
- LightGBM - Gradient boosting
- Optuna - Hyperparameter optimization
git clone https://github.com/MichiganDataScienceTeam/W26-MDST-Project_Heart-Failure-Survival-Analysis.git
cd W26-MDST-Project_Heart-Failure-Survival-Analysis
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
jupyter notebook
New to Git? Start with Git_Tutorial.ipynb
New to Python virtual environments? Check Venv_Tutorial.ipynb
Both tutorials are interactive Jupyter notebooks - just open and read along!
Leads
Sina Bonakdar
Terry Zhang
This project is licensed under the MIT License - see the LICENSE file for details.
