This project aims to build a machine learning model that predicts whether a diabetic patient is likely to be readmitted to the hospital within 30 days. Preventing avoidable readmissions is vital for improving patient outcomes and reducing healthcare costs. By leveraging structured medical records and a feature importance analysis, we provide both accurate predictions and actionable insights.
Hospital readmissions are costly and often avoidable. This project develops a predictive solution that identifies high-risk patients at discharge, enabling hospitals to intervene proactively. Our model uses features from patient demographics, hospitalization history, medications, diagnoses, and more.
The dataset comes from the UCI Machine Learning Repository and contains over 100,000 hospital encounter records for diabetic patients from 130 U.S. hospitals.
Key characteristics:
- 50 features
- Categorical and numeric variables
- Target variable: `readmitted` (within 30 days, after 30 days, or not at all)
- Removed irrelevant or high-missing columns (`weight`, `payer_code`, `medical_specialty`)
- Filtered invalid or ambiguous entries (e.g., unknown gender, expired patients)
- Consolidated category levels using mappings provided in the dataset
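The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual code. The `"Unknown/Invalid"` gender label, the `discharge_disposition_id` codes treated as "expired", the helper name `clean`, and the binary column `readmit_30d` are all assumptions made for the example.

```python
import pandas as pd

# Toy frame standing in for diabetic_data.csv; values are illustrative only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Unknown/Invalid", "Female"],
    "discharge_disposition_id": [1, 11, 1, 6],
    "weight": ["?", "?", "?", "[75-100)"],
    "payer_code": ["?", "MC", "?", "HM"],
    "medical_specialty": ["?", "Cardiology", "?", "?"],
    "readmitted": ["<30", "NO", ">30", "NO"],
})

# Columns the README lists as irrelevant or mostly missing.
DROP_COLS = ["weight", "payer_code", "medical_specialty"]
# Assumption: these disposition codes mean the patient expired.
EXPIRED_IDS = {11, 19, 20, 21}


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop sparse columns and filter invalid rows."""
    out = frame.drop(columns=DROP_COLS)
    out = out[out["gender"] != "Unknown/Invalid"]
    out = out[~out["discharge_disposition_id"].isin(EXPIRED_IDS)]
    # Binary target: 1 if readmitted within 30 days, else 0.
    out = out.assign(readmit_30d=(out["readmitted"] == "<30").astype(int))
    return out.reset_index(drop=True)


cleaned = clean(df)
print(cleaned)
```

Collapsing the three-level `readmitted` column into a binary flag matches the stated goal of predicting readmission within 30 days; the exact encoding used in the notebook may differ.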
- Assessed feature distributions, class imbalance, and correlations
- Investigated relationships between patient demographics and readmission
- Created new features such as `total_visits`, `is_chronic_patient`, and `num_medication_changes`
- Encoded ordinal/categorical variables appropriately
- Standardized numeric features
- Applied one-hot encoding where necessary
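As a rough sketch of the feature engineering and encoding steps above: the example below derives `total_visits` and then standardizes numeric columns and one-hot encodes a categorical one. The definition of `total_visits` as the sum of outpatient, emergency, and inpatient visit counts is an assumption, as is the choice of columns; the notebook may define these differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names follow the public dataset, values are made up.
df = pd.DataFrame({
    "number_outpatient": [0, 2, 1, 0],
    "number_emergency": [0, 1, 0, 3],
    "number_inpatient": [1, 0, 0, 2],
    "time_in_hospital": [3, 7, 2, 10],
    "gender": ["Male", "Female", "Female", "Male"],
})

# Assumed definition: total prior visits across all three visit types.
visit_cols = ["number_outpatient", "number_emergency", "number_inpatient"]
df["total_visits"] = df[visit_cols].sum(axis=1)

numeric_cols = ["time_in_hospital", "total_visits"]
categorical_cols = ["gender"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)
```

Wrapping both transforms in a single `ColumnTransformer` means the same fitted scaling and encoding can be reused on held-out data with `preprocess.transform(...)`.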
- Used Logistic Regression as a baseline
- Trained advanced models: Decision Tree, Random Forest, and XGBoost
- Applied random oversampling to balance the classes
- Evaluated models using:
- Accuracy
- Recall
- F1-score
- ROC-AUC
- Classification report
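The modeling and evaluation steps above might look something like the sketch below, shown for the Logistic Regression baseline only. It uses synthetic imbalanced data rather than the real records, and it implements random oversampling with `sklearn.utils.resample` on the training split; whether the notebook uses this function or a dedicated library, and whether it resamples before or after splitting, is not stated, so treat both as assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in: roughly 10% positives, mimicking class imbalance.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Randomly oversample the minority class (label 1) in the training set only,
# so the test set keeps its original class distribution.
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(X_train[minority], y_train[minority],
                              replace=True, n_samples=n_majority,
                              random_state=0)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print(metrics)
print(classification_report(y_test, pred))
```

The same split, resampling, and metric code can be reused for the tree-based models by swapping the estimator.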
- Used model-native feature importance metrics (e.g., `.feature_importances_`, `coef_`)
- Identified top predictors of readmission risk
- Visualized feature importance to interpret global model behavior
- Used insights to support real-world decision-making in healthcare operations
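To illustrate the model-native importance metrics mentioned above, here is a small sketch that reads `feature_importances_` from a Random Forest and `coef_` from a Logistic Regression. The data is synthetic and the feature names are attached arbitrarily for readability, so the resulting ranking says nothing about the project's actual findings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative names only; they are not tied to real patient data here.
feature_names = ["num_medications", "time_in_hospital", "age", "total_visits"]
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Tree ensembles expose impurity-based importances that sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_importance = (pd.Series(forest.feature_importances_, index=feature_names)
                 .sort_values(ascending=False))

# Linear models expose signed coefficients (one row for binary targets).
logit = LogisticRegression(max_iter=1000).fit(X, y)
lr_coef = pd.Series(logit.coef_[0], index=feature_names)

print(rf_importance)
print(lr_coef)
```

Calling `rf_importance.plot.barh()` would give a simple bar chart of the ranking. Coefficients are only comparable across features when the inputs share a scale, which is one reason to standardize numeric features first.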
- Language: Python
- Libraries: pandas, NumPy, scikit-learn, XGBoost, matplotlib, seaborn
- Tools: Jupyter Notebook / VSCode
project_directory/
│
├── Data/
│   ├── diabetic_data.csv    # Main dataset containing patient hospital records
│   └── IDS_mapping.csv      # Mapping file for categorical ID fields
├── index.ipynb              # Jupyter Notebook containing the full ML pipeline
├── README.md                # Project overview and documentation (this file)
└── presentation.pdf         # Non-technical presentation
- Features like `num_medications`, `time_in_hospital`, and `age` were among the most predictive
- Patients with more procedures and chronic diagnoses showed higher readmission risk
- Certain discharge types and gender patterns were also statistically significant
This model can be integrated into hospital EHR systems to:
- Flag high-risk patients before discharge
- Optimize care plans and follow-ups
- Reduce costs from preventable readmissions
- Improve patient satisfaction and clinical outcomes
- Deploy model via a Streamlit dashboard
- Explore cost-sensitive learning techniques
- Collect more recent or real-time data from hospitals
- Expand scope to other chronic conditions