This project focuses on analyzing a diabetes dataset to uncover key patterns and relationships that might influence whether a patient is diabetic or not.
Dataset size: 768 records × 9 features
Dive into the dataset and uncover the story within.
In this phase, we explored the following questions:
- How many patients have diabetes versus those who don’t?
  - Non-Diabetic (0): 500
  - Diabetic (1): 268
- What’s the relationship between glucose levels and the outcome?
  - Average Glucose (Non-Diabetic): 109.98
  - Average Glucose (Diabetic): 141.26
- Does BMI play a significant role?
  - Average BMI (Non-Diabetic): 30.30
  - Average BMI (Diabetic): 35.14
- Bar charts comparing diabetic vs. non-diabetic patients.
- Histograms of Glucose and BMI by outcome.
- Correlation heatmap to visualize relationships between features.
- Patients with higher glucose levels and BMI are more likely to have diabetes.
- No missing values or duplicate rows were found in the dataset.
- Strong correlations exist between glucose, BMI, and diabetes outcome.
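The class counts, group averages, and charts above can be reproduced roughly as follows. This is a minimal sketch, assuming the dataset is stored as `diabetes.csv` with the standard Pima column names (`Outcome`, `Glucose`, `BMI`, ...); it is not the project's exact EDA code.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # file name assumed

# Class balance: non-diabetic (0) vs. diabetic (1)
print(df["Outcome"].value_counts())

# Average Glucose and BMI per outcome group
print(df.groupby("Outcome")[["Glucose", "BMI"]].mean().round(2))

# Bar chart of the class counts
df["Outcome"].value_counts().plot(kind="bar", title="Non-Diabetic (0) vs. Diabetic (1)")
plt.show()

# Histograms of Glucose and BMI, split by outcome
for col in ["Glucose", "BMI"]:
    sns.histplot(data=df, x=col, hue="Outcome", kde=True)
    plt.title(f"{col} distribution by outcome")
    plt.show()

# Correlation heatmap across all features
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlation heatmap")
plt.show()
```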
- Replaced unrealistic values (0) with NaN.
- Applied imputation using the median.
- ✅ Result: No missing values remain.
- Training set: 614 samples (~80%)
- Test set: 154 samples (~20%)
- Used stratify to keep class distribution balanced (diabetic vs. non-diabetic).
- Applied StandardScaler to all features (mean = 0, std = 1).
- Ensures all features are on the same scale.
- 🔑 Especially important for models like Logistic Regression and SVM.
Different features have very different ranges:
- Glucose: ~ [50 – 200]
- BMI: ~ [18 – 67]
- Age: ~ [20 – 80]
Without scaling, features with larger ranges (e.g., Glucose) would dominate the learning process.
✅ Solution: StandardScaler
- Transforms all features so that:
  - Mean = 0
  - Standard Deviation = 1
- This puts all features on the same scale, improving model performance and stability.
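A minimal sketch of the stratified split and scaling steps, assuming the cleaned data lives in a DataFrame `df` with an `Outcome` target column (variable names and `random_state` are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Stratified 80/20 split keeps the diabetic/non-diabetic ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```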
💡 Note on Imputation:
- We used the median instead of the mean because the median is less sensitive to outliers, making it more robust for skewed data.
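A sketch of the zero-replacement and median imputation described above; the list of columns where zero is treated as missing is an assumption, since the write-up does not name them:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # file name assumed

# Columns where a value of 0 is not physiologically plausible (assumed list)
zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace unrealistic zeros with NaN, then fill each column with its median
df[zero_invalid] = df[zero_invalid].replace(0, np.nan)
df[zero_invalid] = df[zero_invalid].fillna(df[zero_invalid].median())

print(df.isna().sum().sum())  # 0 -> no missing values remain
```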
In this phase, two classification models (Logistic Regression and Random Forest) were built, tuned, and evaluated on the preprocessed dataset to identify the most effective approach. A code sketch of the tuning step follows the results below.
- Logistic Regression
  - Initial Performance
    - Accuracy: 70.77%
    - ROC AUC: 81.29%
  - Hyperparameter Tuning (GridSearchCV)
    - Best Parameters: `{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}`
    - Cross-validation ROC AUC: 84.33%
  - Tuned Model Performance (Test Set)
    - Accuracy: 69.48%
    - ROC AUC: 81.27%
- Random Forest
  - Initial Performance
    - Accuracy: 77.92%
    - ROC AUC: 81.91%
  - Hyperparameter Tuning (GridSearchCV)
    - Best Parameters: `{'max_depth': 10, 'n_estimators': 200}`
    - Cross-validation ROC AUC: 82.66%
  - Tuned Model Performance (Test Set)
    - Accuracy: 75.32%
    - ROC AUC: 81.51%
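As referenced above, here is a hedged sketch of the GridSearchCV tuning and test-set evaluation for both models. It reuses `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` from the preprocessing sketch; the candidate grids, `cv=5`, and `random_state` are assumptions around the reported best parameters, not the project's exact settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score

# Logistic Regression grid (assumed); reported best: C=1, penalty='l2', solver='liblinear'
grid_lr = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]},
    scoring="roc_auc",
    cv=5,
)
grid_lr.fit(X_train_scaled, y_train)

# Random Forest grid (assumed); reported best: max_depth=10, n_estimators=200
grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 200, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",
    cv=5,
)
grid_rf.fit(X_train_scaled, y_train)

# Evaluate each tuned model on the held-out test set
for name, grid in [("Logistic Regression", grid_lr), ("Random Forest", grid_rf)]:
    model = grid.best_estimator_
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    print(name, grid.best_params_)
    print("  Test accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("  Test ROC AUC :", round(roc_auc_score(y_test, y_proba), 4))

best_rf_clf = grid_rf.best_estimator_  # carried forward to the deployment phase
```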
- Both the Logistic Regression and Random Forest models demonstrated strong predictive capabilities.
- Random Forest showed slightly better accuracy, while Logistic Regression maintained competitive ROC AUC results after tuning.
Deploy a prediction engine that takes new patient data as input and instantly classifies the individual as Diabetic or Non-Diabetic.
- Model & Scaler Saving
  - Saved the preprocessing scaler (`scaler.pkl`) and the final model (`diabetes_model.pkl`).
  - The final deployed model was the Random Forest after hyperparameter tuning (`best_rf_clf`).
  - Although the pre-tuning Random Forest achieved a slightly higher accuracy (77.92% vs. 75.32%), the tuned version was chosen to ensure consistency with the ML workflow (training → tuning → deployment).
- Model & Scaler Loading
  - Reloaded the saved objects to ensure reproducibility and easy deployment.
- Prediction Function
  - Implemented `predict_diabetes(input_data)`:
    - Accepts patient data (8 diagnostic features).
    - Applies scaling with the pre-fitted scaler.
    - Returns prediction: `0` → Non-Diabetic, `1` → Diabetic
- Interactive Interface
  - Built a simple Gradio app to allow user-friendly input and real-time predictions (a combined sketch of the saving, prediction, and interface code follows this list).
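A condensed sketch of the steps above. The artifact names (`scaler.pkl`, `diabetes_model.pkl`, `best_rf_clf`) match those listed in the project; the function body, feature order, and Gradio layout are illustrative assumptions rather than the repository's exact code, and `scaler`/`best_rf_clf` are assumed to come from the earlier training sketch.

```python
import joblib
import numpy as np
import gradio as gr

# --- Saving: persist the fitted scaler and the tuned Random Forest ---
joblib.dump(scaler, "scaler.pkl")
joblib.dump(best_rf_clf, "diabetes_model.pkl")

# --- Loading: reload the artifacts for deployment ---
scaler = joblib.load("scaler.pkl")
model = joblib.load("diabetes_model.pkl")

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_diabetes(input_data):
    """input_data: sequence of the 8 diagnostic features, in FEATURES order."""
    x = np.array(input_data, dtype=float).reshape(1, -1)
    pred = model.predict(scaler.transform(x))[0]
    return "Diabetic" if pred == 1 else "Non-Diabetic"

# --- Interface: one numeric field per feature, classified in real time ---
demo = gr.Interface(
    fn=lambda *values: predict_diabetes(values),
    inputs=[gr.Number(label=name) for name in FEATURES],
    outputs=gr.Label(label="Prediction"),
    title="Diabetes Prediction App",
)

if __name__ == "__main__":
    demo.launch()
```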
- The prediction engine works correctly for unseen patient data.
- The integration with Gradio provides a clean, accessible interface for end-users.
- This phase completes the full ML workflow: from EDA → Preprocessing → Model Training → Hyperparameter Tuning → Deployment.
- Clone the Repository
  git clone https://github.com/abdallah-912amf/GTC_ML_PROJECT_2.git
  cd GTC_ML_PROJECT_2
- Install Dependencies
  Make sure you have Python 3.10+ installed, then run:
  pip install -r requirements.txt
- Run the App
  python app/gradio_app.py
- Open in Browser
  After running the command, Gradio will show a local URL (e.g. http://127.0.0.1:7860). Open it in your browser to access the Diabetes Prediction App.
You can try the app with the following patient data:
- Patient A (expected Non-Diabetic):
  Pregnancies: 2, Glucose: 120, BloodPressure: 70, SkinThickness: 30, Insulin: 100, BMI: 25.0, DiabetesPedigreeFunction: 0.5, Age: 30
- Patient B (expected Diabetic):
  Pregnancies: 6, Glucose: 160, BloodPressure: 80, SkinThickness: 35, Insulin: 150, BMI: 35.0, DiabetesPedigreeFunction: 0.9, Age: 50
The app will instantly classify the patient as Diabetic or Non-Diabetic.
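For a quick check outside the web UI, the same patients can be passed to `predict_diabetes` directly; this assumes the list-of-eight-features input format used in the deployment sketch above.

```python
patient_a = [2, 120, 70, 30, 100, 25.0, 0.5, 30]   # expected Non-Diabetic
patient_b = [6, 160, 80, 35, 150, 35.0, 0.9, 50]   # expected Diabetic

print(predict_diabetes(patient_a))
print(predict_diabetes(patient_b))
```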
This project demonstrated the complete machine learning workflow for a healthcare use case (diabetes prediction):
- Phase 1 (EDA): Identified key factors such as glucose levels and BMI strongly correlated with diabetes.
- Phase 2 (Preprocessing): Cleaned and standardized the dataset to ensure reliable model training.
- Phase 3 (Modeling): Compared Logistic Regression and Random Forest models, showing Random Forest as the stronger candidate.
- Phase 4 (Deployment): Built a prediction engine with a user-friendly interface using Gradio.
Overall, the project highlights the importance of data preparation, model comparison, and deployment in delivering an end-to-end ML solution.