This project focuses on analyzing a diabetes dataset to uncover key patterns and relationships that might influence whether a patient is diabetic or not.
Dataset size: 768 records × 9 features
Dive into the dataset and uncover the story within.
In this phase, we explored the following questions:
- How many patients have diabetes versus those who don’t?
  - Non-Diabetic (0): 500
  - Diabetic (1): 268
- What’s the relationship between glucose levels and the outcome?
  - Average Glucose (Non-Diabetic): 109.98
  - Average Glucose (Diabetic): 141.26
- Does BMI play a significant role?
  - Average BMI (Non-Diabetic): 30.30
  - Average BMI (Diabetic): 35.14
- Bar charts comparing diabetic vs. non-diabetic patients.
- Histograms of Glucose and BMI by outcome.
- Correlation heatmap to visualize relationships between features.
- Patients with higher glucose levels and BMI are more likely to have diabetes.
- No missing values or duplicate rows were found in the dataset.
- Strong correlations exist between glucose, BMI, and diabetes outcome.
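The class counts, group averages, and charts above can be reproduced roughly as follows. This is a minimal sketch, assuming the dataset is stored as `diabetes.csv` with the standard Pima column names (`Outcome`, `Glucose`, `BMI`, ...); it is not the project's exact EDA code.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("diabetes.csv")  # file name assumed

# Class balance: non-diabetic (0) vs. diabetic (1)
print(df["Outcome"].value_counts())

# Average Glucose and BMI per outcome group
print(df.groupby("Outcome")[["Glucose", "BMI"]].mean().round(2))

# Bar chart of the class counts
df["Outcome"].value_counts().plot(kind="bar", title="Non-Diabetic (0) vs. Diabetic (1)")
plt.show()

# Histograms of Glucose and BMI, split by outcome
for col in ["Glucose", "BMI"]:
    sns.histplot(data=df, x=col, hue="Outcome", kde=True)
    plt.title(f"{col} distribution by outcome")
    plt.show()

# Correlation heatmap across all features
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature correlation heatmap")
plt.show()
```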
- Replaced unrealistic values (0) with NaN.
- Applied imputation using the median.
- ✅ Result: No missing values remain.
- Training set: 614 samples (~80%)
- Test set: 154 samples (~20%)
- Used stratify to keep class distribution balanced (diabetic vs. non-diabetic).
- Applied StandardScaler to all features (mean = 0, std = 1).
- Ensures all features are on the same scale.
- 🔑 Especially important for models like Logistic Regression and SVM.
Different features have very different ranges:
- Glucose: ~ [50 – 200]
- BMI: ~ [18 – 67]
- Age: ~ [20 – 80]
Without scaling, features with larger ranges (e.g., Glucose) would dominate the learning process.
✅ Solution: StandardScaler
- Transforms all features so that:
  - Mean = 0
  - Standard Deviation = 1
- This puts all features on the same scale, improving model performance and stability.
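A minimal sketch of the stratified split and scaling steps, assuming the cleaned data lives in a DataFrame `df` with an `Outcome` target column (variable names and `random_state` are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="Outcome")
y = df["Outcome"]

# Stratified 80/20 split keeps the diabetic/non-diabetic ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```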
💡 Note on Imputation:
- We used the median instead of the mean because the median is less sensitive to outliers, making it more robust for skewed data.
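A sketch of the zero-replacement and median imputation described above; the list of columns where zero is treated as missing is an assumption, since the write-up does not name them:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")  # file name assumed

# Columns where a value of 0 is not physiologically plausible (assumed list)
zero_invalid = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace unrealistic zeros with NaN, then fill each column with its median
df[zero_invalid] = df[zero_invalid].replace(0, np.nan)
df[zero_invalid] = df[zero_invalid].fillna(df[zero_invalid].median())

print(df.isna().sum().sum())  # 0 -> no missing values remain
```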
In this phase, two classification models (Logistic Regression and Random Forest) were built, tuned, and evaluated on the preprocessed dataset to identify the most effective approach. A code sketch of the tuning step follows the results below.
- Logistic Regression
  - Initial Performance
    - Accuracy: 70.77%
    - ROC AUC: 81.29%
  - Hyperparameter Tuning (GridSearchCV)
    - Best Parameters: `{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}`
    - Cross-validation ROC AUC: 84.33%
  - Tuned Model Performance (Test Set)
    - Accuracy: 69.48%
    - ROC AUC: 81.27%
- Random Forest
  - Initial Performance
    - Accuracy: 77.92%
    - ROC AUC: 81.91%
  - Hyperparameter Tuning (GridSearchCV)
    - Best Parameters: `{'max_depth': 10, 'n_estimators': 200}`
    - Cross-validation ROC AUC: 82.66%
  - Tuned Model Performance (Test Set)
    - Accuracy: 75.32%
    - ROC AUC: 81.51%
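As referenced above, here is a hedged sketch of the GridSearchCV tuning and test-set evaluation for both models. It reuses `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` from the preprocessing sketch; the candidate grids, `cv=5`, and `random_state` are assumptions around the reported best parameters, not the project's exact settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score

# Logistic Regression grid (assumed); reported best: C=1, penalty='l2', solver='liblinear'
grid_lr = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"], "solver": ["liblinear"]},
    scoring="roc_auc",
    cv=5,
)
grid_lr.fit(X_train_scaled, y_train)

# Random Forest grid (assumed); reported best: max_depth=10, n_estimators=200
grid_rf = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 200, 300], "max_depth": [5, 10, None]},
    scoring="roc_auc",
    cv=5,
)
grid_rf.fit(X_train_scaled, y_train)

# Evaluate each tuned model on the held-out test set
for name, grid in [("Logistic Regression", grid_lr), ("Random Forest", grid_rf)]:
    model = grid.best_estimator_
    y_pred = model.predict(X_test_scaled)
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    print(name, grid.best_params_)
    print("  Test accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("  Test ROC AUC :", round(roc_auc_score(y_test, y_proba), 4))

best_rf_clf = grid_rf.best_estimator_  # carried forward to the deployment phase
```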
- Both the Logistic Regression and Random Forest models demonstrated strong predictive capabilities.
- Random Forest showed slightly better accuracy, while Logistic Regression maintained competitive ROC AUC results after tuning.
Deploy a prediction engine that takes new patient data as input and instantly classifies the individual as Diabetic or Non-Diabetic.
- Model & Scaler Saving
  - Saved the preprocessing scaler (`scaler.pkl`) and the final model (`diabetes_model.pkl`).
  - The final deployed model was the Random Forest after hyperparameter tuning (`best_rf_clf`).
  - Although the pre-tuning Random Forest achieved a slightly higher accuracy (77.92% vs. 75.32%), the tuned version was chosen to ensure consistency with the ML workflow (training → tuning → deployment).
- Model & Scaler Loading
  - Reloaded the saved objects to ensure reproducibility and easy deployment.
- Prediction Function
  - Implemented `predict_diabetes(input_data)`:
    - Accepts patient data (8 diagnostic features).
    - Applies scaling with the pre-fitted scaler.
    - Returns prediction: `0` → Non-Diabetic, `1` → Diabetic
- Interactive Interface
  - Built a simple Gradio app to allow user-friendly input and real-time predictions (a combined sketch of the saving, prediction, and interface code follows this list).
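A condensed sketch of the steps above. The artifact names (`scaler.pkl`, `diabetes_model.pkl`, `best_rf_clf`) match those listed in the project; the function body, feature order, and Gradio layout are illustrative assumptions rather than the repository's exact code, and `scaler`/`best_rf_clf` are assumed to come from the earlier training sketch.

```python
import joblib
import numpy as np
import gradio as gr

# --- Saving: persist the fitted scaler and the tuned Random Forest ---
joblib.dump(scaler, "scaler.pkl")
joblib.dump(best_rf_clf, "diabetes_model.pkl")

# --- Loading: reload the artifacts for deployment ---
scaler = joblib.load("scaler.pkl")
model = joblib.load("diabetes_model.pkl")

FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_diabetes(input_data):
    """input_data: sequence of the 8 diagnostic features, in FEATURES order."""
    x = np.array(input_data, dtype=float).reshape(1, -1)
    pred = model.predict(scaler.transform(x))[0]
    return "Diabetic" if pred == 1 else "Non-Diabetic"

# --- Interface: one numeric field per feature, classified in real time ---
demo = gr.Interface(
    fn=lambda *values: predict_diabetes(values),
    inputs=[gr.Number(label=name) for name in FEATURES],
    outputs=gr.Label(label="Prediction"),
    title="Diabetes Prediction App",
)

if __name__ == "__main__":
    demo.launch()
```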
- The prediction engine works correctly for unseen patient data.
- The integration with Gradio provides a clean, accessible interface for end-users.
- This phase completes the full ML workflow: from EDA → Preprocessing → Model Training → Hyperparameter Tuning → Deployment.
- Clone the Repository
  git clone https://github.com/abdallah-912amf/GTC_ML_PROJECT_2.git
  cd GTC_ML_PROJECT_2
- Install Dependencies
  Make sure you have Python 3.10+ installed, then run:
  pip install -r requirements.txt
- Run the App
  python app/gradio_app.py
- Open in Browser
  After running the command, Gradio will show a local URL (e.g. http://127.0.0.1:7860). Open it in your browser to access the Diabetes Prediction App.
You can try the app with the following patient data:
- Patient A (expected Non-Diabetic):
  Pregnancies: 2, Glucose: 120, BloodPressure: 70, SkinThickness: 30, Insulin: 100, BMI: 25.0, DiabetesPedigreeFunction: 0.5, Age: 30
- Patient B (expected Diabetic):
  Pregnancies: 6, Glucose: 160, BloodPressure: 80, SkinThickness: 35, Insulin: 150, BMI: 35.0, DiabetesPedigreeFunction: 0.9, Age: 50
The app will instantly classify the patient as Diabetic or Non-Diabetic.
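For a quick check outside the web UI, the same patients can be passed to `predict_diabetes` directly; this assumes the list-of-eight-features input format used in the deployment sketch above.

```python
patient_a = [2, 120, 70, 30, 100, 25.0, 0.5, 30]   # expected Non-Diabetic
patient_b = [6, 160, 80, 35, 150, 35.0, 0.9, 50]   # expected Diabetic

print(predict_diabetes(patient_a))
print(predict_diabetes(patient_b))
```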
This project demonstrated the complete machine learning workflow for a healthcare use case (diabetes prediction):
- Phase 1 (EDA): Identified key factors such as glucose levels and BMI strongly correlated with diabetes.
- Phase 2 (Preprocessing): Cleaned and standardized the dataset to ensure reliable model training.
- Phase 3 (Modeling): Compared Logistic Regression and Random Forest models, showing Random Forest as the stronger candidate.
- Phase 4 (Deployment): Built a prediction engine with a user-friendly interface using Gradio.
Overall, the project highlights the importance of data preparation, model comparison, and deployment in delivering an end-to-end ML solution.