abdallah-912amf/GTC_ML_PROJECT_2

🩺 Diabetes Prediction – Machine Learning Project

📌 Overview

This project focuses on analyzing a diabetes dataset to uncover key patterns and relationships that might influence whether a patient is diabetic or not.

Dataset size: 768 records × 9 columns (8 diagnostic features plus the outcome label)


🔎 Phase 1: Become a Data Explorer!

Dive into the dataset and uncover the story within.
In this phase, we explored the following questions:

  • How many patients have diabetes versus those who don’t?

    • Non-Diabetic (0): 500
    • Diabetic (1): 268
  • What’s the relationship between glucose levels and the outcome?

    • Average Glucose (Non-Diabetic): 109.98
    • Average Glucose (Diabetic): 141.26
  • Does BMI play a significant role?

    • Average BMI (Non-Diabetic): 30.30
    • Average BMI (Diabetic): 35.14
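The three questions above come down to a class count and a per-class mean. A minimal sketch with pandas follows; the rows are made up for illustration, and the label column name `Outcome` is an assumption (the real project would load the full 768-row dataset instead).

```python
import pandas as pd

# Tiny made-up sample; the real project loads the full dataset instead.
# "Outcome" is an assumed name for the label column.
df = pd.DataFrame({
    "Glucose": [85, 150, 110, 170, 95, 140],
    "BMI": [26.0, 34.0, 29.5, 38.0, 24.0, 33.0],
    "Outcome": [0, 1, 0, 1, 0, 1],
})

# How many patients fall in each class (0 = non-diabetic, 1 = diabetic)?
class_counts = df["Outcome"].value_counts().sort_index()

# Average Glucose and BMI per class.
class_means = df.groupby("Outcome")[["Glucose", "BMI"]].mean()

print(class_counts)
print(class_means.round(2))
```

On the real data, the same two calls would yield the counts and averages listed above.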

📊 Visuals & Analysis

  • Bar charts comparing diabetic vs. non-diabetic patients.
  • Histograms of Glucose and BMI by outcome.
  • Correlation heatmap to visualize relationships between features.

✨ Key Insights from Phase 1

  • Patients with higher glucose levels and BMI are more likely to have diabetes.
  • No NaN values or duplicate rows were found in the raw dataset; zeros that stand in for missing measurements are handled in Phase 2.
  • Glucose and BMI both correlate positively with the diabetes outcome.

⚙️ Phase 2: Prep Your Data

Cleaning

  • Replaced unrealistic values (0) with NaN.
  • Applied imputation using the median.
  • ✅ Result: No missing values remain.
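The cleaning steps above could look roughly like the sketch below. The README does not say which columns were treated, so the `ZERO_AS_MISSING` list is an assumption (features where a zero cannot be a real measurement), and `clean_zeros` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Assumed list of columns where 0 cannot be a real measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def clean_zeros(df: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Replace placeholder zeros with NaN, then fill each column with its median."""
    cleaned = df.copy()
    # Cast to float first so the columns can hold NaN.
    cleaned[columns] = cleaned[columns].astype(float).replace(0, np.nan)
    # DataFrame.median skips NaN by default, so medians use real values only.
    cleaned[columns] = cleaned[columns].fillna(cleaned[columns].median())
    return cleaned

# Made-up rows for illustration.
sample = pd.DataFrame({
    "Glucose": [120, 0, 100, 140],
    "BloodPressure": [70, 80, 0, 90],
    "SkinThickness": [30, 0, 20, 40],
    "Insulin": [0, 100, 150, 200],
    "BMI": [25.0, 35.0, 0.0, 30.0],
})
cleaned = clean_zeros(sample)
print(cleaned)
```

A stricter variant would compute the medians on the training split only and reuse them for the test split, so that no test-set statistics leak into preprocessing.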

Train/Test Split

  • Training set: 614 samples (~80%)
  • Test set: 154 samples (~20%)
  • Used stratify so that both splits preserve the original class proportions (diabetic vs. non-diabetic).
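A stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The data below is a synthetic stand-in with the same shape and class counts as the dataset, and `random_state=42` is an assumed seed, not a value taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows, 8 features, 500 negatives and 268 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = np.array([0] * 500 + [1] * 268)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% held out for testing
    stratify=y,        # keep the 500:268 class ratio in both splits
    random_state=42,   # assumed seed, only for reproducibility
)

print(X_train.shape, X_test.shape)  # (614, 8) (154, 8)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The two printed fractions are the share of diabetic patients in each split; with `stratify=y` they stay close to the overall 268/768.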

Standardization

  • Applied StandardScaler to all features (mean = 0, std = 1).
  • Ensures all features are on the same scale.
  • 🔑 Especially important for models like Logistic Regression and SVM.

Standardization – Why?

Different features have very different ranges:

  • Glucose: ~ [50 – 200]
  • BMI: ~ [18 – 67]
  • Age: ~ [20 – 80]

⚠️ Models like Logistic Regression and SVM are sensitive to feature scales.
Without scaling, features with larger ranges (e.g., Glucose) would dominate the learning process.

✅ Solution: StandardScaler

  • Transforms all features so that:
    • Mean = 0
    • Standard Deviation = 1
  • This puts all features on the same scale, improving model performance and stability.
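In formula terms, StandardScaler maps each value x to (x − mean) / std, column by column. The sketch below uses made-up Glucose, BMI, and Age values; fitting on the training split only and reusing the fitted scaler on the test split is a common practice assumed here, not something the README spells out.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows; columns are Glucose, BMI, Age.
X_train = np.array([
    [85.0, 22.0, 25.0],
    [120.0, 30.0, 40.0],
    [160.0, 41.0, 63.0],
    [199.0, 35.0, 51.0],
])
X_test = np.array([[140.0, 33.0, 47.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test set

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_train_scaled.std(axis=0).round(6))   # 1 for every column
```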

💡 Note on Imputation:

  • We used the median instead of the mean because the median is less sensitive to outliers,
    making it more robust for skewed data.

Phase 3: Model Building and Evaluation

In this phase, two classification models were trained, tuned with GridSearchCV, and compared on the preprocessed dataset to identify the most effective approach.


1. Logistic Regression

  • Initial Performance

    • Accuracy: 70.77%
    • ROC AUC: 81.29%
  • Hyperparameter Tuning (GridSearchCV)

    • Best Parameters:
      {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
    • Cross-validation ROC AUC: 84.33%
  • Tuned Model Performance (Test Set)

    • Accuracy: 69.48%
    • ROC AUC: 81.27%
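A grid search scored by ROC AUC could be set up along these lines. The README reports only the best parameters, so the grid values, the 5-fold setting, and the synthetic training data below are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid; it only needs to contain the reported best combination.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],  # liblinear supports both l1 and l2 penalties
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",  # select the combination with the best cross-validated ROC AUC
    cv=5,               # assumed number of folds
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 4))
best_log_reg = search.best_estimator_  # refit on the full training data
```

`best_score_` is the mean cross-validated score on the training data, which is why it can sit above the ROC AUC later measured on the held-out test set.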

2. Random Forest

  • Initial Performance

    • Accuracy: 77.92%
    • ROC AUC: 81.91%
  • Hyperparameter Tuning (GridSearchCV)

    • Best Parameters:
      {'max_depth': 10, 'n_estimators': 200}
    • Cross-validation ROC AUC: 82.66%
  • Tuned Model Performance (Test Set)

    • Accuracy: 75.32%
    • ROC AUC: 81.51%
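The two metrics reported for each model can be computed with a small helper such as the one below. `evaluate` is a hypothetical name, and the data is synthetic, so the printed numbers will not match the figures above; the Random Forest uses the reported best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_test, y_test):
    """Return (accuracy, ROC AUC) for a fitted binary classifier."""
    y_pred = model.predict(X_test)               # hard 0/1 labels for accuracy
    y_score = model.predict_proba(X_test)[:, 1]  # probability of class 1 for ROC AUC
    return accuracy_score(y_test, y_pred), roc_auc_score(y_test, y_score)

# Synthetic stand-in for the preprocessed dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=0)
rf.fit(X_train, y_train)

accuracy, auc = evaluate(rf, X_test, y_test)
print(f"Accuracy: {accuracy:.2%}  ROC AUC: {auc:.2%}")
```

Accuracy depends on the 0.5 decision threshold, while ROC AUC measures ranking quality across all thresholds, which is how two models can have similar ROC AUC but different accuracy.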

📊 Conclusion

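  • Both models reached a test ROC AUC of roughly 81% after tuning (Logistic Regression 81.27%, Random Forest 81.51%), so they rank patients about equally well.
  • Random Forest was ahead on test accuracy by nearly six percentage points (75.32% vs. 69.48% after tuning).
  • For both models, tuning did not improve test accuracy or test ROC AUC over the initial fit.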

🚀 Phase 4: Launch Your Prediction Engine

🎯 Objective

Deploy a prediction engine that takes new patient data as input and instantly classifies the individual as Diabetic or Non-Diabetic.


⚙️ Steps Implemented

  1. Model & Scaler Saving

    • Saved the preprocessing scaler (scaler.pkl) and the final model (diabetes_model.pkl).
    • The final deployed model was the Random Forest after hyperparameter tuning (best_rf_clf).
    • Although the pre-tuning Random Forest achieved a slightly higher accuracy (77.92% vs. 75.32%), the tuned version was chosen to ensure consistency with the ML workflow (training → tuning → deployment).
  2. Model & Scaler Loading

    • Reloaded the saved objects to ensure reproducibility and easy deployment.
  3. Prediction Function

    • Implemented predict_diabetes(input_data):
      • Accepts patient data (8 diagnostic features).
      • Applies scaling with the pre-fitted scaler.
      • Returns prediction:
        • 0 → Non-Diabetic
        • 1 → Diabetic
  4. Interactive Interface

    • Built a simple Gradio app to allow user-friendly input and real-time predictions.
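Steps 1 to 3 might be sketched as follows. The file names and the `predict_diabetes` name come from the steps above; using joblib for persistence, the synthetic training data, the temporary output folder, and the `(class, label)` return shape are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Train stand-in objects on synthetic data (8 features, like the real inputs).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(scaler.transform(X), y)

# Save and reload; a temporary folder stands in for the repository's own paths.
out_dir = Path(tempfile.mkdtemp())
joblib.dump(scaler, out_dir / "scaler.pkl")
joblib.dump(model, out_dir / "diabetes_model.pkl")
scaler = joblib.load(out_dir / "scaler.pkl")
model = joblib.load(out_dir / "diabetes_model.pkl")

LABELS = {0: "Non-Diabetic", 1: "Diabetic"}

def predict_diabetes(input_data):
    """Scale one patient's 8 feature values and return (class, label)."""
    row = np.asarray(input_data, dtype=float).reshape(1, -1)
    if row.shape[1] != 8:
        raise ValueError("expected 8 feature values")
    prediction = int(model.predict(scaler.transform(row))[0])
    return prediction, LABELS[prediction]

print(predict_diabetes(X[0]))
```

Reusing the saved scaler matters: new inputs must be transformed with the same mean and standard deviation the model saw during training.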

📊 Results

  • The prediction engine works correctly for unseen patient data.
  • The integration with Gradio provides a clean, accessible interface for end-users.
  • This phase completes the full ML workflow: from EDA → Preprocessing → Model Training → Hyperparameter Tuning → Deployment.

🛠️ How to Use

  1. Clone the Repository

    git clone https://github.com/abdallah-912amf/GTC_ML_PROJECT_2.git
    cd GTC_ML_PROJECT_2
  2. Install Dependencies

    Make sure you have Python 3.10+ installed, then run:

    pip install -r requirements.txt
  3. Run the App

    python app/gradio_app.py
  4. Open in Browser

    After running the command, Gradio will show a local URL (e.g. http://127.0.0.1:7860).
    Open it in your browser to access the Diabetes Prediction App.
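For orientation, a Gradio script of this kind might be wired up roughly as below. This is not the contents of app/gradio_app.py: `classify` is a placeholder with a toy rule standing in for the saved scaler and model, and `build_app` is a hypothetical helper name.

```python
FEATURES = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

def classify(*values):
    """Placeholder: a real app would call the saved scaler and model here."""
    if len(values) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values")
    glucose = values[FEATURES.index("Glucose")]
    # Toy rule purely for demonstration, NOT the project's model.
    return "Diabetic" if glucose >= 140 else "Non-Diabetic"

def build_app():
    """Create one numeric input per feature and a text output."""
    import gradio as gr  # imported here so the module loads without Gradio installed
    return gr.Interface(
        fn=classify,
        inputs=[gr.Number(label=name) for name in FEATURES],
        outputs=gr.Textbox(label="Prediction"),
        title="Diabetes Prediction",
    )

# To serve the app on a local URL, call: build_app().launch()
```

Gradio passes the input values to `fn` positionally in the order of `inputs`, so the order of `FEATURES` must match the column order used during training.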


🧪 Example

You can try the app with the following patient data:

  • Patient A (expected Non-Diabetic):

    Pregnancies: 2
    Glucose: 120
    BloodPressure: 70
    SkinThickness: 30
    Insulin: 100
    BMI: 25.0
    DiabetesPedigreeFunction: 0.5
    Age: 30
    
  • Patient B (expected Diabetic):

    Pregnancies: 6
    Glucose: 160
    BloodPressure: 80
    SkinThickness: 35
    Insulin: 150
    BMI: 35.0
    DiabetesPedigreeFunction: 0.9
    Age: 50
    

The app will instantly classify the patient as Diabetic or Non-Diabetic.

Conclusion

This project demonstrated the complete machine learning workflow for a healthcare use case (diabetes prediction):

  • Phase 1 (EDA): Identified key factors such as glucose levels and BMI strongly correlated with diabetes.
  • Phase 2 (Preprocessing): Cleaned and standardized the dataset to ensure reliable model training.
  • Phase 3 (Modeling): Compared Logistic Regression and Random Forest models, with Random Forest the stronger candidate on test accuracy.
  • Phase 4 (Deployment): Built a prediction engine with a user-friendly interface using Gradio.

The project highlights the importance of data preparation, model comparison, and deployment in delivering an end-to-end ML solution.
