
Predicting likelihood of getting Diabetes and various parameters using Logistic regression and Decision tree regressor


SamitUttarkar/Diabates-Prediction


🏥_Diabetes_Prediction

Task

  • Explain the issues in the data through data cleaning and preparation
  • Predict whether the patient will develop diabetes from a three-or-more-children indicator (whether the mother has more than three children), then calculate the probability of developing diabetes given that the mother has more than three children, and vice versa
  • Predict whether the patient will develop diabetes from multiple parameters, choosing the best model and using the ToPredict dataset as the testing dataset

Data Preparation

The first task is to analyse the data and perform some data cleaning steps.

Screenshot 2023-01-28 at 8 02 20 PM

  • The figure above shows many zeros in the Insulin and Skinthickness columns. A value of zero is not possible for either measurement, so I decided to drop these columns, as they would not be useful in the prediction process

  • Missing values in the remaining columns are replaced with the column median
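
The two cleaning steps above can be sketched without any dependencies as follows. This is a minimal illustration, not the notebook's actual code: the rows are made up, the column names `Glucose`, `BMI` and `SkinThickness` are assumed spellings, and treating `0` or `None` as "missing" is an assumption about how gaps are encoded.

```python
from statistics import median

# Toy rows; the values are made up for illustration, not taken from the dataset.
rows = [
    {"Glucose": 148, "BMI": 33.6, "Insulin": 0, "SkinThickness": 35, "Outcome": 1},
    {"Glucose": 0, "BMI": 26.6, "Insulin": 0, "SkinThickness": 29, "Outcome": 0},
    {"Glucose": 183, "BMI": 0, "Insulin": 0, "SkinThickness": 0, "Outcome": 1},
    {"Glucose": 89, "BMI": 28.1, "Insulin": 94, "SkinThickness": 23, "Outcome": 0},
]

DROP = ("Insulin", "SkinThickness")  # columns dominated by zeros
IMPUTE = ("Glucose", "BMI")          # assumption: 0 or None means "missing" here


def clean(rows):
    """Drop the unusable columns, then median-impute the remaining ones."""
    kept = [{k: v for k, v in r.items() if k not in DROP} for r in rows]
    for col in IMPUTE:
        # The median is computed from the observed (non-missing) values only.
        observed = [r[col] for r in kept if r[col] not in (0, None)]
        fill = median(observed)
        for r in kept:
            if r[col] in (0, None):
                r[col] = fill
    return kept


cleaned = clean(rows)
```

The median is used rather than the mean because it is less sensitive to the extreme values that clinical measurements often contain.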

EDA

First, let's check for the correlation between the parameters

Screenshot 2023-01-28 at 8 12 25 PM

The correlation between any two parameters does not exceed 0.7, which is good: the predictors are not strongly collinear, and that helps in predicting.
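
A check of this kind can be sketched as below: compute the Pearson correlation for every pair of columns and list the pairs whose absolute value exceeds 0.7. The column names `x1`, `x2`, `x3` and their values are hypothetical placeholders, and `high_pairs` is a helper name introduced only for this illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def high_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical columns: x2 is an exact multiple of x1, x3 is unrelated to both.
columns = {
    "x1": [1, 2, 3, 4],
    "x2": [2, 4, 6, 8],
    "x3": [1, -1, -1, 1],
}
flagged = high_pairs(columns)
```

An empty result on the real data would correspond to the situation described above, where no pair crosses the 0.7 threshold.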

Now, for the bivariate analysis

Screenshot 2023-01-28 at 8 10 57 PM

From the pairplot, it is difficult to separate the two classes based on the scatterplots alone.

Machine Learning

Predicting using Three or more kids parameter

In the first step, we create a column called threeormore, which indicates whether the patient has more than three children or not. We then fit a logistic regression model and calculate the probability from it.

Screenshot 2023-01-28 at 8 16 55 PM

Bayes' rule was used to calculate the probability for this step.
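
As a rough illustration of the Bayes' rule step, the sketch below computes P(diabetes | threeormore) as P(threeormore | diabetes) · P(diabetes) / P(threeormore) from raw counts. The records are made up, and `bayes_posterior` is a hypothetical helper, not a function from the notebook. With a single binary predictor, an unregularised logistic regression should give the same probabilities as these group frequencies.

```python
# Each record is (threeormore, outcome); the data is made up for illustration.
records = (
    [(1, 1)] * 2    # indicator = 1, diabetic
    + [(1, 0)] * 2  # indicator = 1, not diabetic
    + [(0, 1)] * 1  # indicator = 0, diabetic
    + [(0, 0)] * 5  # indicator = 0, not diabetic
)


def bayes_posterior(records, given):
    """P(outcome = 1 | threeormore = given), computed via Bayes' rule."""
    n = len(records)
    n_diabetic = sum(outcome for _, outcome in records)
    p_d = n_diabetic / n                                # prior P(D)
    p_t = sum(1 for t, _ in records if t == given) / n  # evidence P(T)
    p_t_given_d = (
        sum(1 for t, outcome in records if outcome == 1 and t == given)
        / n_diabetic
    )                                                   # likelihood P(T | D)
    return p_t_given_d * p_d / p_t


p_with = bayes_posterior(records, given=1)     # indicator set
p_without = bayes_posterior(records, given=0)  # indicator not set
```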

Predicting using multiple parameters

First, we need to check which model performs best.

Screenshot 2023-01-28 at 8 19 17 PM

We can see that logistic regression performs better than DecisionTreeRegressor, so we use it for the remaining predictions.

Confusion Matrix using all the parameters for training

Screenshot 2023-01-28 at 8 20 17 PM
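
For readers who cannot see the screenshot, a confusion matrix for a binary classifier can be built as in this minimal sketch. The labels and predictions are made up and do not reproduce the results shown in the figure. The layout `[[TN, FP], [FN, TP]]` is an assumed convention, with rows for the actual class and columns for the predicted class.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def accuracy(matrix):
    """Share of correct predictions: (TN + TP) / total."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
```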

Feature Selection

Screenshot 2023-01-28 at 8 22 50 PM

Screenshot 2023-01-28 at 8 22 59 PM

Final Probability of getting Diabetes

Screenshot 2023-01-28 at 8 24 13 PM

About the dataset

The data used to create the dataset PimaDiabetes.cv, which is used in the coursework, was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases in the United States. It includes a 0/1 variable, Outcome, which indicates whether the subject ultimately tested positive for diabetes, along with a list of numerous diagnostic measures recorded from 750 women.
